ML HW-SW Codesign

Author

Chao Ma

Published

March 9, 2026

Notes on efficient AI systems where model design, compression, and hardware architecture are developed together.

Efficient AI Lecture 18: Diffusion Models

Diffusion models from forward noising, denoising training, conditional generation, latent diffusion, image editing, and personalization to DDIM, distillation, sparsity, quantization, and distributed sampling.

Efficient AI Lecture 17: Efficient GANs, Video, and Point Cloud

Application-specific efficient AI for GANs, video, and point clouds: GAN Compression, AnyCost GAN, Differentiable Augmentation, TSM, PVCNN/SPVCNN, and BEVFusion.

Efficient AI Lecture 16: Vision Transformer

Vision Transformers from patch tokenization and window attention to sparse and linear attention, EfficientViT, self-supervised ViT training, masked autoencoders, and autoregressive image generation.

Efficient AI Lecture 15: Long-Context LLM

Long-context LLMs from RoPE interpolation and LongLoRA to long-context evaluation, StreamingLLM attention sinks, DuoAttention, Quest query-aware sparsity, and Mamba state-space models.

Efficient AI Lecture 14: LLM Post-Training

LLM post-training from SFT, RLHF, and DPO to PEFT methods such as BitFit, adapters, prompt tuning, prefix tuning, LoRA, QLoRA, and BitDelta, followed by multimodal LLMs, prompting, chain-of-thought, and RAG.

Efficient AI Lecture 13: LLM Deployment Techniques

LLM serving techniques from SmoothQuant and AWQ to INT4 kernels, activation-aware pruning, MoE, PagedAttention, FlashAttention, speculative decoding, and batching.

Efficient AI Lecture 12: Transformer and LLM

Transformer and LLM design from tokenization, embeddings, attention, masking, FFNs, and positional encodings to encoder/decoder variants, KV-cache optimization, grouped-query attention, modern LLM architectures, and multimodal extensions.

Efficient AI Lecture 11: TinyEngine

TinyEngine makes neural network inference practical on microcontrollers through memory-aware kernels, loop locality, SIMD-aware execution, im2col avoidance, in-place depth-wise convolution, and layout choices such as NHWC.

Efficient AI Lecture 10: MCUNet and TinyML

TinyML under microcontroller memory limits: TinyNAS search-space specialization, Flash and SRAM constraints, CNN activation bottlenecks, patch-based inference, and network redistribution.

Efficient AI Lecture 9: Knowledge Distillation

How a compact student model learns from a larger teacher through soft targets, temperature, intermediate features, self-distillation, online distillation, and task-specific distillation.

Efficient AI Lecture 8: Neural Architecture Search (Part II)

Accuracy estimation, weight inheritance, hypernetworks, ProxylessNAS, Once-for-All networks, zero-shot NAS, and joint neural network, mapping, and accelerator search.

Efficient AI Lecture 7: Neural Architecture Search (Part I)

Classic efficient building blocks, cell-level NAS search spaces, elastic scaling dimensions, and the main architecture-search strategies from grid search to RL, differentiable search, and evolution.

Efficient AI Lecture 6: Quantization (Part II)

Post-training quantization granularity, clipping and calibration, AdaRound, QAT with STE, and binary/ternary quantization methods for pushing precision lower without losing control.

Efficient AI Lecture 5: Quantization (Part I)

Why low-bit arithmetic saves energy, how numeric formats trade off range and precision, and how K-means and linear quantization connect compression to hardware-friendly integer compute.

Efficient AI Lecture 4: Pruning and Sparsity (Part II)

Layer-wise pruning ratios, automatic pruning with AMC and NetAdapt, fine-tuning after pruning, and the hardware systems that turn sparsity into real speed and energy gains.

Efficient AI Lecture 3: Pruning and Sparsity (Part I)

Why memory dominates energy, how pruning is formulated with an L0 constraint, the hardware tradeoff between unstructured and structured sparsity, and the main pruning criteria from magnitude to second-order and regression-based methods.

Efficient AI Lecture 1: Introduction

Why efficient AI needs both algorithmic compression and hardware specialization: Deep Compression, EIE, MCUNetV3, efficient LMs, and the hardware trends driving co-design.
