ML HW-SW Codesign

Author: Chao Ma
Published: March 9, 2026

Notes on efficient AI systems where model design, compression, and hardware architecture are developed together.


Efficient AI Lecture 13: LLM Deployment Techniques
LLM serving techniques from SmoothQuant and AWQ to INT4 kernels, activation-aware pruning, MoE, PagedAttention, FlashAttention, speculative decoding, and batching.
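
Of these, speculative decoding is the easiest to miniaturize: a cheap draft model proposes several tokens and the target model verifies them. A minimal sketch of the greedy variant, with toy bigram matrices standing in for real models (everything here is illustrative, not a serving implementation):

```python
# Minimal sketch of greedy speculative decoding; toy bigram matrices
# stand in for the draft and target LLMs.
import numpy as np

VOCAB = 16
rng = np.random.default_rng(0)
W_draft = rng.standard_normal((VOCAB, VOCAB))
W_target = W_draft + 0.1 * rng.standard_normal((VOCAB, VOCAB))  # similar models

def next_token(W, ctx):
    # Greedy next-token choice from a toy bigram "model".
    return int(np.argmax(W[ctx[-1]]))

def speculative_step(ctx, k=4):
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal, draft_ctx = [], list(ctx)
    for _ in range(k):
        t = next_token(W_draft, draft_ctx)
        proposal.append(t)
        draft_ctx.append(t)
    # 2) The target model checks every proposed position (in a real system
    #    this is one parallel pass) and keeps the longest agreeing prefix.
    accepted, verify_ctx = [], list(ctx)
    for t in proposal:
        t_target = next_token(W_target, verify_ctx)
        if t_target != t:
            accepted.append(t_target)   # target overrides at first mismatch
            break
        accepted.append(t)
        verify_ctx.append(t)
    else:
        accepted.append(next_token(W_target, verify_ctx))  # free bonus token
    return ctx + accepted

ctx = [1]
for _ in range(5):
    ctx = speculative_step(ctx)
print(ctx)
```

Because the draft tracks the target closely, each step emits several tokens for roughly one target-model pass.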

Efficient AI Lecture 12: Transformer and LLM
Transformer and LLM design from tokenization, embeddings, attention, masking, FFNs, and positional encodings to encoder/decoder variants, KV-cache optimization, grouped-query attention, modern LLM architectures, and multimodal extensions.
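
Grouped-query attention and the KV cache combine neatly in a single decode step: several query heads share each cached KV head, shrinking the cache by the group factor. A minimal sketch with illustrative dimensions:

```python
# One grouped-query attention decode step with a KV cache.
# All dimensions are toy values chosen for illustration.
import numpy as np

n_q_heads, n_kv_heads, d_head = 8, 2, 16   # 4 query heads share each KV head
group = n_q_heads // n_kv_heads
rng = np.random.default_rng(0)

# The KV cache holds keys/values for all previously decoded positions.
T_past = 10
k_cache = rng.standard_normal((n_kv_heads, T_past, d_head))
v_cache = rng.standard_normal((n_kv_heads, T_past, d_head))

# New token: one query per query head, one K/V per KV head.
q_new = rng.standard_normal((n_q_heads, d_head))
k_new = rng.standard_normal((n_kv_heads, d_head))
v_new = rng.standard_normal((n_kv_heads, d_head))

# Append the new K/V to the cache (the only state that grows with length).
k_cache = np.concatenate([k_cache, k_new[:, None, :]], axis=1)
v_cache = np.concatenate([v_cache, v_new[:, None, :]], axis=1)

out = np.empty((n_q_heads, d_head))
for h in range(n_q_heads):
    kv = h // group                          # query head h reads KV head kv
    scores = k_cache[kv] @ q_new[h] / np.sqrt(d_head)
    w = np.exp(scores - scores.max()); w /= w.sum()
    out[h] = w @ v_cache[kv]
print(out.shape)  # (8, 16): full per-head outputs from a 4x smaller KV cache
```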

Efficient AI Lecture 11: TinyEngine
TinyEngine makes neural network inference practical on microcontrollers through memory-aware kernels, loop locality, SIMD-aware execution, im2col avoidance, in-place depth-wise convolution, and layout choices such as NHWC.
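
In-place depth-wise convolution is the most self-contained of these tricks: each output channel depends only on the matching input channel, so the input buffer can be reused channel by channel with just one channel of scratch space. A minimal NumPy sketch (the real engine does this in C with finer-grained buffering):

```python
# Near-in-place depth-wise convolution: overwrite the input tensor one
# channel at a time, keeping only a single-channel scratch buffer.
import numpy as np

C, H, W, K = 8, 16, 16, 3
pad = K // 2
x = np.random.default_rng(0).standard_normal((C, H, W)).astype(np.float32)
w = np.random.default_rng(1).standard_normal((C, K, K)).astype(np.float32)

scratch = np.empty((H, W), dtype=np.float32)   # one channel, not a full tensor
for c in range(C):
    xc = np.pad(x[c], pad)                     # 'same' padding for channel c
    for i in range(H):
        for j in range(W):
            scratch[i, j] = np.sum(xc[i:i+K, j:j+K] * w[c])
    x[c] = scratch   # overwrite input channel: peak memory ~ C*H*W + H*W

print(x.shape)
```

Peak activation memory becomes input plus one channel of scratch, instead of a full input plus a full output tensor.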

Efficient AI Lecture 10: MCUNet and TinyML
TinyML under microcontroller memory limits: TinyNAS search-space specialization, Flash and SRAM constraints, CNN activation bottlenecks, patch-based inference, and network redistribution.
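
The activation bottleneck is visible with arithmetic alone. A back-of-the-envelope sketch with made-up layer shapes, counting int8 input plus output activations per layer:

```python
# Why early CNN layers dominate peak SRAM: activation memory per layer is
# resolution^2 * channels, and resolution shrinks faster than channels grow.
# Layer shapes below are illustrative, not from any specific network.
layers = [  # (name, H, W, C_in, C_out)
    ("stem",    112, 112,  3,  16),
    ("stage1",   56,  56, 16,  24),
    ("stage2",   28,  28, 24,  40),
    ("stage3",   14,  14, 40,  80),
    ("head",      7,   7, 80, 160),
]
for name, h, w, cin, cout in layers:
    # Peak memory while running a layer ~ input + output activations (int8).
    peak_kb = (h * w * cin + h * w * cout) / 1024
    print(f"{name:7s} peak activations ~ {peak_kb:7.1f} KB")
# The stem needs hundreds of KB while late stages need tens: patch-based
# inference runs the early stage patch by patch to cut exactly this peak.
```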

Efficient AI Lecture 9: Knowledge Distillation
How a compact student model learns from a larger teacher through soft targets, temperature, intermediate features, self-distillation, online distillation, and task-specific distillation.
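
The core soft-target loss fits in a few lines. A minimal sketch of the temperature-scaled distillation objective in plain NumPy (logits and hyperparameters are random stand-ins):

```python
# Soft-target knowledge distillation loss with temperature T, blending a
# distillation term against a hard-label cross-entropy term.
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    p_t = softmax(teacher_logits, T)               # softened teacher targets
    log_p_s = np.log(softmax(student_logits, T))
    # Cross-entropy with the soft teacher distribution (equals KL up to the
    # teacher-entropy constant); T^2 keeps gradient magnitudes comparable.
    kd = -(p_t * log_p_s).sum(-1).mean() * T * T
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * kd + (1 - alpha) * ce

rng = np.random.default_rng(0)
s, t = rng.standard_normal((4, 10)), rng.standard_normal((4, 10))
print(distill_loss(s, t, labels=np.array([1, 0, 3, 7])))
```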

Efficient AI Lecture 8: Neural Architecture Search (Part II)
Accuracy estimation, weight inheritance, hypernetworks, ProxylessNAS, Once-for-All networks, zero-shot NAS, and joint neural network, mapping, and accelerator search.

Efficient AI Lecture 7: Neural Architecture Search (Part I)
Classic efficient building blocks, cell-level NAS search spaces, elastic scaling dimensions, and the main architecture-search strategies from grid search to RL, differentiable search, and evolution.
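
Of the search strategies, evolution is the simplest to sketch. A toy mutate-and-select loop over a three-dimensional search space, with a made-up fitness proxy standing in for measured accuracy:

```python
# Toy evolutionary architecture search: candidates are (depth, width,
# kernel) tuples; the fitness function is a fake proxy for illustration.
import random
random.seed(0)

DEPTHS, WIDTHS, KERNELS = [2, 3, 4], [16, 32, 64], [3, 5, 7]

def sample():
    return (random.choice(DEPTHS), random.choice(WIDTHS), random.choice(KERNELS))

def fitness(arch):
    d, w, k = arch
    # Stand-in objective: reward capacity, penalize cost. A real search
    # would train (or inherit weights) and measure validation accuracy.
    return d * w / (k * k)

def mutate(arch):
    d, w, k = arch
    i = random.randrange(3)          # resample one dimension at random
    return (random.choice(DEPTHS) if i == 0 else d,
            random.choice(WIDTHS) if i == 1 else w,
            random.choice(KERNELS) if i == 2 else k)

pop = [sample() for _ in range(8)]
for _ in range(20):
    parent = max(random.sample(pop, 3), key=fitness)   # tournament selection
    pop.append(mutate(parent))
    pop.remove(min(pop, key=fitness))                  # drop the weakest
print(max(pop, key=fitness))
```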

Efficient AI Lecture 6: Quantization (Part II)
Post-training quantization granularity, clipping and calibration, AdaRound, QAT with STE, and binary/ternary quantization methods for pushing precision lower while keeping accuracy under control.
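
QAT with a straight-through estimator fits in a few lines: the forward pass uses fake-quantized weights, while the backward pass pretends rounding was the identity. A minimal sketch on a least-squares toy problem, assuming symmetric per-tensor fake quantization (all hyperparameters illustrative):

```python
# Quantization-aware training with a straight-through estimator (STE),
# on a toy 1-D least-squares problem in pure NumPy.
import numpy as np

def fake_quant(w, n_bits=4):
    # Symmetric per-tensor quantize-dequantize ("fake quantization").
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))
w_true = rng.standard_normal(8)
y = X @ w_true

w = np.zeros(8)                        # latent full-precision weights
for step in range(200):
    wq = fake_quant(w)                 # forward pass sees quantized weights
    err = X @ wq - y
    grad = X.T @ err / len(X)          # STE: treat d(wq)/d(w) as 1
    w -= 0.1 * grad                    # update the latent weights
print("loss:", float(np.mean((X @ fake_quant(w) - y) ** 2)))
```

The loss plateaus at the quantization error floor: the latent weights converge, and only the 4-bit rounding remains.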

Efficient AI Lecture 5: Quantization (Part I)
Why low-bit arithmetic saves energy, how numeric formats trade off range and precision, and how K-means and linear quantization connect compression to hardware-friendly integer compute.
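
The core of linear quantization is one affine map, r ≈ S·(q − Z): a real-valued scale S and an integer zero point Z tie the integer grid to the tensor's range. A minimal round-trip sketch:

```python
# Asymmetric linear quantization to uint8 and back: r ~ S * (q - Z).
import numpy as np

def quantize(x, n_bits=8):
    qmin, qmax = 0, 2 ** n_bits - 1
    S = (x.max() - x.min()) / (qmax - qmin)    # real units per integer step
    Z = int(round(qmin - x.min() / S))         # integer that maps to real 0.0
    q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.uint8)
    return q, S, Z

def dequantize(q, S, Z):
    return S * (q.astype(np.float32) - Z)

x = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, S, Z = quantize(x)
print("max abs error:", np.abs(x - dequantize(q, S, Z)).max(), "<= S/2 =", S / 2)
```

Because the map is affine, matrix multiplies can run entirely in integer arithmetic, with S and Z folded in once at the end.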

Efficient AI Lecture 4: Pruning and Sparsity (Part II)
Layer-wise pruning ratios, automatic pruning with AMC and NetAdapt, fine-tuning after pruning, and the hardware systems that turn sparsity into real speed and energy gains.
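
AMC and NetAdapt automate the choice of per-layer ratios; as a hand-rolled stand-in, a sensitivity scan prunes one layer at a time and keeps the largest ratio that stays within an error budget. A toy sketch on a random two-layer model (not the lecture's method; all numbers illustrative):

```python
# Sensitivity scan for per-layer pruning ratios on a toy two-layer model.
import numpy as np
rng = np.random.default_rng(0)

X = rng.standard_normal((512, 32))
W1, W2 = rng.standard_normal((32, 64)), rng.standard_normal((64, 16))
y_ref = np.tanh(X @ W1) @ W2                    # unpruned reference output

def prune(W, ratio):
    # Magnitude pruning: zero the smallest `ratio` fraction of weights.
    k = int(W.size * ratio)
    thresh = np.sort(np.abs(W), axis=None)[k]
    return np.where(np.abs(W) >= thresh, W, 0.0)

budget = 0.2                                     # allowed relative output error
for name, idx in [("W1", 0), ("W2", 1)]:
    chosen = 0.0
    for ratio in (0.1, 0.3, 0.5, 0.7):
        Ws = [W1, W2]
        Ws[idx] = prune(Ws[idx], ratio)          # prune only this layer
        y = np.tanh(X @ Ws[0]) @ Ws[1]
        rel_err = np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref)
        if rel_err <= budget:
            chosen = ratio                       # largest ratio within budget
    print(f"{name}: keep sparsity {chosen:.0%}")
```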

Efficient AI Lecture 3: Pruning and Sparsity (Part I)
Why memory dominates energy, how pruning is formulated with an L0 constraint, the hardware tradeoff between unstructured and structured sparsity, and the main pruning criteria from magnitude to second-order and regression-based methods.
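
The unstructured-versus-structured tradeoff shows up on a single weight matrix: an elementwise magnitude mask satisfies the L0 budget but leaves an irregular pattern hardware handles poorly, while dropping whole rows keeps dense shapes. A minimal sketch with an arbitrary toy matrix:

```python
# Unstructured vs. structured magnitude pruning on one weight matrix.
import numpy as np
W = np.random.default_rng(0).standard_normal((4, 8))

# Unstructured: keep the k largest-magnitude weights anywhere (||W||_0 <= k).
k = 8
thresh = np.partition(np.abs(W), -k, axis=None)[-k]
W_unstructured = W * (np.abs(W) >= thresh)

# Structured: drop entire rows (e.g. channels) by L2 norm, shrinking the matrix.
keep_rows = np.argsort(np.linalg.norm(W, axis=1))[-2:]   # keep 2 of 4 rows
W_structured = W[np.sort(keep_rows)]

print(W_unstructured.shape, (W_unstructured != 0).sum())  # (4, 8), 8 nonzeros
print(W_structured.shape)                                 # (2, 8): truly smaller
```

Unstructured pruning needs sparse indexing or dedicated hardware to pay off; structured pruning speeds up any dense kernel, at a cost in accuracy per removed parameter.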

Efficient AI Lecture 1: Introduction
Why efficient AI needs both algorithmic compression and hardware specialization: Deep Compression, EIE, MCUNetV3, efficient LMs, and the hardware trends driving co-design.