Goodfellow Deep Learning — Chapter 7.6: Semi-Supervised Learning

deep learning

regularization

semi-supervised learning

Author

Chao Ma

Published

October 22, 2025

Overview

When labeled data is scarce, semi-supervised learning leverages both labeled and unlabeled data to improve model performance. This approach combines:

Generative modeling to learn the data distribution $P(x)$
Supervised classification to learn $P(y|x)$
Joint optimization that balances both objectives

1. The Problem: Limited Labeled Data

In many real-world scenarios:

Labeled data is expensive to obtain (requires human annotation)
Unlabeled data is abundant and cheap
Models trained only on limited labeled data tend to overfit

Solution: Use unlabeled data to learn better representations and regularize the model.

2. Two Learning Objectives

Generative Model (Unsupervised)

Objective: Maximize the likelihood the model assigns to the observed inputs \[ P(x) \]

What this learns:

The underlying distribution of the data
Useful representations of the input space
Structure and patterns in unlabeled data

Classification Model (Supervised)

Objective: Maximize the probability of the correct label given the input \[ P(y|x) \]

What this learns:

Decision boundaries between classes
Task-specific features
Direct mapping from inputs to labels

3. Joint Learning Objective

Combined loss function: \[ \mathcal{L} = -\log P(y|x) - \lambda \log P(x) \]

where:

First term: Supervised loss (negative log-likelihood of the correct label)
Second term: Unsupervised loss (negative log-likelihood of the input under the generative model)
$\lambda$: Trade-off parameter controlling the balance
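
The formula above is written for a single example. One way to write it over a whole dataset, assuming a labeled set $D_L$, an unlabeled set $D_U$, and shared parameters $\theta$ (these symbols are notation introduced here for illustration, not taken from the source), is:

```latex
\mathcal{L}(\theta) = -\sum_{(x, y) \in D_L} \log P(y \mid x; \theta) \;-\; \lambda \sum_{x \in D_L \cup D_U} \log P(x; \theta)
```

Only labeled pairs contribute to the first sum, while every input, labeled or not, contributes to the second.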

Interpretation:

The model must simultaneously:
1. Predict labels correctly (supervised term)
2. Model the data distribution well (unsupervised term)
The unsupervised term acts as regularization, reducing overfitting to the small labeled set
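
As a rough illustration of how the two terms combine, here is a minimal NumPy sketch. It assumes a softmax classifier for $P(y|x)$ and, purely as a stand-in generative model, a diagonal Gaussian for $P(x)$; the function names and the Gaussian choice are assumptions for this example, not something the chapter prescribes.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def supervised_nll(logits, labels):
    """Mean of -log P(y|x) over a labeled batch."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(labels)), labels].mean()

def gaussian_nll(x, mean, var):
    """Mean of -log P(x) under a diagonal Gaussian (stand-in generative model)."""
    log_px = -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var).sum(axis=1)
    return -log_px.mean()

def joint_loss(logits, labels, x_all, mean, var, lam):
    """Supervised term on labeled data plus lam times the unsupervised term on all inputs."""
    return supervised_nll(logits, labels) + lam * gaussian_nll(x_all, mean, var)

# Toy usage: 3 labeled examples (2 classes), 4 inputs for the density term.
logits = np.zeros((3, 2))
labels = np.array([0, 1, 0])
x_all = np.zeros((4, 2))
loss = joint_loss(logits, labels, x_all, mean=np.zeros(2), var=np.ones(2), lam=0.5)
```

Setting `lam` to 0 recovers plain supervised training; larger values give the density term more influence.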

4. Why This Works

Key insight: When the model learns how to represent $P(x)$, it discovers where the data is dense. Decision boundaries should avoid cutting through high-density regions — they should instead pass through low-density areas between clusters.

Geometric interpretation:

Learning $P(x)$ reveals the natural clustering structure of the data
Classification boundaries are encouraged to lie in low-density regions
This discourages the decision boundary from crossing through dense data manifolds
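
The geometric picture can be made concrete with a toy 1-D example. The sketch below assumes two labeled points at -2 and +2 (one per class) and unlabeled points drawn from two Gaussian clusters around them; it estimates density with a histogram and places the boundary where the unlabeled data is sparsest. The helper `low_density_threshold` and all numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def low_density_threshold(x_unlabeled, left, right, bins=40):
    """Average the centers of the emptiest histogram bins between `left` and `right`."""
    counts, edges = np.histogram(x_unlabeled, bins=bins, range=(left, right))
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[counts == counts.min()].mean()

rng = np.random.default_rng(0)
# Unlabeled data: two clusters, one around each labeled point.
unlabeled = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(2.0, 0.5, 500)])

# Search for the boundary only between the two labeled points.
threshold = low_density_threshold(unlabeled, left=-2.0, right=2.0)
```

With only the two labeled points, any threshold between -2 and +2 separates them equally well; the unlabeled data is what singles out the gap between the clusters.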

Benefits:

Better representations: Unlabeled data reveals the structure of the input space
Cluster assumption: Decision boundaries naturally form between clusters, not through them
Regularization: The generative term prevents the classifier from focusing only on labeled examples
Data efficiency: Can achieve high accuracy with significantly fewer labeled samples

Example:

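On some benchmarks, semi-supervised methods trained with roughly 10% of the labels have been reported to approach the performance of fully supervised training on all labels. The size of the gain depends on how well the cluster assumption holds for the data.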

5. Real-World Applications

Note: The following content is generated by ChatGPT.

Domain	Task / Problem	Unlabeled Data Used	Method Family	Real-World Benefit	Reference
Image Recognition	Classifying natural images (CIFAR-10, ImageNet-100)	Millions of unlabeled web images	Consistency Regularization (FixMatch, Mean Teacher)	+15–25% accuracy with 10× fewer labeled samples	Sohn et al., FixMatch, 2020
Medical Imaging	Tumor or lesion segmentation (MRI / CT)	Thousands of unlabeled scans	Generative / Consistency Hybrid (VAE, U-Net)	~80% annotation cost reduction; works well with rare cases	Bai et al., MedIA, 2019
Speech Recognition	Automatic speech recognition (ASR)	Large amounts of raw audio	Representation Learning (wav2vec 2.0)	Matches full supervision using <10% labeled data	Baevski et al., wav2vec 2.0, 2020
Natural Language Processing	Text classification, sentiment analysis	Billions of unlabeled sentences	Self-Supervised Pretraining (BERT, RoBERTa)	Massive improvement in downstream $P(y \mid x)$ tasks	Devlin et al., BERT, 2018
Autonomous Driving	Scene understanding, lane detection	Continuous unlabeled video streams	Consistency + Pseudo-Labeling	Robust to lighting/weather; reduces manual labels	French et al., 2020
Financial Fraud Detection	Detecting anomalous transactions	Transaction logs without labels	Generative Modeling (VAE / GAN)	Learns normal patterns → better anomaly detection	Xu et al., KDD, 2018
Recommendation Systems	Predicting user preferences	User–item logs without explicit feedback	Representation Learning (Autoencoder / Contrastive)	Improves cold-start and leverages implicit signals	—

6. Common Semi-Supervised Learning Methods

Note: The following content is generated by ChatGPT.

Consistency Regularization

Idea: Model should produce similar predictions for perturbed versions of the same input
Examples: FixMatch, Mean Teacher, Virtual Adversarial Training
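
A minimal sketch of the idea, assuming Gaussian input noise as the perturbation and a squared-error penalty between the two predictions. This is a simplification for illustration and not the exact procedure of FixMatch, Mean Teacher, or VAT; the linear classifier `W` is hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def consistency_loss(predict, x_unlabeled, noise_std, rng):
    """Mean squared difference between predictions on clean and noise-perturbed inputs."""
    x_noisy = x_unlabeled + rng.normal(0.0, noise_std, size=x_unlabeled.shape)
    diff = predict(x_unlabeled) - predict(x_noisy)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))          # hypothetical linear classifier: 2 features, 3 classes
predict = lambda x: softmax(x @ W)   # maps inputs to class probabilities
x_u = rng.normal(size=(5, 2))        # unlabeled batch; no labels are needed for this loss
loss = consistency_loss(predict, x_u, noise_std=0.1, rng=rng)
```

In training, this term would be added to the supervised loss with its own weight, in the same way as the $\lambda$-weighted term in the joint objective.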

Pseudo-Labeling

Idea: Use the model’s confident predictions on unlabeled data as training targets (“pseudo-labels”)
Process: Train → predict on unlabeled → retrain with pseudo-labels
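
The selection step in the middle of that process can be sketched as follows. The helper name `select_pseudo_labels` and the 0.95 confidence threshold are assumptions for this example; the right threshold is a tuning choice.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.95):
    """Return indices of confident unlabeled examples and their hard pseudo-labels."""
    confidence = probs.max(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    return keep, probs[keep].argmax(axis=1)

# Predicted class probabilities for three unlabeled examples (made-up values).
probs = np.array([
    [0.97, 0.03],   # confident: class 0
    [0.60, 0.40],   # not confident: skipped
    [0.02, 0.98],   # confident: class 1
])
keep, pseudo_labels = select_pseudo_labels(probs)
```

The kept examples and their pseudo-labels are then merged with the labeled set for the retraining step. Because wrong pseudo-labels reinforce themselves, a stricter threshold trades coverage for label quality.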

Generative Models

Idea: Learn $P(x)$ and $P(y|x)$ jointly, with parameters shared between the two models
Examples: VAE, GAN-based approaches
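
One common way to tie the two quantities together, assumed here for illustration, is to model class-conditional densities $P(x \mid y)$ and a class prior $P(y)$. Both $P(x)$ and $P(y \mid x)$ then follow from the same parameters:

```latex
P(x) = \sum_{y'} P(x \mid y') \, P(y'), \qquad
P(y \mid x) = \frac{P(x \mid y) \, P(y)}{\sum_{y'} P(x \mid y') \, P(y')}
```

Unlabeled inputs constrain the parameters through $P(x)$, and those same parameters determine the classifier through Bayes' rule.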

Self-Supervised Pretraining

Idea: Pretrain on unlabeled data with pretext tasks, then fine-tune on labeled data
Examples: BERT (masked language modeling), wav2vec 2.0 (contrastive learning)
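
To show how a pretext task manufactures training targets from unlabeled text, here is a simplified masking sketch. The `mask_tokens` helper and the 0.15 masking probability are assumptions for this example, and BERT's actual masking procedure is more involved than this.

```python
import numpy as np

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Return (masked_tokens, targets); targets maps each masked position to its original token."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tok
    return masked, targets

rng = np.random.default_rng(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, targets = mask_tokens(tokens, mask_prob=0.15, rng=rng)
```

A model pretrained to recover `targets` from `masked` never sees a human label; the labeled data is used only in the later fine-tuning stage.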

Source: Deep Learning Book (Goodfellow et al.), Chapter 7.6
