Reinforcement Learning

Author

Chao Ma

Published

April 21, 2026

Course notes on reinforcement learning, from David Silver’s foundations to RL methods used for language models.

David Silver RL Course - Lecture 10: Classic Games
Classic games as RL case studies: game theory, minimax search, self-play TD learning, TD-Gammon, TreeStrap, Monte Carlo tree search, and imperfect-information games.

David Silver RL Course - Lecture 9: Exploration and Exploitation
Exploration and exploitation through bandits, regret, epsilon-greedy, UCB, Bayesian bandits, Thompson Sampling, information-state search, contextual bandits, and MDPs.

David Silver RL Course - Lecture 8: Integrating Learning and Planning
Model-based RL, learned transition and reward models, planning with simulated experience, Dyna, Dyna-Q, forward search, Monte Carlo tree search, TD search, and Dyna-2.

David Silver RL Course - Lecture 7: Policy Gradient Methods
Direct policy optimization with score functions, softmax and Gaussian policies, REINFORCE, actor-critic methods, baselines, advantage functions, eligibility traces, and natural policy gradients.

David Silver RL Course - Lecture 6: Value Function Approximation
Value and action-value approximation for large MDPs: features, gradient descent, Monte Carlo and TD targets, convergence caveats, batch least squares, experience replay, and DQN stabilization.

David Silver RL Course - Lecture 5: Model-Free Control
Control without a model: epsilon-greedy improvement, GLIE, Monte Carlo control, Sarsa, n-step and lambda methods, off-policy learning, importance sampling, and Q-learning.

CMU Advanced NLP: Reinforcement Learning
Reward functions for language models, preference-based reward models, REINFORCE, sequence-level credit assignment, KL regularization, baselines, and PPO.

David Silver RL Course - Lecture 4: Model-Free Prediction
Model-free policy evaluation through Monte Carlo returns, first-visit vs every-visit updates, TD learning, the bias-variance tradeoff, and TD(lambda).

David Silver RL Course - Lecture 3: Planning by Dynamic Programming
Dynamic programming in known MDPs: optimal substructure, iterative policy evaluation, policy iteration, value iteration, and the classical gridworld examples.

David Silver RL Course - Lecture 2: Markov Decision Process
Markov property, transition matrices, Markov reward processes, return and discounting, Bellman equations, and the move from prediction to control in MDPs.

David Silver RL Course - Lecture 1: Introduction to Reinforcement Learning
What makes RL different from supervised learning, the agent-environment loop, Markov state, policy/value/model, and the core RL tradeoffs.
