Reinforcement Learning

Author

Chao Ma

Published

April 21, 2026

Course notes on reinforcement learning, from David Silver’s foundations to RL methods used for language models.


David Silver RL Course - Lecture 10: Classic Games
Classic games as RL case studies: game theory, minimax search, self-play TD learning, TD-Gammon, TreeStrap, Monte Carlo tree search, and imperfect-information games.

David Silver RL Course - Lecture 9: Exploration and Exploitation
Exploration and exploitation through bandits, regret, epsilon-greedy, UCB, Bayesian bandits, Thompson Sampling, information-state search, contextual bandits, and MDPs.

David Silver RL Course - Lecture 8: Integrating Learning and Planning
Model-based RL, learned transition and reward models, planning with simulated experience, Dyna, Dyna-Q, forward search, Monte Carlo tree search, TD search, and Dyna-2.

David Silver RL Course - Lecture 7: Policy Gradient Methods
Direct policy optimization with score functions, softmax and Gaussian policies, REINFORCE, actor-critic methods, baselines, advantage functions, eligibility traces, and natural policy gradients.

David Silver RL Course - Lecture 6: Value Function Approximation
Value and action-value approximation for large MDPs: features, gradient descent, Monte Carlo and TD targets, convergence caveats, batch least squares, experience replay, and DQN stabilization.

David Silver RL Course - Lecture 5: Model-Free Control
Control without a model: epsilon-greedy improvement, GLIE, Monte Carlo control, Sarsa, n-step and lambda methods, off-policy learning, importance sampling, and Q-learning.

CMU Advanced NLP: Reinforcement Learning
Reward functions for language models, preference-based reward models, REINFORCE, sequence-level credit assignment, KL regularization, baselines, and PPO.

David Silver RL Course - Lecture 4: Model-Free Prediction
Model-free policy evaluation through Monte Carlo returns, first-visit vs every-visit updates, TD learning, the bias-variance tradeoff, and TD(lambda).

David Silver RL Course - Lecture 3: Planning by Dynamic Programming
Dynamic programming in known MDPs: optimal substructure, iterative policy evaluation, policy iteration, value iteration, and the classical gridworld examples.

David Silver RL Course - Lecture 2: Markov Decision Process
Markov property, transition matrices, Markov reward processes, return and discounting, Bellman equations, and the move from prediction to control in MDPs.

David Silver RL Course - Lecture 1: Introduction to Reinforcement Learning
What makes RL different from supervised learning, the agent-environment loop, Markov state, policy/value/model, and the core RL tradeoffs.