Khalil Virji
COMP 579 - Reinforcement Learning
-
This was a Reinforcement Learning course that I took with Professor Doina Precup during my Master's at
McGill. Some of the topics we covered are listed below.
-
Bandits - Multi-armed bandits, exploration vs exploitation, action-value methods,
epsilon-greedy action selection, UCB action selection,
non-stationary bandits, regret, Hoeffding's inequality, gradient-bandit algorithms,
Boltzmann/softmax exploration, Thompson sampling
-
Markov Decision Processes - Markov property, policies, goals and rewards, action and
state spaces, value functions, state-value functions, state transitions, discounted returns,
episodic vs continuing tasks, Bellman equation
-
Sequential decision making - Iterative policy evaluation and improvement, generalized
policy iteration (GPI), value iteration, dynamic programming, Monte Carlo policy evaluation,
temporal difference (TD) learning methods, on-policy vs off-policy, Q-learning, double
Q-learning, DQN, expected SARSA, semi-gradient methods
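As a quick illustration of the on-policy vs off-policy distinction above, here is a small sketch contrasting the Q-learning and expected SARSA targets for a tabular Q-table. This is my own illustrative code, not from the course materials; the function names and the epsilon-greedy behavior policy are assumptions on my part.

```python
import numpy as np

def q_learning_target(Q, r, s_next, gamma):
    # Off-policy target: bootstrap from the greedy (max) action in the next state.
    return r + gamma * np.max(Q[s_next])

def expected_sarsa_target(Q, r, s_next, gamma, epsilon):
    # On-policy target: expectation over the epsilon-greedy policy's action probabilities.
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    return r + gamma * np.dot(probs, Q[s_next])

def td_update(Q, s, a, target, alpha):
    # Tabular TD update toward whichever target was chosen.
    Q[s, a] += alpha * (target - Q[s, a])
```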
-
Planning - Model-based vs model-free RL, planning from model -> policy, Dyna-Q
algorithm, prioritized sweeping, trajectory sampling, heuristic search, PlaNet, Dreamer,
MuZero
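Dyna-Q is a simple example of combining model-free updates with planning from a learned model. Below is a minimal tabular sketch, assuming a gymnasium-style discrete environment; the function signature and hyperparameter values are my own illustrative choices, not a course reference implementation.

```python
import random
import numpy as np

def dyna_q(env, n_states, n_actions, n_planning=10, episodes=500,
           alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Dyna-Q sketch: learn from real experience, then replay a learned
    deterministic model for extra planning updates (gymnasium-style env assumed)."""
    Q = np.zeros((n_states, n_actions))
    model = {}  # (s, a) -> last observed (r, s_next, terminated)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behavior policy.
            a = env.action_space.sample() if random.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Direct RL: Q-learning update from the real transition.
            Q[s, a] += alpha * (r + gamma * (0.0 if terminated else np.max(Q[s_next])) - Q[s, a])
            # Model learning: remember the transition.
            model[(s, a)] = (r, s_next, terminated)
            # Planning: replay simulated transitions drawn from the model.
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next, pterm) = random.choice(list(model.items()))
                Q[ps, pa] += alpha * (pr + gamma * (0.0 if pterm else np.max(Q[ps_next])) - Q[ps, pa])
            s = s_next
    return Q
```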
-
Policy-Based RL - Value-based vs policy-based RL, policy-gradient methods, contextual
bandits, policy gradient theorem, actor-critic methods, episodic vs continuous action
spaces, A2C and A3C, TRPO, PPO, DDPG
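To make the policy-gradient idea concrete, here is a rough REINFORCE sketch with a linear softmax policy, assuming a gymnasium-style environment with a discrete action space and a feature-vector observation (e.g. Cart-Pole). The parameter shapes and step sizes are my own illustrative choices, not the course's.

```python
import numpy as np

def reinforce_episode(theta, env, gamma=0.99, alpha=0.01):
    """One REINFORCE update: run an episode, then do Monte Carlo policy-gradient ascent.
    theta has shape (n_actions, n_features)."""
    states, actions, rewards = [], [], []
    s, _ = env.reset()
    done = False
    while not done:
        logits = theta @ s
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = np.random.choice(len(probs), p=probs)
        s_next, r, terminated, truncated, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s, done = s_next, terminated or truncated
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G                 # return from time t
        logits = theta @ states[t]
        probs = np.exp(logits - logits.max()); probs /= probs.sum()
        grad_log = -np.outer(probs, states[t])     # grad of log pi(a|s) for a linear softmax policy
        grad_log[actions[t]] += states[t]
        theta += alpha * (gamma ** t) * G * grad_log
    return theta
```

theta can be initialized as, for example, np.zeros((env.action_space.n, env.observation_space.shape[0])) and the function called once per episode.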
-
Hierarchical RL - Options, option-critic architecture, bottleneck states, termination-critic,
generalized value functions (GVFs)
-
Learning from demonstrations - Reward shaping, imitation learning, behavioral cloning,
DAGGER, inverse-RL, max entropy principle, stochastic MDPs
-
Batch-Constrained RL - BCQ, extrapolation error
-
Additional Topics - Reward-is-enough hypothesis, task types: SOAPs, POs, and TOs,
never-ending/continual RL, non-Markovian structures, POMDPs, multi-task RL
-
We also had three assignments and one project throughout the course.
-
Bandit Algorithms - Explored the UCB, Thompson sampling, and epsilon-greedy algorithms
on a k-armed Bernoulli bandit problem
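For reference, here is a minimal sketch of the kind of comparison this assignment involved, on a Bernoulli bandit with randomly drawn arm probabilities. It is my own toy code, not the assignment solution, and the parameter names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_bandit(k=10, steps=1000, epsilon=0.1, ucb_c=None):
    """Epsilon-greedy (ucb_c=None) or UCB action selection on a k-armed Bernoulli bandit."""
    p_true = rng.uniform(size=k)              # unknown success probability of each arm
    Q = np.zeros(k)                           # sample-average value estimates
    N = np.zeros(k)                           # pull counts per arm
    rewards = []
    for t in range(1, steps + 1):
        if ucb_c is not None:
            # UCB: value estimate plus an exploration bonus; untried arms go first.
            ucb = Q + ucb_c * np.sqrt(np.log(t) / np.maximum(N, 1e-12))
            a = int(np.argmax(np.where(N == 0, np.inf, ucb)))
        elif rng.random() < epsilon:
            a = int(rng.integers(k))          # explore uniformly at random
        else:
            a = int(np.argmax(Q))             # exploit the current best estimate
        r = float(rng.random() < p_true[a])   # Bernoulli reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]             # incremental sample-average update
        rewards.append(r)
    return np.mean(rewards), p_true.max()

print(run_bandit(epsilon=0.1))   # epsilon-greedy
print(run_bandit(ucb_c=2.0))     # UCB
```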
-
Tabular RL and function approximation - Explored SARSA, expected SARSA, Q-learning,
and actor-critic algorithms with linear function approximation on the Frozen Lake and
Cart-Pole domains from the OpenAI Gym environment suite
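A small tabular Q-learning loop on Frozen Lake gives a feel for this assignment's setup. The sketch below is written against the gymnasium API (reset returns (obs, info) and step returns a five-tuple; older gym versions differ), and the hyperparameters are illustrative rather than the ones used in the assignment.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection from the current Q-table.
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Off-policy TD target: bootstrap from the greedy action in the next state.
        target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```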
-
Offline RL - Explored imitation learning and fitted Q-learning methods to learn a
policy from previously collected data
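Behavioral cloning, the simplest of these methods, reduces offline policy learning to supervised learning on expert (state, action) pairs. Below is a rough sketch assuming a discrete action space; the dataset file names and shapes are hypothetical, not the assignment's actual data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical offline dataset of expert transitions.
expert_states = np.load("expert_states.npy")    # shape (N, state_dim)
expert_actions = np.load("expert_actions.npy")  # shape (N,), discrete action labels

# Behavioral cloning: fit a classifier mapping states to the expert's actions.
policy = LogisticRegression(max_iter=1000)
policy.fit(expert_states, expert_actions)

def act(state):
    # Cloned policy: predict the action the expert would most likely take.
    return int(policy.predict(state.reshape(1, -1))[0])
```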
-
Project - For our project, we investigated Robust Adversarial Reinforcement Learning (RARL), in which an
agent learns a policy that is robust to distribution shifts between training and testing environments. We built on the paper's methodology by using
a more recent state-of-the-art learning method (TD3) and expanding the set of testing environments. Overall, we observed behavior and
results similar to those reported in the original paper, highlighting the effectiveness of the approach. The full project report can be found here.