COMP 579 - Reinforcement Learning
-
A reinforcement learning (RL) course that I took with Professor Doina Precup during my Master's at
McGill. Topics included:
-
Bandits - Multi-armed bandits, exploration vs exploitation, action-value methods,
epsilon-greedy action selection, UCB action selection,
non-stationary bandits, regret, Hoeffding's inequality, gradient-bandit algorithms,
Boltzmann/softmax exploration, Thompson sampling
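To make the exploration strategies concrete, here is a minimal NumPy sketch (illustrative only, not course code) of epsilon-greedy and UCB action selection on a Bernoulli bandit; the arm probabilities, step count, and hyperparameters are arbitrary placeholders.

```python
import numpy as np

def run_bandit(true_probs, steps=2000, method="eps_greedy", eps=0.1, c=2.0, seed=0):
    """Run one bandit experiment and return the learned action-value estimates."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    counts = np.zeros(k)          # pulls per arm
    values = np.zeros(k)          # incremental estimate of each arm's mean reward
    for t in range(steps):
        if method == "eps_greedy":
            # explore with probability eps, otherwise act greedily on the estimates
            a = rng.integers(k) if rng.random() < eps else int(np.argmax(values))
        else:  # UCB: pull untried arms first, then add an exploration bonus
            if np.any(counts == 0):
                a = int(np.argmin(counts))
            else:
                a = int(np.argmax(values + c * np.sqrt(np.log(t + 1) / counts)))
        r = float(rng.random() < true_probs[a])     # Bernoulli reward
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]    # incremental mean update
    return values

arms = [0.2, 0.5, 0.7]                              # arbitrary arm means
print("eps-greedy:", np.round(run_bandit(arms, method="eps_greedy"), 3))
print("UCB:       ", np.round(run_bandit(arms, method="ucb"), 3))
```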
-
Markov Decision Processes - Markov property, policies, goals and rewards, action and
state spaces, value functions, state-value functions, state transitions, discounted returns,
episodic vs continuing tasks, Bellman equation
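For reference, the discounted return and the Bellman equation for the state-value function in standard notation (a generic statement, not lifted from the course notes):

```latex
% Discounted return and the Bellman equation for v_pi (standard notation)
G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr]
```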
-
Sequential decision making - Iterative policy evaluation and improvement, generalized
policy iteration (GPI), value iteration, dynamic programming, Monte Carlo policy evaluation,
temporal difference (TD) learning methods, on-policy vs off-policy, Q-learning, double
Q-learning, DQN, expected Sarsa, semi-gradient methods
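A minimal tabular Q-learning sketch illustrating the TD methods above (generic code, not the course's assignment solution; it assumes the newer Gymnasium-style reset/step signatures and discrete observation/action spaces):

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection from the current Q estimates
            a = env.action_space.sample() if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # off-policy TD target: bootstrap from the greedy action in s_next
            target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```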
-
Planning - Model-based vs model-free RL, planning from model -> policy, Dyna-Q
algorithm, prioritized sweeping, trajectory sampling, heuristic search, PlaNet, Dreamer,
MuZero
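A rough sketch of the Dyna-Q idea (illustrative, not course code): after each real transition the agent does a direct Q-learning update, records the transition in a simple deterministic model, and then performs a few planning updates from transitions replayed out of that model. Here Q is a NumPy array indexed by [state, action] and model is a plain dict.

```python
import random
import numpy as np

def dyna_q_step(Q, model, s, a, r, s_next, done, alpha=0.1, gamma=0.99, n_planning=10):
    """One Dyna-Q update: learn from the real transition, then plan from the model."""
    # (1) direct RL update from the real experience
    target = r + gamma * (0.0 if done else np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
    # (2) model learning: remember the last observed outcome (deterministic model)
    model[(s, a)] = (r, s_next, done)
    # (3) planning: replay simulated transitions sampled from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
        ptarget = pr + gamma * (0.0 if pdone else np.max(Q[ps_next]))
        Q[ps, pa] += alpha * (ptarget - Q[ps, pa])
    return Q, model
```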
-
Policy-Based RL - Value-based vs policy-based RL, policy-gradient methods, contextual
bandits, policy gradient theorem, actor-critic methods, episodic vs continuous action
spaces, A2C and A3C, TRPO, PPO, DDPG
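To make the policy-gradient idea concrete, a minimal REINFORCE sketch with a linear-softmax policy over discrete actions (a generic illustration; the feature vector phi_s and the step size are placeholders, not anything from the course):

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """Action probabilities for a linear-softmax policy: pi(a|s) ∝ exp(theta_a · phi(s))."""
    prefs = theta @ phi_s
    prefs -= prefs.max()                      # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_episode(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a finished episode of (phi_s, action, reward) tuples."""
    G = 0.0
    # iterate backwards so the return G_t can be accumulated incrementally
    for phi_s, a, r in reversed(episode):
        G = r + gamma * G
        p = softmax_policy(theta, phi_s)
        # gradient of log pi(a|s) for a linear-softmax policy
        grad_log_pi = -np.outer(p, phi_s)
        grad_log_pi[a] += phi_s
        theta += alpha * G * grad_log_pi      # ascend the policy-gradient estimate
    return theta
```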
-
Hierarchical RL - Options, option-critic architecture, bottleneck states, termination-critic,
generalized value functions (GVFs)
-
Learning from demonstrations - Reward shaping, imitation learning, behavioral cloning,
DAgger, inverse RL, the maximum entropy principle, stochastic MDPs
-
Batch-Constrained RL - BCQ, extrapolation error
-
Additional Topics - Reward-is-enough hypothesis, task types: SOAPs, POs, and TOs,
never-ending/continual RL, non-Markovian structures, POMDPs, multi-task RL
-
We had three assignments and one final project.
-
Bandit Algorithms - Explored UCB, Thompson sampling, and epsilon-greedy algorithms
on a k-armed Bernoulli bandit problem
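For the Bernoulli setting, Thompson sampling has a particularly clean form with Beta posteriors. A generic sketch follows (the arm probabilities are arbitrary, not the assignment's actual configuration):

```python
import numpy as np

def thompson_bernoulli(true_probs, steps=2000, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors on each arm."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    successes = np.ones(k)    # Beta alpha parameters (uniform prior)
    failures = np.ones(k)     # Beta beta parameters
    for _ in range(steps):
        # sample a plausible mean for each arm from its posterior, act greedily on the sample
        a = int(np.argmax(rng.beta(successes, failures)))
        r = rng.random() < true_probs[a]
        successes[a] += r
        failures[a] += 1 - r
    return successes / (successes + failures)   # posterior mean estimates

print(np.round(thompson_bernoulli([0.2, 0.5, 0.7]), 3))
```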
-
Tabular RL and function approximation - Explored SARSA, expected SARSA, Q-learning,
and actor-critic algorithms with linear function approximation on the Frozen Lake and
Cart-pole domains from the OpenAI Gym environment suite
-
Offline RL - Explored imitation learning and fitted Q-learning methods to learn a
policy from collected data
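As an illustration of the fitted Q-learning idea on a fixed batch of logged transitions (tabular here for simplicity; this is a generic sketch, not the assignment's setup or data):

```python
import numpy as np

def fitted_q(dataset, n_states, n_actions, iterations=50, gamma=0.99):
    """Fitted Q iteration over a fixed dataset of (s, a, r, s_next, done) tuples.

    Each sweep regresses Q toward one-step bootstrapped targets computed from the
    previous Q estimate, using only the logged transitions (no environment interaction).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(iterations):
        targets = {}   # (s, a) -> list of bootstrapped targets observed in the batch
        for s, a, r, s_next, done in dataset:
            y = r + gamma * (0.0 if done else np.max(Q[s_next]))
            targets.setdefault((s, a), []).append(y)
        Q_new = Q.copy()
        for (s, a), ys in targets.items():
            Q_new[s, a] = np.mean(ys)          # "regression" reduces to averaging in the tabular case
        Q = Q_new
    # greedy policy w.r.t. the final Q, only meaningful for (s, a) pairs covered by the data
    return Q, np.argmax(Q, axis=1)
```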
-
Project - For our project, we implemented Robust Adversarial Reinforcement Learning (RARL) to enable
an agent to learn a policy that is robust to distribution shifts between the training and testing
environments. We built on the methodology in the paper by using TD3, demonstrating that the
RARL approach is method-agnostic, and by expanding the set of test scenarios.
Overall, we observed behavior similar to that reported in the original paper, highlighting the
effectiveness of RARL. The full project report can be found here.