Khalil Virji
COMP 579 - Reinforcement Learning
-
This was a Reinforcement Learning course that I took with Professor Doina Precup during my Master's at
McGill. Some of the topics we covered are listed below.
-
Bandits - Multi-armed bandits, exploration vs exploitation, action-value methods,
epsilon-greedy action selection, UCB action selection,
non-stationary bandits, regret, Hoeffding's inequality, gradient-bandit algorithms,
Boltzmann/softmax exploration, Thompson sampling
-
Markov Decision Processes - Markov property, policies, goals and rewards, action and
state spaces, value functions, state-value functions, state transitions, discounted returns,
episodic vs continuing tasks, Bellman equation
-
Sequential decision making - Iterative policy evaluation and improvement, generalized
policy iteration (GPI), value iteration, dynamic programming, Monte Carlo policy evaluation,
temporal difference (TD) learning methods, on-policy vs off-policy, Q-learning, double
Q-learning, DQN, expected SARSA, semi-gradient methods
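As a quick illustration of the on-policy vs off-policy distinction above, here is a small sketch contrasting the Q-learning and expected SARSA targets for a tabular Q-table. This is my own illustrative code, not from the course materials; the function names and the epsilon-greedy behavior policy are assumptions on my part.

```python
import numpy as np

def q_learning_target(Q, r, s_next, gamma):
    # Off-policy target: bootstrap from the greedy (max) action in the next state.
    return r + gamma * np.max(Q[s_next])

def expected_sarsa_target(Q, r, s_next, gamma, epsilon):
    # On-policy target: expectation over the epsilon-greedy policy's action probabilities.
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    return r + gamma * np.dot(probs, Q[s_next])

def td_update(Q, s, a, target, alpha):
    # Tabular TD update toward whichever target was chosen.
    Q[s, a] += alpha * (target - Q[s, a])
```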
-
Planning - Model-based vs model-free RL, planning from model -> policy, Dyna-Q
algorithm, prioritized sweeping, trajectory sampling, heuristic search, PlaNet, Dreamer,
MuZero
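Dyna-Q is a simple example of combining model-free updates with planning from a learned model. Below is a minimal tabular sketch, assuming a gymnasium-style discrete environment; the function signature and hyperparameter values are my own illustrative choices, not a course reference implementation.

```python
import random
import numpy as np

def dyna_q(env, n_states, n_actions, n_planning=10, episodes=500,
           alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Dyna-Q sketch: learn from real experience, then replay a learned
    deterministic model for extra planning updates (gymnasium-style env assumed)."""
    Q = np.zeros((n_states, n_actions))
    model = {}  # (s, a) -> last observed (r, s_next, terminated)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behavior policy.
            a = env.action_space.sample() if random.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Direct RL: Q-learning update from the real transition.
            Q[s, a] += alpha * (r + gamma * (0.0 if terminated else np.max(Q[s_next])) - Q[s, a])
            # Model learning: remember the transition.
            model[(s, a)] = (r, s_next, terminated)
            # Planning: replay simulated transitions drawn from the model.
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next, pterm) = random.choice(list(model.items()))
                Q[ps, pa] += alpha * (pr + gamma * (0.0 if pterm else np.max(Q[ps_next])) - Q[ps, pa])
            s = s_next
    return Q
```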
-
Policy-Based RL - Value-based vs policy-based RL, policy-gradient methods, contextual
bandits, policy gradient theorem, actor-critic methods, episodic vs continuous action
spaces, A2C and A3C, TRPO, PPO, DDPG
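To make the policy-gradient idea concrete, here is a rough REINFORCE sketch with a linear softmax policy, assuming a gymnasium-style environment with a discrete action space and a feature-vector observation (e.g. Cart-Pole). The parameter shapes and step sizes are my own illustrative choices, not the course's.

```python
import numpy as np

def reinforce_episode(theta, env, gamma=0.99, alpha=0.01):
    """One REINFORCE update: run an episode, then do Monte Carlo policy-gradient ascent.
    theta has shape (n_actions, n_features)."""
    states, actions, rewards = [], [], []
    s, _ = env.reset()
    done = False
    while not done:
        logits = theta @ s
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = np.random.choice(len(probs), p=probs)
        s_next, r, terminated, truncated, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s, done = s_next, terminated or truncated
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G                 # return from time t
        logits = theta @ states[t]
        probs = np.exp(logits - logits.max()); probs /= probs.sum()
        grad_log = -np.outer(probs, states[t])     # grad of log pi(a|s) for a linear softmax policy
        grad_log[actions[t]] += states[t]
        theta += alpha * (gamma ** t) * G * grad_log
    return theta
```

theta can be initialized as, for example, np.zeros((env.action_space.n, env.observation_space.shape[0])) and the function called once per episode.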
-
Hierarchical RL - Options, option-critic architecture, bottleneck states, termination-critic,
generalized value functions (GVFs)
-
Learning from demonstrations - Reward shaping, imitation learning, behavioral cloning,
DAGGER, inverse-RL, max entropy principle, stochastic MDPs
-
Batch-Constrained RL - BCQ, extrapolation error
-
Additional Topics - Reward-is-enough hypothesis, task types: SOAPs, POs, and TOs,
never-ending/continual RL, non-Markovian structures, POMDPs, multi-task RL
-
We also had three assignments and one project throughout the course.
-
Bandit Algorithms - Explored the UCB, Thompson sampling, and epsilon-greedy algorithms
on a k-armed Bernoulli bandit problem
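For reference, here is a minimal sketch of the kind of comparison this assignment involved, on a Bernoulli bandit with randomly drawn arm probabilities. It is my own toy code, not the assignment solution, and the parameter names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_bandit(k=10, steps=1000, epsilon=0.1, ucb_c=None):
    """Epsilon-greedy (ucb_c=None) or UCB action selection on a k-armed Bernoulli bandit."""
    p_true = rng.uniform(size=k)              # unknown success probability of each arm
    Q = np.zeros(k)                           # sample-average value estimates
    N = np.zeros(k)                           # pull counts per arm
    rewards = []
    for t in range(1, steps + 1):
        if ucb_c is not None:
            # UCB: value estimate plus an exploration bonus; untried arms go first.
            ucb = Q + ucb_c * np.sqrt(np.log(t) / np.maximum(N, 1e-12))
            a = int(np.argmax(np.where(N == 0, np.inf, ucb)))
        elif rng.random() < epsilon:
            a = int(rng.integers(k))          # explore uniformly at random
        else:
            a = int(np.argmax(Q))             # exploit the current best estimate
        r = float(rng.random() < p_true[a])   # Bernoulli reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]             # incremental sample-average update
        rewards.append(r)
    return np.mean(rewards), p_true.max()

print(run_bandit(epsilon=0.1))   # epsilon-greedy
print(run_bandit(ucb_c=2.0))     # UCB
```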
-
Tabular RL and function approximation - Explored SARSA, expected SARSA, Q-learning,
and actor-critic algorithms with linear function approximation on the Frozen Lake and
Cart-Pole domains from the OpenAI Gym environment suite
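A small tabular Q-learning loop on Frozen Lake gives a feel for this assignment's setup. The sketch below is written against the gymnasium API (reset returns (obs, info) and step returns a five-tuple; older gym versions differ), and the hyperparameters are illustrative rather than the ones used in the assignment.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection from the current Q-table.
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Off-policy TD target: bootstrap from the greedy action in the next state.
        target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```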
-
Offline RL - Explored imitation learning and fitted Q-learning methods to learn a
policy from previously collected data
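Behavioral cloning, the simplest of these methods, reduces offline policy learning to supervised learning on expert (state, action) pairs. Below is a rough sketch assuming a discrete action space; the dataset file names and shapes are hypothetical, not the assignment's actual data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical offline dataset of expert transitions.
expert_states = np.load("expert_states.npy")    # shape (N, state_dim)
expert_actions = np.load("expert_actions.npy")  # shape (N,), discrete action labels

# Behavioral cloning: fit a classifier mapping states to the expert's actions.
policy = LogisticRegression(max_iter=1000)
policy.fit(expert_states, expert_actions)

def act(state):
    # Cloned policy: predict the action the expert would most likely take.
    return int(policy.predict(state.reshape(1, -1))[0])
```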
-
Project - For our project, we investigated Robust Adversarial Reinforcement Learning (RARL), in which an
agent learns a policy that is robust to distribution shifts between training and testing environments. We built on the paper's methodology by using
a more recent state-of-the-art learning method (TD3) and expanding the set of testing environments. Overall, we observed behavior and
results similar to those reported in the original paper, highlighting the effectiveness of the approach. The full project report can be found here.