Reinforcement Learning in the HighwayEnv

Links to the code: social attention | behavioral cloning & implicit Q learning | state space models.

My motivation for implementing RL algorithms in the HighwayEnv stems from an interest in scenarios where observations are provided as a set of inputs, rather than as feature vectors or images. For instance, consider an aimbot system that uses bounding boxes from a YOLO object detection model: the observations arrive as an unordered set, so the policy should be invariant to permutations of its elements. This calls for permutation-invariant architectures such as Deep Sets or Social Attention. I implemented both, and, as anticipated, the attention mechanism performed better.
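To give a sense of what "permutation-invariant" means in code, here is a minimal sketch of an ego-centric attention encoder in PyTorch. The feature dimension and layer sizes are placeholders (they depend on the observation config), and this is not the exact architecture from the linked social-attention code:

```python
import torch
import torch.nn as nn

class SetAttentionEncoder(nn.Module):
    """Permutation-invariant encoder: the ego vehicle attends over the set of
    vehicle features, so reordering the other vehicles leaves the output unchanged.
    Sizes here are placeholders, not the linked implementation."""

    def __init__(self, feat_dim: int = 7, embed_dim: int = 64, n_heads: int = 2):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(feat_dim, embed_dim), nn.ReLU())
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, n_vehicles, feat_dim), with the ego vehicle in row 0
        x = self.embed(obs)
        ego = x[:, :1]                 # query: ego embedding only
        ctx, _ = self.attn(ego, x, x)  # attend over the whole set
        return ctx.squeeze(1)          # (batch, embed_dim)

# Reordering the non-ego vehicles does not change the encoding.
enc = SetAttentionEncoder()
obs = torch.randn(1, 5, 7)
shuffled = obs[:, [0, 3, 1, 4, 2]]
assert torch.allclose(enc(obs), enc(shuffled), atol=1e-5)
```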

However, I faced challenges with my implementations of DQN, SAC, and model-based planning (based on state space models with a pre-trained self-attention encoder), which did not yield the desired results. Consequently, I shifted my focus to behavioral cloning and Implicit Q-Learning (IQL reduces to BC in certain cases). Interestingly, the model was able to mimic my behavior (yes, I spent hours interacting with the environment to collect human demonstration data). For example, it picked up my preference for staying in the rightmost lane when possible and accelerating until blocked by other vehicles. You can observe this in the GIF below:
Highway Environment GIF
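As a side note on why IQL can reduce to BC: IQL extracts its policy with advantage-weighted regression, and when the inverse temperature goes to zero (or when Q ≈ V) the weights become uniform, leaving exactly the behavioral-cloning log-likelihood. A minimal sketch of that policy loss, with variable names that are mine rather than from the linked code:

```python
import torch

def iql_policy_loss(log_probs, q_values, v_values, beta=3.0):
    """Advantage-weighted regression step of IQL (sketch).
    log_probs: log pi(a|s) of the dataset actions under the current policy.
    As beta -> 0 (or when q == v) the weights are all 1 and this is exactly
    the behavioral-cloning negative log-likelihood."""
    adv = q_values - v_values
    weights = torch.exp(beta * adv).clamp(max=100.0)  # clip large weights for stability
    return -(weights.detach() * log_probs).mean()
```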

After a while, I discovered that the previous approaches likely failed due to poor reward signals for the discrete actions: the reward was consistently close to 1, except when the vehicle crashed, which is rare. Additionally, the discrete action space led to an imbalanced behavioral dataset, since the ego vehicle was IDLE most of the time. To mitigate the latter, I applied Focal Loss. The former would require collecting manual driving data with continuous actions, which would be quite challenging, so I did not continue in that direction.
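For reference, the focal-loss variant of the discrete-action cloning objective I have in mind looks roughly like this; the focusing parameter `gamma` and the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, actions, gamma=2.0):
    """Focal loss over discrete actions (sketch): down-weights the abundant,
    easily-predicted IDLE samples so the rarer lane changes and accelerations
    contribute more to the gradient. gamma is the focusing parameter."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, actions.unsqueeze(1)).squeeze(1)  # log-prob of the taken action
    pt = log_pt.exp()
    return -(((1.0 - pt) ** gamma) * log_pt).mean()
```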

I also used the same IQL implementation, combined with reward shaping, to explore learning-based navigation in an OpenCV-based simulator (built for the RoboMaster University League robotics competition) wrapped as a Gym environment. The observation is the localization result from a particle filter (also implemented by the team I lead). While the code isn't publicly available, you can see how it performs below:
IQL GIF
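Since the simulator and localization code aren't public, the following is only a hypothetical sketch of how such an OpenCV-based simulator could be wrapped as a Gym(nasium) environment, with the particle-filter estimate used as the observation. The `sim` and `pf` objects and every method called on them are invented for illustration, not the team's actual API:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class RoboMasterNavEnv(gym.Env):
    """Hypothetical wrapper: `sim` and `pf` stand in for the team's OpenCV
    simulator and particle filter, whose interfaces are invented here."""

    def __init__(self, sim, pf):
        super().__init__()
        self.sim, self.pf = sim, pf
        self.action_space = spaces.Discrete(5)  # e.g. stop + four move directions
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(3,))  # (x, y, yaw) estimate

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.sim.reset()
        self.pf.reset()
        return self._obs(), {}

    def step(self, action):
        self.sim.apply(action)                  # advance the OpenCV simulation
        self.pf.update(self.sim.sensor_scan())  # localize from sensors, not ground truth
        reward, terminated = self.sim.reward_and_done()
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        return np.asarray(self.pf.estimate(), dtype=np.float32)
```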

The shaded robot represents the state estimate from the particle filter. The shaping function \(F(s, a, s') = \gamma \, \phi(s') - \phi(s)\) is based on the potential function \[\phi(s) = -\left(\alpha \, \text{dist}(s, g) + \beta \, \text{repulsion}(s)\right)\] where the coefficients \(\alpha, \beta > 0\) are hyperparameters, \(\text{dist}(s, g)\) is the distance from the current state \(s\) to the goal \(g\), and \(\text{repulsion}(s)\) is the value of the repulsion at \(s\). Specifically, I defined \(\text{repulsion}(s) = c^{-d/l}\), where \(c, l > 0\) are hyperparameters and \(d\) is the minimum distance from \(s\) to any barrier in the environment.
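In code, the shaping term could look like the sketch below; the coefficient values are arbitrary examples, and states and barriers are treated as 2-D points for simplicity:

```python
import math

def potential(s, goal, barriers, alpha=1.0, beta=0.5, c=10.0, l=0.2):
    """phi(s) = -(alpha * dist(s, g) + beta * repulsion(s)),
    with repulsion(s) = c ** (-d / l) and d the minimum distance to a barrier.
    The coefficient values here are illustrative only."""
    d = min(math.dist(s, b) for b in barriers)
    return -(alpha * math.dist(s, goal) + beta * c ** (-d / l))

def shaping_reward(s, s_next, goal, barriers, gamma=0.99):
    """F(s, a, s') = gamma * phi(s') - phi(s), added on top of the task reward."""
    return gamma * potential(s_next, goal, barriers) - potential(s, goal, barriers)
```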