Proximal Policy Optimization (PPO) is a popular on-policy reinforcement learning algorithm, but it is used substantially less often than off-policy algorithms in multi-agent settings. This is largely because on-policy methods are considered less sample efficient than their off-policy counterparts in multi-agent problems. MAPPO is a variant of PPO specialized for cooperative multi-agent tasks, in which a group of agents tries to maximize a shared reward. Execution is decentralized, and each agent acts only on locally available information; in StarCraft II, for example, an agent observes only the enemies within its own sight range. Like PPO, MAPPO trains two neural networks: a policy network (called an actor) πθ, which selects actions, and a value-function network (called a critic) Vϕ, which estimates the quality of a state. MAPPO is a policy-gradient algorithm and therefore updates πθ using gradient ascent on the objective function.
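The actor-critic setup can be sketched minimally as follows. This is an illustrative stand-in, not MAPPO's actual architecture: the class names are hypothetical, and real implementations use neural networks rather than the fixed logits and toy linear critic shown here.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution over actions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class Actor:
    """Stand-in for the policy network pi_theta: maps a local observation
    to an action distribution, from which an action is sampled."""
    def __init__(self, num_actions):
        self.logits = [0.0] * num_actions  # placeholder for learned parameters

    def action_probs(self, obs):
        # A real actor would condition on obs; this toy version does not.
        return softmax(self.logits)

    def act(self, obs):
        probs = self.action_probs(obs)
        return random.choices(range(len(probs)), weights=probs)[0]

class Critic:
    """Stand-in for the value network V_phi: scores a state with a toy
    linear function; MAPPO trains this by regression on returns."""
    def __init__(self, state_dim):
        self.weights = [0.0] * state_dim   # placeholder for learned parameters

    def value(self, state):
        return sum(w * s for w, s in zip(self.weights, state))
```

At execution time only the actor is needed; the critic is used solely to compute the training targets.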
Implementation Guidelines For Practical Performance of MAPPO
Training Data Usage
It is standard for PPO to perform many epochs of updates on each batch of training data using mini-batch gradient descent. In single-agent settings, data is commonly reused over tens of training epochs, with many mini-batches per epoch. Such high data reuse, however, is harmful in multi-agent settings, so using 15 training epochs for easy tasks, and 10 or 5 epochs for more difficult tasks, is recommended. The number of training epochs also regulates the non-stationarity challenge in MARL: more training epochs produce larger changes to the agents’ policies, which increases non-stationarity.
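The epoch/mini-batch reuse pattern can be sketched as below. The function and parameter names are illustrative; `update_fn` stands in for whatever optimizer step a real implementation applies to a mini-batch.

```python
import random

def ppo_update(rollout, update_fn, num_epochs=5, num_minibatches=4):
    """Run several epochs of mini-batch updates over one rollout buffer.

    `rollout` is a list of transitions; `update_fn` applies one gradient
    step to a mini-batch (a hypothetical placeholder for the real
    optimizer). Using fewer epochs (e.g. 5-10 on harder tasks) limits how
    much the policy changes per rollout, easing non-stationarity in MARL.
    """
    indices = list(range(len(rollout)))
    for _ in range(num_epochs):
        random.shuffle(indices)  # fresh shuffle of the data each epoch
        batch_size = len(indices) // num_minibatches
        for start in range(0, len(indices), batch_size):
            minibatch = [rollout[i] for i in indices[start:start + batch_size]]
            update_fn(minibatch)
```

Each transition is thus reused `num_epochs` times, which is exactly the data-reuse knob the recommendation above is tuning.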
A key attribute of PPO is the use of clipping in the policy and value objectives. Clipping prevents the policy and value functions from changing drastically between iterations and helps stabilize training. The strength of clipping is regulated by the ϵ hyperparameter: a larger ϵ allows larger changes to the policy and value function. Like limiting the number of training epochs, clipping can help manage the non-stationarity problem, since smaller ϵ values keep agents’ policies from changing drastically.
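The clipped surrogate objective for a single sample can be written as a few lines. This is the standard PPO clipping formula; the function name is illustrative.

```python
def clipped_policy_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective for one sample.

    ratio = pi_new(a|s) / pi_old(a|s). Clipping the ratio to
    [1 - eps, 1 + eps] bounds how far one update can move the policy;
    a smaller eps means smaller per-update policy changes, which reduces
    non-stationarity in multi-agent training. Returns the surrogate
    value to be maximized (in practice its negation is minimized).
    """
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

Note that the `min` makes the bound pessimistic: the objective never rewards moving the ratio outside the clip range, whether the advantage is positive or negative.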
Reward scales can vary hugely across environments, and large reward scales can undermine value learning. The use of value normalization to normalize the regression targets into a range between 0 and 1 during value learning is therefore recommended; in practice, value normalization is found to consistently help MAPPO’s performance.
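One simple realization of value normalization is to standardize regression targets with running statistics, as sketched below. The class name and API are illustrative, and this running mean/variance scheme is only one way to implement the idea, not necessarily the exact normalizer MAPPO uses.

```python
class ValueNormalizer:
    """Running-statistics normalizer for value-function regression targets.

    Keeps a running mean and variance of observed returns (Welford's
    online algorithm) so the critic can regress on standardized targets
    regardless of the environment's raw reward scale. Predictions are
    denormalized back to the original scale when computing advantages.
    """
    def __init__(self, eps=1e-8):
        self.count = eps   # small epsilon avoids division by zero
        self.mean = 0.0
        self.m2 = 0.0      # running sum of squared deviations

    def update(self, x):
        """Fold one new return value into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / self.count
        return (x - self.mean) / (var + 1e-8) ** 0.5

    def denormalize(self, x):
        var = self.m2 / self.count
        return x * (var + 1e-8) ** 0.5 + self.mean
```

The critic is trained against `normalize(return)`, and `denormalize` recovers value estimates on the environment's original reward scale.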
Centralized Training With Decentralized Execution
Since the value function is used only during training updates and is not needed to select actions, it can make use of global information to produce more accurate estimates. This practice, also used in other multi-agent policy-gradient methods, is referred to as centralized training with decentralized execution. Incorporating both local and global information in the value function’s input is most beneficial, while excluding important local information can be highly deleterious.
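Under centralized training with decentralized execution, one common way to build the critic's input is to concatenate local and global information, as in the sketch below. The function name is hypothetical, and concatenation plus a one-hot agent ID is just one reasonable design choice.

```python
def centralized_critic_input(local_obs, global_state, agent_id, num_agents):
    """Build one agent's critic input for centralized training.

    The centralized critic may condition on the full global state plus the
    agent's own local observation; a one-hot agent ID lets a shared critic
    distinguish which agent it is evaluating. The actor, by contrast, sees
    only local_obs at execution time.
    """
    one_hot_id = [1.0 if i == agent_id else 0.0 for i in range(num_agents)]
    return list(local_obs) + list(global_state) + one_hot_id
```

Keeping `local_obs` in the concatenation matters: as noted above, dropping important local information from the critic's input can be highly deleterious.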
Unlike in single-agent settings, individual agents can die or become inactive before an episode concludes; this is especially common in SMAC. When an agent is inactive, replacing its global state with a zero vector concatenated with the agent’s ID, called a death mask, is more effective. Using a death mask enables the value function to more accurately represent states in which the agent is inactive.
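Death masking can be sketched in a few lines. The function name is illustrative; the essential point is that a dead agent's state input becomes all zeros while its ID is preserved.

```python
def critic_state_with_death_mask(global_state, agent_id, num_agents, alive):
    """Build the critic's state input, applying a death mask if needed.

    While the agent is alive, the critic sees the global state. Once the
    agent is dead or inactive, the state is replaced by a zero vector of
    the same size ("death mask"); the one-hot agent ID is kept either way
    so a shared critic still knows which agent it is evaluating.
    """
    state = list(global_state) if alive else [0.0] * len(global_state)
    one_hot_id = [1.0 if i == agent_id else 0.0 for i in range(num_agents)]
    return state + one_hot_id
```

This gives the value function a single, consistent representation for every "agent is dead" situation instead of many arbitrary post-death global states.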