Abstract
Firewalls remain foundational to cybersecurity, yet their traditional perimeter-based role is challenged by the dynamic nature of modern zero-trust and virtualised networks. In these environments, virtual firewalls (software-defined security functions deployed within service graphs) provide flexible, fine-grained control over traffic flows. However, their scalability and performance are often constrained by sub-optimal placement and rule configuration, especially in large or rapidly evolving topologies. This research introduces the Reinforcement Learning–based Optimised Firewall Placement and Configuration (RL_(O)FPC) model, which addresses these challenges through two cooperating reinforcement learning agents. The FRC-Agent manages path computation and rule enforcement to satisfy hard security constraints. The FPO-Agent determines firewall locations that minimise the number of deployed firewalls and rule instances while keeping firewalls close to critical network components. The model is evaluated against the state-of-the-art VEREFOO framework, using both its Maximum Flow (MF) and Atomic Predicate (AP) algorithms, across 120 synthetic topologies. Results demonstrate that RL_(O)FPC achieves up to 97.6% accuracy in Network Security Requirement (NSR) satisfaction and improves runtime efficiency by up to 27% over VEREFOO in high-NSR environments. However, as the number of Allocation Points increases, the model’s exploration overhead grows, and in some cases RL_(O)FPC scales less well than VEREFOO. Despite this, RL_(O)FPC consistently adapts better to topology modifications, because it applies localised Q-learning updates rather than a full recomputation, confirming its suitability for dynamic, high-assurance network environments.