.. highlight:: cpp

++++++++++++++++++++
Design Documentation
++++++++++++++++++++

Overview
--------

Basic Components
****************

The goal of this module is to support the integration of reinforcement learning (RL) components into network scenarios to simulate their deployment and the communication between them.

Typical RL tasks include agents, actions, observations and rewards as their main components. In a network, these components are often placed on different nodes. For example, collecting observations and training an agent often happen at different locations in the network. To associate these RL components with :code:`Node`\ s, the abstraction of user applications is used. The following applications inherit from a general :code:`RlApplication`:

* :code:`ObservationApplication`: observes part of the network state and communicates the collected data (i.e. observations or data used to calculate observations) to one or more agents
* :code:`RewardApplication`: collects data to calculate a reward and communicates it to one or more agents
* :code:`AgentApplication`: represents the training and/or inference agent in the network
* :code:`ActionApplication`: executes an action that was inferred by an agent and thereby changes a part of the network state

.. _fig-rlapplication-overview:
.. figure:: figures/rlapplication-overview.*
   :align: center

   Basic interaction of :code:`RlApplication`\ s

A commonly used standard for implementing RL environments is the Gymnasium standard [Gymnasium]_, which is based on Python. RLLib (Ray) [RLLib]_ is an extensive Python library for RL that uses this standard as its interface for single-agent training. As *ns-3* is implemented in C++, a connection to the mainly Python-based RL frameworks needs to be established. This module uses *ns3-ai* [ns3-ai]_ for the inter-process communication.

Design Criteria
***************

This module is designed for use cases such as the following:

* Simulation of communication overhead between RL components
* Simulating how calculation and/or communication delays influence the performance of an RL approach via configurable delays
* Testing and evaluating tradeoffs between different RL deployments, e.g., distributed deployment on several nodes vs. centralized deployment on a single node

.. _fig-complex-scenario:
.. figure:: figures/complex-scenario.*
   :align: center

   Example scenario setup that should be supported by the framework

To make these generalized use cases possible, the following main requirements have been considered:

#. Support integration with existing *ns-3* scenarios with as few assumptions about the scenario as possible (even complex scenarios such as :ref:`fig-complex-scenario` should be supported)
#. Support single-agent and multi-agent reinforcement learning (MARL)
#. Support communication between RL components via simulated network traffic

Class diagram
*************

The following class diagram includes all classes provided by DEFIANCE, together with particularly important member variables and class methods.

.. _fig-class-diagram:
.. figure:: figures/defiance-classes.*
   :align: center
Customization
-------------

This module provides a framework to simulate different RL components via different :code:`RlApplication`\ s. The main tasks that the framework performs for the user in order to make it easy to use are the following:

* provide frameworks for prototypical :code:`RlApplication`\ s,
* provide helper functionality to support the creation of :code:`RlApplication`\ s and their installation on :code:`Node`\ s,
* enable typical communication between :code:`RlApplication`\ s, and
* handle the interaction between :code:`RlApplication`\ s and the Python-based training/inference processes in compliance with the typical RL workflow.

In addition to these tasks performed by the framework, some aspects of the :code:`RlApplication`\ s strongly depend on the specific RL task and solution approach that is to be implemented. Therefore, custom code provided by the user of the framework has to be integrated into the :code:`RlApplication`\ s. Typically, this mainly concerns the following aspects of :code:`RlApplication`\ s:

* Data collection: How exactly are observations and rewards collected/calculated?
* Communication between :code:`RlApplication`\ s: When and to whom are messages sent?
* Behavior of agents: At what frequency does the agent step? What triggers a step?
* Execution of actions: What happens exactly when a specific action occurs?

A typical example of necessary customization is an :code:`ObservationApplication` which should be registered at a specific *ns-3* trace source to provide it with the necessary data. The corresponding trace source and its signature have to be configurable as they depend on the specific scenario. Additionally, it should be configurable to which :code:`AgentApplication`\ s the collected data is sent.

One option to solve this task is callbacks: The user could create functions outside the according :code:`RlApplication` with a distinct interface. Those could then be registered as callbacks in the according :code:`RlApplication`. Whenever user-specific code is required, the :code:`RlApplication` would then call these callbacks. Similarly, the :code:`RlApplication` could provide a method with a distinct interface. The user then has to register this method at a trace source to provide the :code:`RlApplication` with data. This option is not very flexible since all function signatures have to be fixed and known already when the :code:`RlApplication` class is designed. Another drawback of this approach is that there is no defined location for the custom code of an :code:`RlApplication`.

Therefore, an approach using inheritance was chosen: The :code:`RlApplication`\ s are designed as abstract classes from which the user has to inherit in order to add the scenario-specific code. This has the advantage that all code connected to an :code:`RlApplication` is collected in a single class. Additionally, it guarantees that all necessary methods are implemented, and usable defaults can be provided for methods that may be customized.
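As an illustration of this inheritance approach, the following sketch shows a custom :code:`ObservationApplication` that observes a queue length. The trace source path, the :code:`Observe` signature, the container key and the exact :code:`Send` overload are assumptions for illustration; only the override points correspond to the design described here and in the following sections.

.. code-block:: cpp

   class QueueObservationApp : public ObservationApplication
   {
     public:
       // Custom method whose signature matches the observed trace source
       // (trace source and signature are hypothetical).
       void Observe(uint32_t oldValue, uint32_t newValue)
       {
           // Wrap the collected value in the OpenGym containers used for
           // transmission; the key "queueLength" is an arbitrary choice.
           Ptr<OpenGymBoxContainer<uint32_t>> box =
               Create<OpenGymBoxContainer<uint32_t>>(std::vector<uint32_t>{1});
           box->AddValue(newValue);
           Ptr<OpenGymDictContainer> data = Create<OpenGymDictContainer>();
           data->Add("queueLength", box);
           Send(data); // send to the connected AgentApplications
       }

       // Override point called by the framework before the simulation starts.
       void RegisterCallbacks() override
       {
           Config::ConnectWithoutContext(
               "/NodeList/0/DeviceList/0/TxQueue/PacketsInQueue",
               MakeCallback(&QueueObservationApp::Observe, this));
       }
   };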
ChannelInterface
----------------

This framework is supposed to allow communication between :code:`RlApplication`\ s in a custom scenario. Therefore, it is the task of the framework user to set up the scenario and the communication channels between :code:`Node`\ s. This implies that the user has to provide the framework with an abstraction of a pre-configured channel over which data can be sent. Intuitively, this would be sockets. Nevertheless, the framework should spare the user the overhead of creating sockets. That is why the framework only requires the user to provide IP addresses and the protocol type. Using this data, sockets can be created and connected to each other.

:code:`RlApplication`\ s should handle the interfaces of their communication channels transparently, i.e. independent of the protocol type. Additionally, direct communication without simulated network traffic should be possible. To this end, the :code:`ChannelInterface` class was introduced as a generalized interface used in :code:`RlApplication`\ s. It is subclassed by the :code:`SocketChannelInterface` class, which is responsible for creating sockets when provided with the necessary information (IP addresses and protocol type). The :code:`SimpleChannelInterface` provides the :code:`RlApplication`\ s with the same interface while maintaining a direct reference to another :code:`SimpleChannelInterface` to allow communication with a fixed delay (which might also be 0).

.. _fig-channel-interfaces:
.. figure:: figures/channel-interfaces.*
   :align: center

   Communication via :code:`SimpleChannelInterface` and :code:`SocketChannelInterface`

It should be noted that the framework should support multiple connections over :code:`ChannelInterface`\ s between a single pair of :code:`RlApplication`\ s to allow using different communication channels. Simulating communication between :code:`RlApplication`\ s over simulated network channels includes the chance that a channel is broken and that therefore no communication is possible. This has to be handled by the underlying protocols or by the user of the framework, since the user is responsible for the whole setup and configuration of the concrete network scenario.
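As a minimal sketch of the two variants, the following shows how a pair of :code:`ChannelInterface`\ s might be wired up. The method names (:code:`Connect`, :code:`SetPropagationDelay`, :code:`AddRecvCallback`), the construction arguments, and the placeholder variables (:code:`nodeC`, :code:`ipC`, ...) are assumptions for illustration; in practice the :code:`CommunicationHelper` (see :ref:`sec-helper`) performs these steps.

.. code-block:: cpp

   // Direct communication without simulated network traffic:
   auto ifA = CreateObject<SimpleChannelInterface>();
   auto ifB = CreateObject<SimpleChannelInterface>();
   ifA->Connect(ifB);                         // link the two interfaces directly
   ifA->SetPropagationDelay(MilliSeconds(2)); // fixed delay, may also be 0

   // Communication over simulated network traffic; the framework creates
   // and connects the sockets from IP addresses and the protocol type:
   auto ifC = CreateObject<SocketChannelInterface>(nodeC, ipC,
                                                   TcpSocketFactory::GetTypeId());
   auto ifD = CreateObject<SocketChannelInterface>(nodeD, ipD,
                                                   TcpSocketFactory::GetTypeId());
   ifC->Connect(ifD);

   // Both variants offer the same interface to the RlApplications:
   ifA->AddRecvCallback(MakeCallback(&MyApp::HandleMessage, &myApp));
   ifA->Send(dictContainer); // an OpenGymDictContainer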
Design of RlApplications
------------------------

RlApplication
*************

The :code:`RlApplication` generalizes functionality that is shared among all applications provided by this module. This includes IDs to identify a specific :code:`RlApplication` as well as functionality to send data and to handle :code:`ChannelInterface`\ s. In this way, a generalized interface for all possible RL applications is established, which can be used by all classes handling all kinds of RL applications, like the :code:`CommunicationHelper` introduced in :ref:`sec-helper`.

In theory, multiple :code:`RlApplication`\ s of the same type can be installed on the same :code:`Node`. Nevertheless, this has not been tested yet since in most cases tasks of the same type (e.g. collecting observations) do not have to be separated into different applications when performed on the same :code:`Node`.

AgentApplication
****************

Basic Concept
=============

The :code:`AgentApplication` represents an RL agent (which is trained with e.g. RLLib) within the network. It has a scenario-specific observation and action space. Currently, the framework is tested only with fixed observation and action spaces (and not with parametric action spaces).

Interaction with other RlApplications
=====================================

The :code:`AgentApplication` may receive observations and rewards from one or multiple :code:`ObservationApplication`\ s resp. :code:`RewardApplication`\ s. To support as many use cases as possible, it is also possible to receive arbitrary data from :code:`ObservationApplication`\ s resp. :code:`RewardApplication`\ s, which is not immediately used as observations or rewards but from which observations and rewards are derived by custom calculations. Therefore, the data transmitted from :code:`ObservationApplication`\ s to :code:`AgentApplication`\ s (which is called observation in the following) does not necessarily fit into the observation space of the agent.

Likewise, an :code:`AgentApplication` can send actions (or any data derived from its actions) to one or multiple :code:`ActionApplication`\ s. In addition to the common RL interactions, this framework also supports transmitting arbitrary messages between :code:`AgentApplication`\ s. This provides users of this framework with the chance to implement a protocol for agent communication. Furthermore, it is the basis for exchanging model updates or policies between agents.

Interaction with Python-based learning process
==============================================

The :code:`AgentApplication` is intended to interact with the Python-based training/inference processes over the :code:`OpenGymMultiAgentInterface`. This is primarily done by the :code:`AgentApplication::InferAction` method(s), which call(s) :code:`OpenGymMultiAgentInterface::NotifyCurrentState`. This interaction can happen timer-based (i.e. in fixed time intervals) or event-based (e.g. depending on how many observations were received). To always have access to the current observation and reward, which shall be sent to the Python side, the :code:`AgentApplication` stores an :code:`m_observation` and an :code:`m_reward` object.

Receiving, storing and calculating observations resp. rewards
=============================================================

To allow the :code:`AgentApplication` to arbitrarily calculate observations and rewards based on the messages received from :code:`ObservationApplication`\ s and :code:`RewardApplication`\ s, these received messages have to be stored in the :code:`AgentApplication`. For this purpose, a new data structure called :code:`HistoryContainer` was designed. Each :code:`AgentApplication` maintains one :code:`HistoryContainer` for observations (:code:`m_obsDataStruct`) and one for rewards (:code:`m_rewardDataStruct`). :code:`m_obsDataStruct` stores one deque for each connected :code:`ObservationApplication`, in which the newest :code:`m_maxObservationHistoryLength` observations received from this :code:`ObservationApplication` are stored. Additionally, :code:`m_obsDataStruct` contains another deque, which stores the newest observations received independent of the :code:`ObservationApplication`. :code:`m_rewardDataStruct` is used equivalently. In this way, the user can specify how much observation and reward data is stored in the :code:`AgentApplication` and use it arbitrarily.

Besides storing the received data, it is necessary to inform the :code:`AgentApplication` when an observation or a reward is received. The user can then specify the behavior of the :code:`AgentApplication` in response to such a message. For example, the :code:`AgentApplication` could wait for 10 observations before inferring the next action. This is done by registering the abstract methods :code:`AgentApplication::OnRecvObs` and :code:`AgentApplication::OnRecvReward` at the according :code:`ChannelInterface`\ s.

This framework is intended to make communication between RL components more realistic. Nevertheless, it shall still support using global knowledge (e.g. knowledge available on other :code:`Node`\ s) to calculate rewards and observations. In particular, global knowledge can be helpful to calculate rewards during offline training. If such global knowledge (i.e. data available without delay or communication overhead) shall be used, it can simply be accessed when rewards and/or observations are calculated within the :code:`AgentApplication`, or the data can be transmitted via :code:`SimpleChannelInterface`\ s.
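A hedged sketch of such event-based stepping: infer a new action only after ten observations have arrived. :code:`OnRecvObs`, :code:`m_obsDataStruct`, :code:`m_observation` and :code:`InferAction` are the members described above; the parameter naming and the :code:`CalculateObservation` helper are assumptions for illustration.

.. code-block:: cpp

   class MyAgentApp : public AgentApplication
   {
     protected:
       // Called whenever an observation arrives on one of the registered
       // ChannelInterfaces (the parameter naming is an assumption).
       void OnRecvObs(uint32_t remoteAppId) override
       {
           if (++m_obsSinceLastStep >= 10)
           {
               // Derive the current observation from the stored history;
               // CalculateObservation is a hypothetical user-defined helper
               // that reads from m_obsDataStruct.
               m_observation = CalculateObservation(m_obsDataStruct);
               m_obsSinceLastStep = 0;
               InferAction(); // triggers an environment step
           }
       }

     private:
       uint32_t m_obsSinceLastStep{0};
   };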
Execution of actions
====================

After the :code:`AgentApplication` has called :code:`OpenGymMultiAgentInterface::NotifyCurrentState`, it receives an action via :code:`AgentApplication::InitiateAction` from the Python side. To simulate the computation delay of the agent, an :code:`actionDelay` can be configured in :code:`OpenGymMultiAgentInterface::NotifyCurrentState`. The :code:`OpenGymMultiAgentInterface` then delays calling :code:`AgentApplication::InitiateAction` by the configured :code:`actionDelay`.

By default, :code:`AgentApplication::InitiateAction` sends the received action to all connected :code:`ActionApplication`\ s. Because data is transmitted between :code:`RlApplication`\ s via :code:`OpenGymDictContainer`\ s, the received action is wrapped into such a container under the key "default". This method is intended to be overridden if different behavior is needed. In this way, the action can for example be divided into partial actions that are sent to different :code:`ActionApplication`\ s. Alternatively, a part of the action could specify to which :code:`ActionApplication`\ s the action shall be sent.
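As an example, the following hedged sketch overrides :code:`AgentApplication::InitiateAction` to divide the inferred action into partial actions for two different :code:`ActionApplication`\ s. The method signature, the :code:`ExtractPart` helper, and the :code:`Send` overload addressing a specific application are assumptions for illustration.

.. code-block:: cpp

   class SplitAgentApp : public AgentApplication
   {
     protected:
       // Replace the default behavior (forwarding the whole action to all
       // connected ActionApplications) with a scenario-specific split.
       void InitiateAction(Ptr<OpenGymDataContainer> action) override
       {
           // ExtractPart is a hypothetical user-defined helper that wraps a
           // slice of the action into an OpenGymDictContainer under "default".
           Ptr<OpenGymDictContainer> partA = ExtractPart(action, 0);
           Ptr<OpenGymDictContainer> partB = ExtractPart(action, 1);
           // Send each partial action to a different ActionApplication.
           Send(partA, /*actionAppId=*/0);
           Send(partB, /*actionAppId=*/1);
       }
   };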
Inference agents vs. training agents
====================================

In many RL tasks, different agents perform inference and training. Therefore, one could provide different :code:`AgentApplication` classes for these two purposes. Nevertheless, a general :code:`AgentApplication` class which can perform both inference and training is also necessary to support e.g. online training. Consequently, the :code:`AgentApplication`\ s used only for inference or training would merely be specializations of this class that provide less functionality. That is why it was decided to leave it to the user to use only the functionality which is needed in the current use case. When it is necessary to differentiate between inference and training agents, this can be done e.g. by a flag introduced in a user-defined inherited :code:`RlApplication`.

DataCollectorApplication
************************

The :code:`DataCollectorApplication` is the base class which is inherited by :code:`ObservationApplication` and :code:`RewardApplication`, since both provide similar functionality: They collect scenario-specific data, maintain :code:`ChannelInterface`\ s connected to :code:`AgentApplication`\ s, and provide functionality to send over these interfaces. To register the applications at scenario-specific trace sources, the user has to define a custom :code:`ObservationApplication::Observe` resp. :code:`RewardApplication::Reward` method with a custom signature within the custom :code:`ObservationApplication` resp. :code:`RewardApplication`. To provide a place to connect this custom method with an existing trace source, the abstract :code:`DataCollectorApplication::RegisterCallbacks` method was created. If necessary, the user may also register multiple custom :code:`ObservationApplication::Observe` resp. :code:`RewardApplication::Reward` methods within :code:`DataCollectorApplication::RegisterCallbacks`. To ensure that the callbacks are registered before the simulation starts, :code:`DataCollectorApplication::RegisterCallbacks` is called in the :code:`DataCollectorApplication::Setup` method. Each :code:`ObservationApplication` resp. :code:`RewardApplication` can send observations resp. rewards to one or multiple :code:`AgentApplication`\ s in order not to limit possible scenarios.

ActionApplication
*****************

The :code:`ActionApplication` provides functionality to maintain :code:`ChannelInterface`\ s which are connected to :code:`AgentApplication`\ s and to receive actions (in the form of :code:`OpenGymDictContainer`\ s). The abstract method :code:`ActionApplication::ExecuteAction` is designed to provide a place for the user-specific code that handles the different actions. This method is automatically called when data is received on the registered :code:`ChannelInterface`\ s. Therefore, it is connected to the according callbacks within the :code:`ActionApplication::AddAgentInterface` method.
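A minimal sketch of a custom :code:`ActionApplication`. The signature of :code:`ExecuteAction`, the container key, and the reconfigured attribute are assumptions for illustration; the method body is where the scenario-specific state change happens.

.. code-block:: cpp

   class TxPowerActionApp : public ActionApplication
   {
     public:
       // Called automatically whenever an action arrives on a registered
       // ChannelInterface (the signature is an assumption).
       void ExecuteAction(uint32_t remoteAppId, Ptr<OpenGymDictContainer> action) override
       {
           // Unpack the action sent by the agent under the framework's
           // default key and apply it to the network state, e.g. by
           // reconfiguring a device attribute (hypothetical example).
           auto box = DynamicCast<OpenGymBoxContainer<float>>(action->Get("default"));
           double txPower = box->GetValue(0);
           m_phy->SetAttribute("TxPowerStart", DoubleValue(txPower));
       }

     private:
       Ptr<WifiPhy> m_phy; // device to reconfigure (set up elsewhere)
   };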
General Decisions
*****************

All :code:`RlApplication`\ s have to store multiple :code:`ChannelInterface`\ s that connect them to other :code:`RlApplication`\ s. Typically, all :code:`ChannelInterface`\ s connected to a specific remote :code:`RlApplication` are used together. Furthermore, multiple :code:`ChannelInterface`\ s between a pair of :code:`RlApplication`\ s have to be supported to enable communication over different channels. Therefore, InterfaceMaps were introduced, which are essentially two-dimensional maps. The outer map is unordered and maps :code:`applicationId`\ s to a second, ordered map. The second map maps an ID to a :code:`ChannelInterface`. This ID is unique within this map of :code:`ChannelInterface`\ s connected to a specific :code:`RlApplication`. To ensure this uniqueness, the entries are stored in ascending order of the IDs. In this way, one can simply use the last entry to generate a new unique ID.

Connecting two :code:`RlApplication`\ s over multiple :code:`ChannelInterface`\ s is an edge case. Therefore, all :code:`RlApplication::Send` methods are implemented with signatures that allow sending to a specific :code:`RlApplication`. Nevertheless, storing :code:`ChannelInterface`\ s with IDs makes it possible to also provide methods to send over a certain :code:`ChannelInterface`.

We did not consider that during inference the agent might not be able to compute another action. In reality, the computation either needs to be queued ("single-threaded") or processed in parallel ("multi-threaded"). The latter case differs from the current implementation because the individual inference times increase with increased parallelism. For a detailed discussion of how to extend the framework with this feature, see :ref:`sec-framework-expansion`.

In complex scenarios with many :code:`ObservationApplication`\ s and :code:`AgentApplication`\ s, each :code:`ObservationApplication` should possibly be able to communicate with each :code:`AgentApplication`. In this case, it is not practicable to configure all communication connections before the simulation starts. Therefore, it is necessary to support dynamically adding and removing :code:`ChannelInterface`\ s during simulation time, which is done by the :code:`RlApplication::AddInterface` and :code:`RlApplication::DeleteInterface` methods.

In some cases, one has to configure something within an :code:`RlApplication` based on the attributes which were set, but before the application is started. One example for this is the initialization of data structures with a scenario-dependent length. To provide a central place for such initialization functionality which cannot be placed in the constructor, the :code:`RlApplication::Setup` method was created.

Interface for Multi-Agent RL
----------------------------

Gymnasium is a commonly used environment interface for single-agent training, which is also supported by *ns3-ai* [ns3-ai]_. For multi-agent training, Ray implemented the MultiAgentEnv API [MultiAgentEnv]_. Besides this API, there is also the PettingZoo API [Pettingzoo]_ proposed by the Farama Foundation. Besides the Agent Environment Cycle (AEC) API, which is the main API of PettingZoo, a Parallel API exists as well. For both APIs, RLLib provides a wrapper to make them compatible with the MultiAgentEnv [PettingzooWrapper]_. Since this framework is intended to support multi-agent RL, it had to be decided which API to use. For the chosen API, the *ns3-ai* interface then had to be extended to support multi-agent RL.

The basic idea of the AEC [AEC]_ is that agents step sequentially and not in parallel. This restriction is intended to create a more understandable and less error-prone model, for example to protect developers from race conditions. To decide on an API, the following aspects were considered:

* The AEC API is a subset of the MultiAgentEnv API, meaning that everything implemented with the AEC API is representable with MultiAgentEnv. Using the AEC API would therefore add no functionality, but could be less error-prone because of its restrictions.
* For every step of an agent, observations and rewards have to be transferred from C++ to Python and an action back from Python to C++. To avoid difficulties with synchronizing agents, the simplest model is sequentially stepping agents. If agents should step simultaneously, this can then be simulated by not advancing the simulation time between their steps.
* Including the AEC API when training with RLLib means including a further dependency, and the environment would have to be wrapped into a MultiAgentEnv.
* According to [PettingzooWrapper]_, AEC expects agents to work in a cooperative manner. However, this framework should also support conflicting agents.
* The documentation of RLLib is not as comprehensive as it should be in some places. Nevertheless, there are many code examples for RLLib available online.

For these reasons, it was decided to use the MultiAgentEnv API instead of the PettingZoo API, but to apply the restriction of sequentially stepping agents when expanding *ns3-ai*. This framework should support both single-agent and multi-agent RL. To provide a uniform interface without code duplication, this framework handles single-agent RL as a special case of multi-agent RL.

.. _fig-multiagent-interface:
.. figure:: figures/multiagent-interface.*
   :align: center

   Interaction between *ns-3* simulation (C++) and :code:`Ns3MultiAgentEnv` (Python)

Communication between the Python-based training process and the simulation in C++ works over the :code:`Ns3MultiAgentEnv` (in Python) and the :code:`OpenGymMultiAgentInterface` (in C++), which were added to *ns3-ai*. The training/inference process is initiated by the Python side using :code:`Ns3MultiAgentEnv`. The Python process starts the *ns-3* simulation process (implemented in C++) as a subprocess and waits to receive observations and rewards from the C++ process. Whenever an agent decides to step (via the :code:`AgentApplication::InferAction` method), the C++ process running the *ns-3* simulation switches back to the Python process via the :code:`OpenGymMultiAgentInterface::NotifyCurrentState` method with the observation and the reward of the according agent. The Python process answers with an action for this agent. Only then is the simulation resumed and the callback registered in :code:`OpenGymMultiAgentInterface::NotifyCurrentState` called with the action. Note the one-to-one relation between environment steps and calls to :code:`AgentApplication::InferAction`: if the simulation does not call :code:`AgentApplication::InferAction`, the environment won't step.
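The following sketch shows what this handover might look like from within an :code:`AgentApplication`. The exact parameter list of :code:`NotifyCurrentState` and the singleton-style :code:`Get` accessor are assumptions based on the behavior described above.

.. code-block:: cpp

   // Report the current state of this agent to the Python side and register
   // the callback that receives the inferred action. The simulation pauses
   // until Python answers; the callback is then invoked after actionDelay.
   OpenGymMultiAgentInterface::Get()->NotifyCurrentState(
       GetId(),          // identifies this agent on the Python side
       m_observation,
       m_reward,
       /*terminated=*/false,
       MilliSeconds(5),  // actionDelay: simulated computation delay of the agent
       MakeCallback(&MyAgentApp::InitiateAction, this));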
.. _sec-helper:

Helper
------

In a typical use case, this framework has to be integrated into an existing *ns-3* scenario. In *ns-3*, the concept of helpers is commonly used to simplify the configuration and setup tasks the user has to perform. In *ns-3.42*, an :code:`ApplicationHelper` was introduced, which is used to create and install applications of a specified type on :code:`Node`\ s. To avoid repeated casts, which would lead to very cluttered code, this framework introduces an :code:`RlApplicationHelper`, which returns :code:`RlApplicationContainer`\ s instead of :code:`ApplicationContainer`\ s.

The main configuration task of this framework is the setup of all communication connections between :code:`RlApplication`\ s, e.g. the connection of all :code:`ObservationApplication`\ s to their corresponding :code:`AgentApplication`\ s. For this purpose, the :code:`CommunicationHelper` was created. The framework should allow all possible connections between pairs of :code:`RlApplication`\ s without making any restricting assumptions. This is achieved by letting the user configure the communication relationships via an adjacency list. Thereby, it is even possible to configure multiple different connections between two :code:`RlApplication`\ s, e.g. over different channels.

To allow the user to identify :code:`RlApplication`\ s, e.g. when passing them to this adjacency list, :code:`RlApplicationId`\ s were introduced. They consist of a part identifying the :code:`applicationType` (e.g. :code:`ObservationApplication`) and an :code:`applicationId` which is unique among all :code:`RlApplication`\ s of this type. In this way, the :code:`applicationType` can be identified when necessary, and whenever the :code:`applicationType` is clear, only the :code:`applicationId` is used for identification.

The :code:`CommunicationHelper` is also used for creating these unique IDs. To do this, it needs access to all :code:`RlApplication`\ s existing in a scenario. One option for this is to create all :code:`RlApplication`\ s within the :code:`CommunicationHelper`. This would require the user to provide the :code:`CommunicationHelper` with all :code:`Node`\ s and the corresponding :code:`applicationType`\ s to install on them. However, this would just move the identification problem to the level of the :code:`Node`\ s. Additionally, this approach would conform less with the general idea that the user defines the location of applications by installing them on :code:`Node`\ s. That is why the tasks of creating/installing :code:`RlApplication`\ s on the one hand and configuring them and their communication relationships on the other hand were split between the :code:`RlApplicationHelper` and the :code:`CommunicationHelper`. This requires the user to pass all :code:`RlApplication`\ s to the :code:`CommunicationHelper`. Then the :code:`RlApplicationId`\ s can be set by the :code:`CommunicationHelper` via the :code:`CommunicationHelper::SetIds` method.

Besides a pair of :code:`RlApplicationId`\ s, the user has to specify in the adjacency list all attributes that are necessary to configure the connection between these :code:`RlApplication`\ s. This is done via :code:`CommunicationAttributes` as a compact format for all possible configuration data. If no information (i.e. :code:`{}`) is provided by the user, the framework will establish :code:`SimpleChannelInterface`\ s, so that as little configuration as possible is required. If :code:`SocketCommunicationAttributes` are provided, the :code:`CommunicationHelper` is responsible for creating the according :code:`ChannelInterface`\ s and connecting them. The main goal when designing this configuration interface was to enable as many configurations as possible while requiring as few as possible. That is why, for example, a default protocol for :code:`SocketCommunicationAttributes` and a default IP address for each :code:`RlApplication` (derived from the list of network interfaces of its :code:`Node`) were implemented.

The :code:`CommunicationHelper::Configure` method was introduced to make it possible to call the :code:`RlApplication::Setup` method on all :code:`RlApplication`\ s at a time which is independent from e.g. the constructors, so that it can be done after setting the :code:`RlApplicationId`\ s but before setting up the communication relationships. The methods :code:`CommunicationHelper::Configure` and :code:`CommunicationHelper::SetIds` could be combined into a single method, so that the user does not have to call two methods. However, this has not been done so far because both methods perform very different tasks.
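A sketch of a typical setup following this design. :code:`SetIds` and :code:`Configure` are the methods described above; the application type names, the setter and adjacency-list shapes, and the :code:`GetId` accessor are assumptions for illustration, and :code:`sensorNodes`/:code:`agentNode` are placeholders from the surrounding scenario.

.. code-block:: cpp

   // Create and install the RlApplications with the RlApplicationHelper.
   RlApplicationHelper helper(TypeId::LookupByName("ns3::MyObservationApp"));
   RlApplicationContainer observationApps = helper.Install(sensorNodes);
   helper.SetTypeId(TypeId::LookupByName("ns3::MyAgentApp"));
   RlApplicationContainer agentApps = helper.Install(agentNode);

   // Pass all applications to the CommunicationHelper, assign unique
   // RlApplicationIds, and call Setup on every application.
   CommunicationHelper commHelper;
   commHelper.SetObservationApps(observationApps);
   commHelper.SetAgentApps(agentApps);
   commHelper.SetIds();
   commHelper.Configure();

   // Adjacency list: {} yields a SimpleChannelInterface, while providing
   // SocketCommunicationAttributes yields sockets over simulated traffic.
   commHelper.AddCommunication({
       {observationApps.GetId(0), agentApps.GetId(0), {}},
       {observationApps.GetId(1), agentApps.GetId(0),
        SocketCommunicationAttributes{/* protocol etc.; defaults exist */}},
   });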
.. _sec-framework-expansion:

Framework expansion options
---------------------------

* Create an interface for sharing model updates or policies between agents. (already implemented to a large extent)

  * In some network infrastructures it is necessary to outsource training to a remote server, to share learned model updates, or to share policies between participants. To simulate the resulting constraints and research possible opportunities, it is required to realistically simulate the performance of shared updates and policies as well as their size. This feature addresses questions like:

    * How is performance affected when learning distributedly?
    * What burden does the resulting communication pose on a network, and can it be reduced?

  * The required communication functionality is already implemented to a large extent: On the *ns-3* side, :code:`AgentApplication::OnRecvFromAgent` contains the logic with which the agent handles model weights, experience, and model update messages. The message flow is depicted in :ref:`fig-model-updates`.

    .. _fig-model-updates:
    .. figure:: figures/model-updates.*
       :align: center

       Interaction of inference agents, training server, and the ns3-ai message interface

    This message flow is fully implemented; only the ns3-ai message handling on the Python side alongside the interaction with Ray is still missing.

* Support moving agents (and other :code:`RlApplication`\ s) to another :code:`Node`. (not started)

  * In complex scenarios it might be required to change the :code:`Node` from which the agent receives its observations or where it performs its actions. Currently, this would require installing :code:`ObservationApplication`\ s and :code:`ActionApplication`\ s on every possible :code:`Node` and then switching between them when sending. Since this is prone to bugs at runtime and difficult to track, especially for bigger scenarios, it would be more convenient to move an existing application to a different :code:`Node`. The same applies if agents shall switch the :code:`Node` during simulation time.
    This would be possible via model updates if an :code:`AgentApplication` were installed on every possible :code:`Node`. However, it would be much easier if an application could simply be moved to another :code:`Node`.

* Checkpointing (almost done)

  * To simulate inference without training, or to continue training of promising policies, it is required to implement Ray's checkpointing. We have already implemented inference runs. However, continuing training has not been tested yet.

* Multithreading vs. singlethreading (not started)

  * What happens if multiple observations arrive while the agent is already inferring an action? In a realistic scenario with limited resources, the agent might only be capable of starting a limited number of threads for inference. Then, increased parallelism increases the inference time of each job. Maybe the node is even single-threaded. To provide inference for all observations, it would be required to buffer some of the observations. This feature would allow simulating the latency introduced thereby, as well as additional limitations with regard to the buffer size. Scenarios could explore questions like: Which buffer strategies are sensible for overall performance if the buffer is full? How beneficial is it to provide more resources for the agent in order to allow multithreading? This would lead to quantifiable answers to complex optimization problems.