RL Models

FRobs_RL includes an API for all the RL algorithms implemented in stable-baselines3, namely PPO, A2C, DDPG, DQN, SAC and TD3. In every case the algorithm parameters are read from the ROS parameter server, which means that the parameters can be loaded from a YAML file or set/changed through the ROS CLI commands.
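
For example, once the YAML file has been loaded, any value can be read or overridden at runtime through the ROS parameter server. The snippet below is a minimal sketch, assuming a running ROS master and that the parameters live under the model_params namespace shown later on this page:

import rospy

rospy.init_node('param_example')

# Read a value that was loaded from the YAML file (parameter namespace assumed)
training_steps = rospy.get_param('/model_params/training_steps', 5000)

# Override a value before the algorithm is created, instead of editing the YAML file
rospy.set_param('/model_params/save_freq', 2000)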

The logging system is configured in the algorithm API; by default FRobs_RL saves the training logs in both CSV and TensorBoard formats. The API also saves the model periodically, at a step interval set by the user in the YAML file; if the user does not want to save intermediate models, they just need to change the associated parameter.

FRobs_RL includes YAML templates with the parameters of all algorithms in its config folder. To use any of the algorithms, the user just needs to create a config folder inside the ROS package where the environment to train is located, change the parameters as needed and call the algorithm from the Python script.

Basic YAML parameters

The YAML file of every algorithm has the following parameters:

model_params:

    # Training
    training_steps: 5000      # The number of training steps to perform

    # Save params
    save_freq: 1000 # The step interval to save a new model
    save_prefix: ppo_model # Prefix of models saved
    trained_model_name: trained_model # Name of final model to save
    save_replay_buffer: False

    # Load model params
    load_model: False
    model_name: ppo_model_5000_steps

    # Logging parameters
    log_folder: PPO_1
    log_interval: 4 # The number of episodes between logs
    reset_num_timesteps: False # If true, will reset the number of timesteps to 0 every training

    # Use custom policy - Only MlpPolicy is supported (Only used when new model is created)
    use_custom_policy: False
    policy_params:
        net_arch: [400, 300] # List of hidden layer sizes
        activation_fn: relu  # relu, tanh, elu or selu
        features_extractor_class: FlattenExtractor # FlattenExtractor, BaseFeaturesExtractor or CombinedExtractor
        optimizer_class: Adam # Adam, Adadelta, Adagrad, RMSprop or SGD

With these parameters the user can select how many steps to train the model (training_steps), how often the model will be saved (save_freq), the name of the final model saved after all the training steps (trained_model_name), whether to load a previously saved model and its name (load_model, model_name), etc.

For the logging parameters, the user needs to remember to change the log_folder parameter every time a training script is executed, as FRobs_RL checks whether the folder already exists and will raise an error if a folder with the same name is found.

Finally, the user can choose whether they want a custom neural network (use_custom_policy) and its architecture/parameters, such as the optimizer or the activation function.
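
For reference, when use_custom_policy is enabled these fields correspond to the kind of options that stable-baselines3 accepts through policy_kwargs. The sketch below shows an equivalent dictionary written by hand in plain stable-baselines3; the mapping from the YAML names to PyTorch classes is only an illustration, not part of the FRobs_RL API:

import torch as th
from stable_baselines3.common.torch_layers import FlattenExtractor

# Illustrative policy_kwargs equivalent to the YAML fields above
policy_kwargs = dict(
    net_arch=[400, 300],                       # list of hidden layer sizes
    activation_fn=th.nn.ReLU,                  # relu -> th.nn.ReLU, tanh -> th.nn.Tanh, etc.
    features_extractor_class=FlattenExtractor,
    optimizer_class=th.optim.Adam,
)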

Algorithm parameters

The algorithm-specific parameters are located below the general parameters shown above. They include whether to use action noise or gSDE (generalized state-dependent exploration), along with the algorithm hyperparameters such as the learning rate, batch size, etc. The default algorithm-specific parameters for PPO are shown below:

# Use SDE
use_sde: False
sde_params:
    sde_sample_freq: -1

# PPO parameters
ppo_params:
    learning_rate: 0.0003
    n_steps: 100    # The number of steps to run for each environment per update (i.e. rollout buffer size is n_steps * n_envs where n_envs is number of environment copies running in parallel)
    batch_size: 100 # Minibatch size
    n_epochs: 5     # Number of epochs when optimizing the surrogate loss
    gamma: 0.99
    gae_lambda: 0.95
    clip_range: 0.2
    ent_coef: 0.0
    vf_coef: 0.5
    max_grad_norm: 0.5
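
These values map directly onto the arguments of the stable-baselines3 PPO constructor. As a reference only, the sketch below shows the equivalent call written against plain stable-baselines3 rather than the FRobs_RL wrapper, using a standard Gym environment as a placeholder:

import gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")  # placeholder environment (id depends on the installed gym version)

# Illustrative only: the YAML values above passed to the stable-baselines3 PPO constructor
model = PPO(
    "MlpPolicy", env,
    learning_rate=0.0003,
    n_steps=100,
    batch_size=100,
    n_epochs=5,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.0,
    vf_coef=0.5,
    max_grad_norm=0.5,
    use_sde=False,
    sde_sample_freq=-1,
)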

Use of algorithms

After copying the YAML template to the config folder of the user's ROS package, the user just needs to import the algorithm from the library and set the env, the save path where the models will be saved, the log path where all the logs will be saved, and the location and name of the YAML parameter file. After creating the algorithm object, the user needs to call the train method to start the learning process, which will run for the number of steps specified in the YAML file (training_steps).

An example using the TD3 algorithm is shown below:

from kobuki_maze_rl.task_env import kobuki_maze
from frobs_rl.common import ros_gazebo, ros_node
import gym
import rospy
import rospkg
import sys

# Environment wrappers (import paths assumed; adjust to the installed frobs_rl version)
from frobs_rl.wrappers.NormalizeActionWrapper import NormalizeActionWrapper
from frobs_rl.wrappers.NormalizeObservWrapper import NormalizeObservWrapper
from frobs_rl.wrappers.TimeLimitWrapper import TimeLimitWrapper

# Import TD3 algorithm
from frobs_rl.models.td3 import TD3

if __name__ == '__main__':
    # Kill all processes related to previous runs
    ros_node.ros_kill_all_processes()

    # Launch Gazebo
    ros_gazebo.launch_Gazebo(paused=True, gui=False)

    # Start node
    rospy.logwarn("Start")
    rospy.init_node('kobuki_maze_train')

    # Launch the task environment
    env = gym.make('KobukiMazeEnv-v0')

    #--- Normalize action space
    env = NormalizeActionWrapper(env)

    #--- Normalize observation space
    env = NormalizeObservWrapper(env)

    #--- Set max steps
    env = TimeLimitWrapper(env, max_steps=15000)
    env.reset()

    #--- Set the save and log path
    rospack = rospkg.RosPack()
    pkg_path = rospack.get_path("kobuki_maze_rl")

    #-- TD3
    save_path = pkg_path + "/models/td3/"
    log_path = pkg_path + "/logs/td3/"
    model = TD3(env, save_path, log_path, config_file_pkg="kobuki_maze_rl", config_filename="td3.yaml")

    #--- Train for the number of steps set in the YAML file, save the final model and close the environment
    model.train()
    model.save_model()
    model.close_env()

    sys.exit()
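
Once training has finished, the saved model can be loaded back for evaluation and queried with the predict method documented below. The snippet is a sketch that assumes the TD3 wrapper forwards the load_trained argument of the base class and that the model to load is selected through the YAML load_model/model_name parameters; check the frobs_rl API for the exact signature:

# Sketch: evaluate a previously trained model (assumes load_trained is forwarded to BasicModel)
model = TD3(env, save_path, log_path, load_trained=True,
            config_file_pkg="kobuki_maze_rl", config_filename="td3.yaml")

obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()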

Custom algorithms

While all the algorithms in stable-baselines3 are included, the user can also use the FRobs_RL library with their own custom algorithms, although in that case they must handle the model saving and logging processes themselves.

If the user wants to implement a new algorithm, it is recommended to do so by inheriting from the stable-baselines3 base RL class, documented on the official stable-baselines3 website (sb3 base RL class) and in its GitHub repository.
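
For reference, the sketch below uses a stock stable-baselines3 algorithm as a stand-in for a custom one, to show the logging and saving steps the user then has to handle manually (the environment id and the paths are placeholders):

import gym
from stable_baselines3 import SAC
from stable_baselines3.common.logger import configure

env = gym.make('KobukiMazeEnv-v0')  # any FRobs_RL task environment (placeholder id)

# Logging must be configured by hand; here CSV and TensorBoard, like the FRobs_RL default
logger = configure('/tmp/custom_logs', ["stdout", "csv", "tensorboard"])

model = SAC("MlpPolicy", env)
model.set_logger(logger)

model.learn(total_timesteps=5000, log_interval=4)

# Saving the model is also the user's responsibility
model.save('/tmp/custom_models/sac_trained_model')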

Basic algorithm API

Below is the documentation of the basic API of the RL algorithms, which is inherited by every one of the included algorithms.

class basic_model.BasicModel(env, save_model_path, log_path, ns='/', load_trained=False)[source]

Base class for all the algorithms supported by the frobs_rl library.

Parameters
  • env – The environment to be used.

  • save_model_path – The path to save the model.

  • log_path – The path to save the log.

  • ns – The namespace of the parameters.

  • load_trained – Whether or not to load a trained model.

check_env() → bool[source]

Use the stable-baselines check_env method to check the environment.

Returns

True if the environment is correct, False otherwise.

Return type

bool

close_env() → bool[source]

Use the env close method to close the environment.

Returns

True if the environment was closed, False otherwise.

Return type

bool

predict(observation, state=None, mask=None, deterministic=False)[source]

Get the current action based on the observation, state or mask

Parameters
  • observation (ndarray) – The environment observation.

  • state (ndarray) – The previous states of the environment, used in recurrent policies.

  • mask (ndarray) – The mask of the last states, used in recurrent policies.

  • deterministic (bool) – Whether or not to return deterministic actions.

Returns

The action to be taken and the next state (for recurrent policies).

Return type

ndarray, ndarray

save_model() → bool[source]

Function to save the model.

Returns

True if the model was saved, False otherwise.

Return type

bool

save_replay_buffer() → bool[source]

Function to save the replay buffer. Training must be finished before calling it, otherwise an error will be raised.

Returns

True if the replay buffer was saved, False otherwise.

Return type

bool

set_model_logger() → bool[source]

Function to set the logger of the model.

Returns

True if the logger was set, False otherwise.

Return type

bool

train() → bool[source]

Function to train the model for the number of steps specified in the ROS parameter server. The function will automatically save the model after training.

Returns

True if the training was successful, False otherwise.

Return type

bool