rcognita.controllers.CtrlOptPred

class rcognita.controllers.CtrlOptPred(dim_input, dim_output, mode='MPC', ctrl_bnds=[], action_init=[], t0=0, sampling_time=0.1, Nactor=1, pred_step_size=0.1, sys_rhs=[], sys_out=[], state_sys=[], prob_noise_pow=1, is_est_model=0, model_est_stage=1, model_est_period=0.1, buffer_size=20, model_order=3, model_est_checks=0, gamma=1, Ncritic=4, critic_period=0.1, critic_struct='quad-nomix', stage_obj_struct='quadratic', stage_obj_pars=[], observation_target=[])

Class of predictive optimal controllers, primarily model-predictive control and predictive reinforcement learning, that optimize a finite-horizon cost.

Currently, the actor model is trivial: an action is generated directly without additional policy parameters.

dim_input, dim_output

Dimensions of the input and output, which should match those of the system to be controlled.

Type

: integer

mode

Controller mode. Currently available (\(\rho\) is the stage objective, \(\gamma\) is the discounting factor):

Controller modes

Mode

Cost function

‘MPC’ - Model-predictive control (MPC)

\(J_a \left( y_1, \{action\}_1^{N_a} \right)= \sum_{k=1}^{N_a} \gamma^{k-1} \rho(y_k, u_k)\)

‘RQL’ - RL/ADP via \(N_a-1\) roll-outs of \(\rho\)

\(J_a \left( y_1, \{action\}_{1}^{N_a}\right) = \sum_{k=1}^{N_a-1} \gamma^{k-1} \rho(y_k, u_k) + \hat Q^{\theta}(y_{N_a}, u_{N_a})\)

‘SQL’ - RL/ADP via stacked Q-learning

\(J_a \left( y_1, \{action\}_1^{N_a} \right) = \sum_{k=1}^{N_a-1} \gamma^{k-1} \hat Q^{\theta}(y_{N_a}, u_{N_a})\)

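Here, \(\theta\) are the critic parameters (neural network weights, say), \(y_1\) is the current observation, and \(u_k\) is the action at step \(k\), so that \(\{action\}_1^{N_a}\) denotes the sequence \(u_1, \dots, u_{N_a}\).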

Add your specification into the table when customizing the agent.
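The 'MPC' and 'RQL' cost functions in the table can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the names rho, mpc_cost, rql_cost and q_hat are hypothetical, the stage objective is assumed to be a plain sum of squares, and the predicted observations are passed in directly instead of being generated by a model.

```python
def rho(y, u):
    # Assumed stage objective: sum of squared observation and action entries.
    return sum(v * v for v in y) + sum(v * v for v in u)


def mpc_cost(observations, actions, gamma=1.0):
    # J_a = sum_{k=1}^{N_a} gamma^(k-1) * rho(y_k, u_k); enumerate starts at 0,
    # so gamma ** k here corresponds to gamma^(k-1) in the formula.
    return sum(
        gamma ** k * rho(y, u)
        for k, (y, u) in enumerate(zip(observations, actions))
    )


def rql_cost(observations, actions, q_hat, gamma=1.0):
    # The first N_a - 1 steps use rho; the last step is scored by the critic q_hat.
    head = mpc_cost(observations[:-1], actions[:-1], gamma)
    return head + q_hat(observations[-1], actions[-1])


ys = [[1.0, 0.0], [0.5, 0.0], [0.25, 0.0]]  # y_1, y_2, y_3 (N_a = 3)
us = [[1.0], [0.0], [0.0]]                  # u_1, u_2, u_3
print(mpc_cost(ys, us, gamma=0.5))          # 2.140625
print(rql_cost(ys, us, lambda y, u: 10.0, gamma=0.5))  # 12.125
```

In the controller itself, the action sequence is the decision variable of an optimizer, and the observations along the horizon come from the supplied or estimated model.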

Type

: string

ctrl_bnds

Box control constraints. The first element in each row is the lower bound and the second is the upper bound. If empty, control is unconstrained (default).

Type

: array of shape [dim_input, 2]
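As an illustration of the expected layout, the following sketch builds bounds for a two-dimensional input and applies them by clipping. The helper clip_action is hypothetical; the controller presumably hands the bounds to its optimizer instead of clipping, so this only shows how rows map to input components.

```python
def clip_action(action, ctrl_bnds):
    # An empty ctrl_bnds means the action is left unconstrained.
    if len(ctrl_bnds) == 0:
        return list(action)
    # Row i holds [lower, upper] for the i-th input component.
    return [min(max(a, lo), hi) for a, (lo, hi) in zip(action, ctrl_bnds)]


ctrl_bnds = [[-1.0, 1.0], [0.0, 5.0]]       # dim_input = 2
print(clip_action([2.0, -3.0], ctrl_bnds))  # [1.0, 0.0]
```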

action_init

Initial action to initialize optimizers.

Type

: array of shape [dim_input, ]

t0

Initial value of the controller’s internal clock.

Type

: number

sampling_time

Controller’s sampling time (in seconds).

Type

: number

Nactor

Size of prediction horizon \(N_a\).

Type

: natural number

pred_step_size

Prediction step size in \(J_a\) as defined above (in seconds). Should be a multiple of sampling_time. Commonly the two are equal, but the step size is left adjustable for convenience. A larger prediction step size leads to a longer effective horizon.

Type

: number
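The relation between Nactor, pred_step_size and the time span covered by the prediction can be sketched as follows. Both helper functions are hypothetical, and the horizon length is taken to be roughly Nactor times pred_step_size, which is an assumption about how the steps are counted.

```python
def effective_horizon(Nactor, pred_step_size):
    # Approximate time span (in seconds) covered by Nactor prediction steps.
    return Nactor * pred_step_size


def is_multiple(pred_step_size, sampling_time, tol=1e-9):
    # True if pred_step_size is a positive integer multiple of sampling_time,
    # up to floating-point tolerance.
    ratio = pred_step_size / sampling_time
    return round(ratio) >= 1 and abs(ratio - round(ratio)) < tol


print(effective_horizon(Nactor=10, pred_step_size=0.3))
print(is_multiple(0.3, 0.1), is_multiple(0.15, 0.1))
```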

sys_rhs, sys_out

Functions that represent the right-hand side and the output, respectively, of the externally supplied model. This model could be, for instance, the true model of the system. In turn, state_sys represents the (true) current state of the system and should be updated accordingly. The parameters sys_rhs, sys_out, and state_sys are used in those controller modes that rely on them.

Type

: functions

prob_noise_pow

Power of probing noise during an initial phase to fill the estimator’s buffer before applying optimal control.

Type

: number

is_est_model

Flag whether to estimate a system model. See _estimate_model().

Type

: number

model_est_stage

Initial time segment to fill the estimator’s buffer before applying optimal control (in seconds).

Type

: number

model_est_period

Time between model estimate updates (in seconds).

Type

: number

buffer_size

Size of the buffer to store data.

Type

: natural number

model_order

Order of the state-space estimation model

\[\begin{array}{ll} \hat x^+ & = A \hat x + B action, \newline observation^+ & = C \hat x + D action. \end{array}\]

See _estimate_model(). This is just one particular model estimator. When customizing, _estimate_model() may be changed, and the parameter model_order along with it. For instance, you might want to use an artificial neural net and specify its layers and numbers of neurons, in which case model_order could be replaced by, say, Nlayers, Nneurons.

Type

: natural number
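To make the role of model_order concrete, here is a minimal NumPy sketch of one step of the state-space model above. The function ss_step and the example matrices are illustrative assumptions; in the controller, the matrices would come from the model estimator.

```python
import numpy as np


def ss_step(A, B, C, D, x_hat, action):
    # One step of the estimated model, following the equations as stated:
    # x_hat^+ = A x_hat + B action, observation^+ = C x_hat + D action.
    x_next = A @ x_hat + B @ action
    observation_next = C @ x_hat + D @ action
    return x_next, observation_next


model_order, dim_input, dim_output = 3, 1, 2
A = 0.9 * np.eye(model_order)            # shape [model_order, model_order]
B = np.ones((model_order, dim_input))    # shape [model_order, dim_input]
C = np.ones((dim_output, model_order))   # shape [dim_output, model_order]
D = np.zeros((dim_output, dim_input))    # shape [dim_output, dim_input]

x_next, y_next = ss_step(A, B, C, D, np.zeros(model_order), np.array([1.0]))
print(x_next.shape, y_next.shape)        # (3,) (2,)
```

The state dimension equals model_order, so increasing it enlarges A, B and C while the input and output dimensions stay fixed.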

model_est_checks

Estimated model parameters can be stored in stacks and the best among the model_est_checks last ones is picked. May improve the prediction quality somewhat.

Type

: natural number

gamma

Discounting factor. Characterizes fading of stage objectives along horizon.

Type

: number in (0, 1]

Ncritic

Critic stack size \(N_c\). The critic optimizes the temporal error, which is a measure of the critic’s ability to capture the optimal infinite-horizon cost (a.k.a. the value function). The temporal errors are stacked up using the said buffer.

Type

: natural number
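One common way to stack temporal errors for a critic that is linear in its weights is sketched below. This is an assumption based on standard Q-learning, not code taken from the library: the function critic_cost, the sign convention of the error, and the buffer layout (oldest sample first) are all hypothetical.

```python
import numpy as np


def critic_cost(w, features, stage_objs, gamma=1.0):
    # features: array [Ncritic + 1, n_features], critic features of the buffered
    #           (observation, action) pairs, oldest first.
    # stage_objs: array [Ncritic], stage objective at the first Ncritic samples.
    q = features @ w                                   # Q-estimates along the buffer
    td_errors = q[:-1] - gamma * q[1:] - stage_objs    # one temporal error per pair
    return 0.5 * np.sum(td_errors ** 2)                # stacked squared errors


w = np.array([1.0])
features = np.array([[4.0], [2.0], [1.0]])   # Ncritic = 2
stage_objs = np.array([2.0, 1.0])
print(critic_cost(w, features, stage_objs, gamma=1.0))  # 0.0
print(critic_cost(w, features, stage_objs, gamma=0.5))  # 0.625
```

A larger Ncritic fits the weights against more consecutive samples at once, which is why it cannot usefully exceed what the buffer holds.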

critic_period

Time between critic updates (in seconds), analogous to model_est_period.

Type

: number

critic_struct

Choice of the structure of the critic’s features.

Currently available:

Critic structures

Mode

Structure

‘quad-lin’

Quadratic-linear

‘quadratic’

Quadratic

‘quad-nomix’

Quadratic, no mixed terms

‘quad-mix’

Quadratic, no mixed terms within the input or within the output (input-output cross terms are kept), i.e., \(w_1 y_1^2 + \dots + w_p y_p^2 + w_{p+1} y_1 u_1 + \dots + w_{\bullet} u_1^2 + \dots\), where \(w\) is the critic’s weight vector

Add your specification into the table when customizing the critic.
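The feature structures in the table can be sketched with NumPy as follows. This is an illustration derived from the descriptions and the formula above, not the library's code; the function critic_features is hypothetical, the ordering of the terms is an assumption, and 'quad-lin' is left out because the table does not spell out its terms.

```python
import numpy as np


def critic_features(observation, action, critic_struct="quad-nomix"):
    chi = np.concatenate([observation, action])
    if critic_struct == "quad-nomix":
        # Squares only: y_1^2, ..., y_p^2, u_1^2, ...
        return chi ** 2
    if critic_struct == "quadratic":
        # All distinct products chi_i * chi_j with i <= j.
        return np.outer(chi, chi)[np.triu_indices(len(chi))]
    if critic_struct == "quad-mix":
        # Squares plus observation-action cross terms y_i * u_j.
        return np.concatenate(
            [observation ** 2, np.kron(observation, action), action ** 2]
        )
    raise ValueError(f"unsupported critic_struct: {critic_struct}")


y, u = np.array([1.0, 2.0]), np.array([3.0])
print(critic_features(y, u, "quad-nomix"))  # [1. 4. 9.]
print(critic_features(y, u, "quad-mix"))    # [1. 4. 3. 6. 9.]
```

With such features, the critic's estimate is the inner product of the weight vector \(w\) with the feature vector, so the choice of structure fixes the number of weights.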

Type

: string

stage_obj_struct

Choice of the stage objective structure.

Currently available:

Stage objective structures

Mode

Structure

‘quadratic’

Quadratic \(\chi^\top R_1 \chi\), where \(\chi = [observation, action]\), stage_obj_pars should be [R1]

‘biquadratic’

4th order \(\left( \chi^\top \right)^2 R_2 \left( \chi \right)^2 + \chi^\top R_1 \chi\), where \(\chi = [observation, action]\), stage_obj_pars should be [R1, R2]

Pass the correct stage objective parameters in stage_obj_pars (as a list).

When customizing the stage objective, add your specification into the table above.

Type

: string
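The two stage objective structures can be sketched with NumPy as below. The function stage_obj_sketch is a hypothetical stand-alone helper, not the class method, and \(\left( \chi \right)^2\) is assumed to mean the element-wise square of \(\chi\).

```python
import numpy as np


def stage_obj_sketch(observation, action, stage_obj_struct, stage_obj_pars):
    chi = np.concatenate([observation, action])
    R1 = stage_obj_pars[0]
    if stage_obj_struct == "quadratic":
        # chi^T R1 chi, with stage_obj_pars = [R1]
        return chi @ R1 @ chi
    if stage_obj_struct == "biquadratic":
        # (chi^2)^T R2 (chi^2) + chi^T R1 chi, with stage_obj_pars = [R1, R2]
        R2 = stage_obj_pars[1]
        return (chi ** 2) @ R2 @ (chi ** 2) + chi @ R1 @ chi
    raise ValueError(f"unsupported stage_obj_struct: {stage_obj_struct}")


y, u = np.array([1.0, 2.0]), np.array([1.0])
R1 = np.diag([1.0, 1.0, 2.0])   # shape [dim_output + dim_input] squared
R2 = np.eye(3)
print(stage_obj_sketch(y, u, "quadratic", [R1]))        # 7.0
print(stage_obj_sketch(y, u, "biquadratic", [R1, R2]))  # 25.0
```

Both matrices are square with side dim_output + dim_input, since \(\chi\) stacks the observation and the action.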

References

1

Osinenko, Pavel, et al. “Stacked adaptive dynamic programming with unknown system model.” IFAC-PapersOnLine 50.1 (2017): 4150-4155

__init__(dim_input, dim_output, mode='MPC', ctrl_bnds=[], action_init=[], t0=0, sampling_time=0.1, Nactor=1, pred_step_size=0.1, sys_rhs=[], sys_out=[], state_sys=[], prob_noise_pow=1, is_est_model=0, model_est_stage=1, model_est_period=0.1, buffer_size=20, model_order=3, model_est_checks=0, gamma=1, Ncritic=4, critic_period=0.1, critic_struct='quad-nomix', stage_obj_struct='quadratic', stage_obj_pars=[], observation_target=[])
Parameters
  • dim_input (: integer) – Dimension of the input, which should match that of the system to be controlled.

  • dim_output (: integer) – Dimension of the output, which should match that of the system to be controlled.

  • mode (: string) –

    Controller mode. Currently available (\(\rho\) is the stage objective, \(\gamma\) is the discounting factor):

    Controller modes

    Mode

    Cost function

    ’MPC’ - Model-predictive control (MPC)

    \(J_a \left( y_1, \{action\}_1^{N_a} \right)= \sum_{k=1}^{N_a} \gamma^{k-1} \rho(y_k, u_k)\)

    ’RQL’ - RL/ADP via \(N_a-1\) roll-outs of \(\rho\)

    \(J_a \left( y_1, \{action\}_{1}^{N_a}\right) = \sum_{k=1}^{N_a-1} \gamma^{k-1} \rho(y_k, u_k) + \hat Q^{\theta}(y_{N_a}, u_{N_a})\)

    ’SQL’ - RL/ADP via stacked Q-learning

    \(J_a \left( y_1, \{action\}_1^{N_a} \right) = \sum_{k=1}^{N_a-1} \gamma^{k-1} \hat Q^{\theta}(y_{N_a}, u_{N_a})\)

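    Here, \(\theta\) are the critic parameters (neural network weights, say), \(y_1\) is the current observation, and \(u_k\) is the action at step \(k\), so that \(\{action\}_1^{N_a}\) denotes the sequence \(u_1, \dots, u_{N_a}\).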

    Add your specification into the table when customizing the agent.

  • ctrl_bnds (: array of shape [dim_input, 2]) – Box control constraints. The first element in each row is the lower bound and the second is the upper bound. If empty, control is unconstrained (default).

  • action_init (: array of shape [dim_input, ]) – Initial action to initialize optimizers.

  • t0 (: number) – Initial value of the controller’s internal clock.

  • sampling_time (: number) – Controller’s sampling time (in seconds).

  • Nactor (: natural number) – Size of prediction horizon \(N_a\).

  • pred_step_size (: number) – Prediction step size in \(J_a\) as defined above (in seconds). Should be a multiple of sampling_time. Commonly the two are equal, but the step size is left adjustable for convenience. A larger prediction step size leads to a longer effective horizon.

  • sys_rhs (: functions) – Function that represents the right-hand side of the externally supplied model. This model could be, for instance, the true model of the system. In turn, state_sys represents the (true) current state of the system and should be updated accordingly. The parameters sys_rhs, sys_out, and state_sys are used in those controller modes that rely on them.

  • sys_out (: functions) – Function that represents the output of the externally supplied model. See sys_rhs.

  • prob_noise_pow (: number) – Power of probing noise during an initial phase to fill the estimator’s buffer before applying optimal control.

  • is_est_model (: number) – Flag whether to estimate a system model. See _estimate_model().

  • model_est_stage (: number) – Initial time segment to fill the estimator’s buffer before applying optimal control (in seconds).

  • model_est_period (: number) – Time between model estimate updates (in seconds).

  • buffer_size (: natural number) – Size of the buffer to store data.

  • model_order (: natural number) –

    Order of the state-space estimation model

    \[\begin{array}{ll} \hat x^+ & = A \hat x + B action, \newline observation^+ & = C \hat x + D action. \end{array}\]

    See _estimate_model(). This is just one particular model estimator. When customizing, _estimate_model() may be changed, and the parameter model_order along with it. For instance, you might want to use an artificial neural net and specify its layers and numbers of neurons, in which case model_order could be replaced by, say, Nlayers, Nneurons.

  • model_est_checks (: natural number) – Estimated model parameters can be stored in stacks and the best among the model_est_checks last ones is picked. May improve the prediction quality somewhat.

  • gamma (: number in (0, 1]) – Discounting factor. Characterizes fading of stage objectives along horizon.

  • Ncritic (: natural number) – Critic stack size \(N_c\). The critic optimizes the temporal error, which is a measure of the critic’s ability to capture the optimal infinite-horizon cost (a.k.a. the value function). The temporal errors are stacked up using the said buffer.

  • critic_period (: number) – Time between critic updates (in seconds), analogous to model_est_period.

  • critic_struct (: string) –

    Choice of the structure of the critic’s features.

    Currently available:

    Critic feature structures

    Mode

    Structure

    ’quad-lin’

    Quadratic-linear

    ’quadratic’

    Quadratic

    ’quad-nomix’

    Quadratic, no mixed terms

    ’quad-mix’

    Quadratic, no mixed terms within the input or within the output (input-output cross terms are kept), i.e., \(w_1 y_1^2 + \dots + w_p y_p^2 + w_{p+1} y_1 u_1 + \dots + w_{\bullet} u_1^2 + \dots\), where \(w\) is the critic’s weight vector

    Add your specification into the table when customizing the critic.

  • stage_obj_struct (: string) –

    Choice of the stage objective structure.

    Currently available:

    Running objective structures

    Mode

    Structure

    ’quadratic’

    Quadratic \(\chi^\top R_1 \chi\), where \(\chi = [observation, action]\), stage_obj_pars should be [R1]

    ’biquadratic’

    4th order \(\left( \chi^\top \right)^2 R_2 \left( \chi \right)^2 + \chi^\top R_1 \chi\), where \(\chi = [observation, action]\), stage_obj_pars should be [R1, R2]

Methods

__init__(dim_input, dim_output[, mode, …])

Initializes the controller; see Parameters above.

compute_action(t, observation)

Main method.

receive_sys_state(state)

Fetch exogenous model state.

reset(t0)

Resets agent for use in multi-episode simulation.

stage_obj(observation, action)

Stage (equivalently, instantaneous or running) objective.

upd_accum_obj(observation, action)

Sample-to-sample accumulated (summed up or integrated) stage objective.