rcognita.controllers.CtrlOptPred
- class rcognita.controllers.CtrlOptPred(dim_input, dim_output, mode='MPC', ctrl_bnds=[], action_init=[], t0=0, sampling_time=0.1, Nactor=1, pred_step_size=0.1, sys_rhs=[], sys_out=[], state_sys=[], prob_noise_pow=1, is_est_model=0, model_est_stage=1, model_est_period=0.1, buffer_size=20, model_order=3, model_est_checks=0, gamma=1, Ncritic=4, critic_period=0.1, critic_struct='quad-nomix', stage_obj_struct='quadratic', stage_obj_pars=[], observation_target=[])
Class of predictive optimal controllers, primarily model-predictive control and predictive reinforcement learning, that optimize a finite-horizon cost.
Currently, the actor model is trivial: an action is generated directly without additional policy parameters.
- dim_input, dim_output
Dimensions of the input and the output, which must match those of the system to be controlled.
- Type
: integer
- mode
Controller mode. Currently available (\(\rho\) is the stage objective, \(\gamma\) is the discounting factor):
Controller modes (each mode is followed by its cost function):
‘MPC’ - Model-predictive control (MPC)
\(J_a \left( y_1, \{action\}_1^{N_a} \right)= \sum_{k=1}^{N_a} \gamma^{k-1} \rho(y_k, u_k)\)
‘RQL’ - RL/ADP via \(N_a-1\) roll-outs of \(\rho\)
\(J_a \left( y_1, \{action\}_{1}^{N_a}\right) = \sum_{k=1}^{N_a-1} \gamma^{k-1} \rho(y_k, u_k) + \hat Q^{\theta}(y_{N_a}, u_{N_a})\)
‘SQL’ - RL/ADP via stacked Q-learning
\(J_a \left( y_1, \{action\}_1^{N_a} \right) = \sum_{k=1}^{N_a-1} \gamma^{k-1} \hat Q^{\theta}(y_{N_a}, u_{N_a})\)
Here, \(\theta\) are the critic parameters (neural network weights, say) and \(y_1\) is the current observation.
Add your specification into the table when customizing the agent.
- Type
: string
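The 'MPC' and 'RQL' cost functions from the table can be sketched as follows. This is a minimal illustration, not the library's code: the function names `mpc_cost` and `rql_cost`, the example stage objective, and the stand-in critic are all assumptions made for the example. The lists `observations` and `actions` hold \(y_1, \dots, y_{N_a}\) and \(u_1, \dots, u_{N_a}\).

```python
import numpy as np

def mpc_cost(observations, actions, stage_obj, gamma=1.0):
    """'MPC' row: discounted sum of stage objectives over all N_a steps."""
    # enumerate starts at 0, which matches the exponent k-1 for k = 1..N_a
    return sum(gamma**k * stage_obj(y, u)
               for k, (y, u) in enumerate(zip(observations, actions)))

def rql_cost(observations, actions, stage_obj, critic, gamma=1.0):
    """'RQL' row: N_a - 1 discounted stage objectives plus the critic at step N_a."""
    running = mpc_cost(observations[:-1], actions[:-1], stage_obj, gamma)
    return running + critic(observations[-1], actions[-1])

# Hypothetical stage objective and critic, for illustration only
def stage_obj(y, u):
    return float(y @ y + u @ u)

def critic(y, u):
    return 2.0 * stage_obj(y, u)

observations = [np.array([1.0, 0.0]), np.array([0.5, 0.0]), np.array([0.0, 0.0])]
actions = [np.array([1.0]), np.array([0.0]), np.array([1.0])]

J_mpc = mpc_cost(observations, actions, stage_obj, gamma=0.5)
J_rql = rql_cost(observations, actions, stage_obj, critic, gamma=0.5)
```

In both cases the optimizer would search over the action sequence, while the observations along the horizon come from the prediction model.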
- ctrl_bnds
Box control constraints. The first element in each row is the lower bound, the second is the upper bound. If empty, control is unconstrained (default).
- Type
: array of shape [dim_input, 2]
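A bounds array of the documented shape might look like the one below. The helper `saturate` is hypothetical and only illustrates how the rows are read (lower bound first, upper bound second). How the controller itself enforces the bounds is not stated here.

```python
import numpy as np

dim_input = 2
ctrl_bnds = np.array([[-1.0, 1.0],    # lower and upper bound of the first input
                      [-0.5, 0.5]])   # lower and upper bound of the second input

def saturate(action, ctrl_bnds):
    """Clip an action to the box; empty bounds are treated as unconstrained."""
    ctrl_bnds = np.asarray(ctrl_bnds)
    if ctrl_bnds.size == 0:
        return np.asarray(action, dtype=float)
    return np.clip(action, ctrl_bnds[:, 0], ctrl_bnds[:, 1])

clipped = saturate([2.0, -0.2], ctrl_bnds)
```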
- action_init
Initial action to initialize optimizers.
- Type
: array of shape [dim_input, ]
- t0
Initial value of the controller’s internal clock.
- Type
: number
- sampling_time
Controller’s sampling time (in seconds).
- Type
: number
- Nactor
Size of prediction horizon \(N_a\).
- Type
: natural number
- pred_step_size
Prediction step size in \(J_a\) as defined above (in seconds). Should be a multiple of sampling_time. Commonly equal to it, but left adjustable here for convenience. A larger prediction step size leads to a longer factual horizon.
- Type
: number
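As a small arithmetic sketch, assuming the factual horizon is simply Nactor prediction steps of length pred_step_size, the horizon in seconds and the multiple-of-sampling_time requirement can be checked like this. Both helper names are hypothetical.

```python
def horizon_length(Nactor, pred_step_size):
    """Factual prediction horizon in seconds (assumed to be Nactor steps)."""
    return Nactor * pred_step_size

def is_multiple(pred_step_size, sampling_time, tol=1e-9):
    """Check that pred_step_size is a positive integer multiple of sampling_time."""
    ratio = pred_step_size / sampling_time
    return round(ratio) >= 1 and abs(ratio - round(ratio)) < tol

short_horizon = horizon_length(Nactor=5, pred_step_size=0.1)
long_horizon = horizon_length(Nactor=5, pred_step_size=0.3)
```

With the same Nactor, tripling pred_step_size triples the horizon covered by the predictions.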
- sys_rhs, sys_out
Functions that represent the right-hand side and the output, respectively, of the exogenously passed model. The latter could be, for instance, the true model of the system. In turn, state_sys represents the (true) current state of the system and should be updated accordingly. The parameters sys_rhs, sys_out, state_sys are used in those controller modes which rely on them.
- Type
: functions
- prob_noise_pow
Power of probing noise during an initial phase to fill the estimator’s buffer before applying optimal control.
- Type
: number
- is_est_model
Flag whether to estimate a system model. See _estimate_model().
- Type
: number
- model_est_stage
Initial time segment to fill the estimator’s buffer before applying optimal control (in seconds).
- Type
: number
- model_est_period
Time between model estimate updates (in seconds).
- Type
: number
- buffer_size
Size of the buffer to store data.
- Type
: natural number
- model_order
Order of the state-space estimation model
\[\begin{array}{ll} \hat x^+ & = A \hat x + B action, \newline observation^+ & = C \hat x + D action. \end{array}\]
See _estimate_model(). This is just a particular model estimator. When customizing, _estimate_model() may be changed and, in turn, the parameter model_order as well. For instance, you might want to use an artificial neural net and specify its layers and numbers of neurons, in which case model_order could be replaced by, say, Nlayers, Nneurons.
- Type
: natural number
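One step of the state-space model above can be sketched as follows. The function `model_step` and the example matrices are assumptions for illustration. Here model_order corresponds to the dimension of \(\hat x\), and the matrices would normally come from the estimator.

```python
import numpy as np

def model_step(A, B, C, D, x_hat, action):
    """One step of the estimated model, following the two equations as written."""
    x_hat_next = A @ x_hat + B @ action
    observation_next = C @ x_hat + D @ action
    return x_hat_next, observation_next

# Hypothetical second-order model with one input and one output
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

x_hat = np.array([1.0, 0.0])
action = np.array([0.5])
x_hat_next, observation_next = model_step(A, B, C, D, x_hat, action)
```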
- model_est_checks
Estimated model parameters can be stored in stacks and the best among the last model_est_checks ones is picked. May improve the prediction quality somewhat.
- Type
: natural number
- gamma
Discounting factor. Characterizes fading of stage objectives along horizon.
- Type
: number in (0, 1]
- Ncritic
Critic stack size \(N_c\). The critic optimizes the temporal error which is a measure of critic’s ability to capture the optimal infinite-horizon cost (a.k.a. the value function). The temporal errors are stacked up using the said buffer.
- Type
: natural number
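The exact critic loss is not spelled out here. As a rough sketch, assuming a critic that is linear in its features, \(\hat Q^{\theta}(y, u) = w^\top \varphi(y, u)\), and a standard one-step temporal error, the stacked error over \(N_c\) buffered samples could look like the following. The function name and the loss form are assumptions, not the library's definition.

```python
import numpy as np

def stacked_temporal_error(w, features, stage_costs, gamma=1.0):
    """Half the sum of squared one-step temporal errors over the critic stack.

    features:    array of shape (N_c, n_features), oldest sample first
    stage_costs: array of shape (N_c,), stage objective at each sample
    """
    q = features @ w                                  # critic values along the stack
    td = q[:-1] - gamma * q[1:] - stage_costs[:-1]    # N_c - 1 temporal errors
    return 0.5 * float(td @ td)

w = np.array([1.0, 2.0])
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
stage_costs = np.array([0.5, 0.5, 0.0])
loss = stacked_temporal_error(w, features, stage_costs, gamma=0.5)
```

A critic update would then minimize this quantity over `w`.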
- critic_period
The same meaning as model_est_period.
- Type
: number
- critic_struct
Choice of the structure of the critic’s features.
Currently available:
Critic structures (each mode is followed by its structure):
‘quad-lin’
Quadratic-linear
‘quadratic’
Quadratic
‘quad-nomix’
Quadratic, no mixed terms
‘quad-mix’
Quadratic, with mixed output-input terms but no mixed terms within the output or within the input, i.e., \(w_1 y_1^2 + \dots + w_p y_p^2 + w_{p+1} y_1 u_1 + \dots + w_{\bullet} u_1^2 + \dots\), where \(w\) is the critic’s weight vector
Add your specification into the table when customizing the critic.
- Type
: string
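The feature structures in the table can be illustrated with the sketch below. It is an interpretation of the table, not the library's code. The function name `critic_features` and the ordering of the features are assumptions, and 'quad-mix' follows the formula given in its row.

```python
import numpy as np

def critic_features(observation, action, critic_struct="quad-nomix"):
    """Feature vector of the critic for a given structure (illustrative only)."""
    y = np.asarray(observation, dtype=float)
    u = np.asarray(action, dtype=float)
    chi = np.concatenate([y, u])
    # All distinct products chi_i * chi_j with i <= j
    full_quadratic = np.outer(chi, chi)[np.triu_indices(chi.size)]
    if critic_struct == "quad-lin":
        return np.concatenate([full_quadratic, chi])
    if critic_struct == "quadratic":
        return full_quadratic
    if critic_struct == "quad-nomix":
        return chi**2
    if critic_struct == "quad-mix":
        # Squares of outputs, output-input cross terms, squares of inputs
        return np.concatenate([y**2, np.kron(y, u), u**2])
    raise ValueError(f"Unknown critic structure: {critic_struct}")

y, u = [1.0, 2.0], [3.0]
phi_nomix = critic_features(y, u, "quad-nomix")
phi_mix = critic_features(y, u, "quad-mix")
phi_quad = critic_features(y, u, "quadratic")
phi_quadlin = critic_features(y, u, "quad-lin")
```

The critic value is then the inner product of the weight vector \(w\) with the chosen feature vector.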
- stage_obj_struct
Choice of the stage objective structure.
Currently available:
Running objective structures (each mode is followed by its structure):
‘quadratic’
Quadratic \(\chi^\top R_1 \chi\), where \(\chi = [observation, action]\); stage_obj_pars should be [R1]
‘biquadratic’
4th order \(\left( \chi^\top \right)^2 R_2 \left( \chi \right)^2 + \chi^\top R_1 \chi\), where \(\chi = [observation, action]\); stage_obj_pars should be [R1, R2]
Pass correct stage objective parameters in stage_obj_pars (as a list). When customizing the stage objective, add your specification into the table above.
- Type
: string
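The two stage objective structures can be sketched as below. The standalone function is hypothetical and mirrors only the formulas in the table. The squares in the 'biquadratic' case are read as elementwise squares of \(\chi\), which is an assumption.

```python
import numpy as np

def stage_obj(observation, action, stage_obj_struct="quadratic", stage_obj_pars=()):
    """Stage objective for chi = [observation, action] (illustrative only)."""
    chi = np.concatenate([observation, action])
    if stage_obj_struct == "quadratic":
        R1 = stage_obj_pars[0]
        return float(chi @ R1 @ chi)
    if stage_obj_struct == "biquadratic":
        R1, R2 = stage_obj_pars
        chi_sq = chi**2   # elementwise square (assumed reading of the formula)
        return float(chi_sq @ R2 @ chi_sq + chi @ R1 @ chi)
    raise ValueError(f"Unknown stage objective structure: {stage_obj_struct}")

observation = np.array([1.0, 2.0])
action = np.array([3.0])
R1 = np.diag([1.0, 1.0, 2.0])
R2 = 0.1 * np.eye(3)

rho_quadratic = stage_obj(observation, action, "quadratic", [R1])
rho_biquadratic = stage_obj(observation, action, "biquadratic", [R1, R2])
```

Note that R1 and R2 must be square with side length dim_output + dim_input.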
References
- 1
Osinenko, Pavel, et al. “Stacked adaptive dynamic programming with unknown system model.” IFAC-PapersOnLine 50.1 (2017): 4150-4155
- __init__(dim_input, dim_output, mode='MPC', ctrl_bnds=[], action_init=[], t0=0, sampling_time=0.1, Nactor=1, pred_step_size=0.1, sys_rhs=[], sys_out=[], state_sys=[], prob_noise_pow=1, is_est_model=0, model_est_stage=1, model_est_period=0.1, buffer_size=20, model_order=3, model_est_checks=0, gamma=1, Ncritic=4, critic_period=0.1, critic_struct='quad-nomix', stage_obj_struct='quadratic', stage_obj_pars=[], observation_target=[])
- Parameters
dim_input (integer) – Dimension of the input, which should comply with the system-to-be-controlled.
dim_output (integer) – Dimension of the output, which should comply with the system-to-be-controlled.
mode (string) –
Controller mode. Currently available (\(\rho\) is the stage objective, \(\gamma\) is the discounting factor):
Controller modes (each mode is followed by its cost function):
’MPC’ - Model-predictive control (MPC)
\(J_a \left( y_1, \{action\}_1^{N_a} \right)= \sum_{k=1}^{N_a} \gamma^{k-1} \rho(y_k, u_k)\)
’RQL’ - RL/ADP via \(N_a-1\) roll-outs of \(\rho\)
\(J_a \left( y_1, \{action\}_{1}^{N_a}\right) = \sum_{k=1}^{N_a-1} \gamma^{k-1} \rho(y_k, u_k) + \hat Q^{\theta}(y_{N_a}, u_{N_a})\)
’SQL’ - RL/ADP via stacked Q-learning
\(J_a \left( y_1, \{action\}_1^{N_a} \right) = \sum_{k=1}^{N_a-1} \gamma^{k-1} \hat Q^{\theta}(y_{N_a}, u_{N_a})\)
Here, \(\theta\) are the critic parameters (neural network weights, say) and \(y_1\) is the current observation.
Add your specification into the table when customizing the agent.
ctrl_bnds (array of shape [dim_input, 2]) – Box control constraints. The first element in each row is the lower bound, the second is the upper bound. If empty, control is unconstrained (default).
action_init (array of shape [dim_input, ]) – Initial action to initialize optimizers.
t0 (number) – Initial value of the controller’s internal clock.
sampling_time (number) – Controller’s sampling time (in seconds).
Nactor (natural number) – Size of prediction horizon \(N_a\).
pred_step_size (number) – Prediction step size in \(J_a\) as defined above (in seconds). Should be a multiple of sampling_time. Commonly equal to it, but left adjustable here for convenience. A larger prediction step size leads to a longer factual horizon.
sys_rhs, sys_out (functions) – Functions that represent the right-hand side and the output, respectively, of the exogenously passed model. The latter could be, for instance, the true model of the system. In turn, state_sys represents the (true) current state of the system and should be updated accordingly. The parameters sys_rhs, sys_out, state_sys are used in those controller modes which rely on them.
prob_noise_pow (number) – Power of probing noise during an initial phase to fill the estimator’s buffer before applying optimal control.
is_est_model (number) – Flag whether to estimate a system model. See _estimate_model().
model_est_stage (number) – Initial time segment to fill the estimator’s buffer before applying optimal control (in seconds).
model_est_period (number) – Time between model estimate updates (in seconds).
buffer_size (natural number) – Size of the buffer to store data.
model_order (natural number) –
Order of the state-space estimation model
\[\begin{array}{ll} \hat x^+ & = A \hat x + B action, \newline observation^+ & = C \hat x + D action. \end{array}\]
See _estimate_model(). This is just a particular model estimator. When customizing, _estimate_model() may be changed and, in turn, the parameter model_order as well. For instance, you might want to use an artificial neural net and specify its layers and numbers of neurons, in which case model_order could be replaced by, say, Nlayers, Nneurons.
model_est_checks (natural number) – Estimated model parameters can be stored in stacks and the best among the last model_est_checks ones is picked. May improve the prediction quality somewhat.
gamma (number in (0, 1]) – Discounting factor. Characterizes fading of stage objectives along horizon.
Ncritic (natural number) – Critic stack size \(N_c\). The critic optimizes the temporal error which is a measure of critic’s ability to capture the optimal infinite-horizon cost (a.k.a. the value function). The temporal errors are stacked up using the said buffer.
critic_period (number) – The same meaning as model_est_period.
critic_struct (string) –
Choice of the structure of the critic’s features.
Currently available:
Critic feature structures (each mode is followed by its structure):
’quad-lin’
Quadratic-linear
’quadratic’
Quadratic
’quad-nomix’
Quadratic, no mixed terms
’quad-mix’
Quadratic, with mixed output-input terms but no mixed terms within the output or within the input, i.e., \(w_1 y_1^2 + \dots + w_p y_p^2 + w_{p+1} y_1 u_1 + \dots + w_{\bullet} u_1^2 + \dots\), where \(w\) is the critic’s weight vector
Add your specification into the table when customizing the critic.
stage_obj_struct (string) –
Choice of the stage objective structure.
Currently available:
Running objective structures (each mode is followed by its structure):
’quadratic’
Quadratic \(\chi^\top R_1 \chi\), where \(\chi = [observation, action]\); stage_obj_pars should be [R1]
’biquadratic’
4th order \(\left( \chi^\top \right)^2 R_2 \left( \chi \right)^2 + \chi^\top R_1 \chi\), where \(\chi = [observation, action]\); stage_obj_pars should be [R1, R2]
Methods
__init__(dim_input, dim_output[, mode, …])
Constructor. dim_input and dim_output are the dimensions of input and output, which should comply with the system-to-be-controlled.
compute_action(t, observation)
Main method.
receive_sys_state(state)
Fetch exogenous model state.
reset(t0)
Resets agent for use in multi-episode simulation.
stage_obj(observation, action)
Stage (equivalently, instantaneous or running) objective.
upd_accum_obj(observation, action)
Sample-to-sample accumulated (summed up or integrated) stage objective.
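The accumulated stage objective can be pictured as a running sum of the stage objective over controller samples. The class below is a hypothetical stand-in, not the library's implementation. It assumes the "integrated" variant is a rectangle-rule sum weighted by sampling_time.

```python
class AccumulatedObjective:
    """Running rectangle-rule integral of a stage objective (illustrative only)."""

    def __init__(self, stage_obj, sampling_time):
        self.stage_obj = stage_obj
        self.sampling_time = sampling_time
        self.value = 0.0

    def update(self, observation, action):
        # Add the current stage objective, weighted by the sampling interval
        self.value += self.stage_obj(observation, action) * self.sampling_time
        return self.value

# Hypothetical scalar stage objective
accum = AccumulatedObjective(lambda y, u: y**2 + u**2, sampling_time=0.1)
accum.update(1.0, 1.0)   # stage objective 2.0
accum.update(1.0, 0.0)   # stage objective 1.0
total = accum.value
```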