MDP Toolbox for MATLAB

mdp_policy_iteration_modified

Solves discounted MDP with modified policy iteration algorithm.

Syntax

[V, policy, iter, cpu_time] = mdp_policy_iteration_modified(P, R, discount)
[V, policy, iter, cpu_time] = mdp_policy_iteration_modified(P, R, discount, epsilon)
[V, policy, iter, cpu_time] = mdp_policy_iteration_modified(P, R, discount, epsilon, max_iter)

Description

mdp_policy_iteration_modified applies the modified policy iteration algorithm to solve a discounted MDP. Like policy iteration, the algorithm improves the policy iteratively, but each policy evaluation step performs only a few iterations (max_iter) of value function updates instead of an exact evaluation.
Iteration stops when an epsilon-optimal policy is found.
This function uses verbose and silent modes. In verbose mode, the function displays the variation of V for each iteration.
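
The loop described above can be sketched in plain MATLAB as follows. This is a minimal illustration and not the toolbox source: the SxA reward layout, the fixed number of evaluation sweeps (n_eval), the initial value and the stopping threshold on the variation are all assumptions chosen for the sketch.

```matlab
% Small 2-state, 2-action problem (same numbers as the example on this page).
P = zeros(2, 2, 2);
P(:,:,1) = [0.5 0.5; 0.8 0.2];
P(:,:,2) = [0 1; 0.1 0.9];
R = [5 10; -1 2];            % assumed layout: R(s, a)
discount = 0.9;
epsilon = 0.01;
n_eval = 10;                 % assumed number of partial evaluation sweeps

[S, A] = size(R);
% Assumed start: a lower bound on the value of any policy.
V = min(R(:)) / (1 - discount) * ones(S, 1);
thresh = epsilon * (1 - discount) / discount;   % assumed stopping threshold
Q = zeros(S, A);
is_done = false;
while ~is_done
    % Policy improvement: one Bellman backup and a greedy policy.
    for a = 1:A
        Q(:,a) = R(:,a) + discount * P(:,:,a) * V;
    end
    [Vnext, policy] = max(Q, [], 2);

    % Variation of V, measured as the span of the update.
    d = Vnext - V;
    variation = max(d) - min(d);
    V = Vnext;

    if variation < thresh
        is_done = true;
    else
        % Partial policy evaluation: a few sweeps under the fixed policy.
        Ppol = zeros(S, S);
        Rpol = zeros(S, 1);
        for s = 1:S
            Ppol(s,:) = P(s,:,policy(s));
            Rpol(s) = R(s, policy(s));
        end
        for k = 1:n_eval
            V = Rpol + discount * Ppol * V;
        end
    end
end
```

With n_eval = 0 the loop reduces to value iteration, and with a very large n_eval it approaches standard policy iteration; the modified algorithm sits between the two.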

Arguments

P : transition probability array.

P can be a 3-dimensional array (SxSxA) or a cell array (1xA), each cell containing a sparse matrix (SxS).
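
A short sketch of the two accepted layouts for P, assuming a 2-state, 2-action problem. The row-sum check is an illustrative sanity test written for this sketch, not a toolbox call, and the names Pcell and is_stochastic are hypothetical.

```matlab
% Dense 3-D layout: P(s, s2, a) is the probability of moving
% from state s to state s2 under action a.
P = zeros(2, 2, 2);
P(:,:,1) = [0.5 0.5; 0.8 0.2];
P(:,:,2) = [0 1; 0.1 0.9];

% Equivalent cell layout: one sparse SxS matrix per action.
A = size(P, 3);
Pcell = cell(1, A);
for a = 1:A
    Pcell{a} = sparse(P(:,:,a));
end

% Illustrative check: every row of every matrix should sum to 1.
is_stochastic = true;
for a = 1:A
    row_sums = full(sum(Pcell{a}, 2));
    is_stochastic = is_stochastic && all(abs(row_sums - 1) < 1e-10);
end
```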

R : reward array.

R can be a 3-dimensional array (SxSxA), a cell array (1xA) with each cell containing a sparse matrix (SxS), or a 2D array (SxA), possibly sparse.
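
The reward layouts carry the same information at different levels of detail. As a hedged illustration, an SxSxA reward can be reduced to an SxA expected reward by weighting each transition reward with its probability. The variable names (R3, PR) and the reward values are made up for this sketch, and it is not claimed to be how the toolbox processes R internally.

```matlab
P = zeros(2, 2, 2);
P(:,:,1) = [0.5 0.5; 0.8 0.2];
P(:,:,2) = [0 1; 0.1 0.9];

% R3(s, s2, a): reward for the transition s -> s2 under action a.
R3 = zeros(2, 2, 2);
R3(:,:,1) = [5 5; -1 -1];
R3(:,:,2) = [10 10; 2 2];

% Expected immediate reward:
% PR(s, a) = sum over s2 of P(s, s2, a) * R3(s, s2, a).
S = size(P, 1);
A = size(P, 3);
PR = zeros(S, A);
for a = 1:A
    PR(:,a) = sum(P(:,:,a) .* R3(:,:,a), 2);
end
```

In this sketch the reward does not depend on the arrival state, so PR comes out as [5 10; -1 2], an SxA array of the kind accepted for R.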

discount : discount factor.

discount is a real number in ]0; 1].
If discount equals 1, a warning reminds the user to check the conditions of convergence.

epsilon (optional) : search for an epsilon-optimal policy.

epsilon is a real number in ]0; 1].

By default, epsilon = 0.01.

max_iter (optional) : maximum number of iterations to be done.

max_iter is an integer greater than 0.
By default, max_iter = 1000.

Evaluations

V : optimal value function.

V is a (Sx1) vector.

policy : optimal policy.

policy is a (Sx1) vector. Each element is an integer corresponding to an action which maximizes the value function.

iter : number of iterations.

cpu_time : CPU time used to run the program.

Example
The verbose mode display is shown after the function call.

>> P(:,:,1) = [ 0.5 0.5; 0.8 0.2 ];
>> P(:,:,2) = [ 0 1; 0.1 0.9 ];
>> R = [ 5 10; -1 2 ];

>> [V, policy, iter, cpu_time] = mdp_policy_iteration_modified(P, R, 0.9)
Iteration V_variation
1 8
2 1.6239
3 0.043773
4 0.0011799
5 3.1807e-05
V =
41.8656
35.4703
policy =
2
1
iter =
5
cpu_time =
0.0500

In the above example, P can be a cell array containing sparse matrices:
>> P{1} = sparse([ 0.5 0.5; 0.8 0.2 ]);
>> P{2} = sparse([ 0 1; 0.1 0.9 ]);
The function call is unchanged.
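
As a cross-check on the example, the policy printed above can be evaluated exactly by solving a linear system. This sketch assumes that the 2D array R is read as R(s, a) and uses only core MATLAB; the names Ppol, Rpol, Vexact and gap are hypothetical.

```matlab
P = zeros(2, 2, 2);
P(:,:,1) = [0.5 0.5; 0.8 0.2];
P(:,:,2) = [0 1; 0.1 0.9];
R = [5 10; -1 2];
discount = 0.9;
policy = [2; 1];                 % policy printed in the example

% Transition matrix and reward vector induced by the policy.
S = size(R, 1);
Ppol = zeros(S, S);
Rpol = zeros(S, 1);
for s = 1:S
    Ppol(s,:) = P(s,:,policy(s));
    Rpol(s) = R(s, policy(s));
end

% Solve (I - discount * Ppol) * Vexact = Rpol.
Vexact = (eye(S) - discount * Ppol) \ Rpol;

% Difference from the V printed in the example.
gap = Vexact - [41.8656; 35.4703];
```

Under these assumptions Vexact is about [42.44; 36.05], so the printed V lies roughly 0.58 below it in both states. The returned V therefore appears to be an approximation consistent with the epsilon-based stopping rule rather than the exact value of the policy.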

MDPtoolbox/documentation/mdp_policy_iteration_modified.html
Page created on July 31, 2001. Last update on August 31, 2009.