MDP Toolbox for MATLAB

mdp_eval_policy_iterative

Evaluates a policy using iterations of the Bellman operator.

Syntax

Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy)
Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy, V0)
Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy, V0, epsilon)
Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy, V0, epsilon, max_iter)

Description

mdp_eval_policy_iterative evaluates the value function associated with a policy by applying the Bellman operator iteratively.
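
As a rough illustration of what the iteration computes (a minimal sketch, not the toolbox's actual source; the variable names S, Ppolicy, Rpolicy, V and the exact stopping test are assumptions for this example), the update restricted to the given policy can be written as:

% Minimal sketch of iterative policy evaluation; illustration only, not the
% toolbox implementation. Assumes P is an (SxSxA) array, R an (SxA) array,
% and policy an (Sx1) vector of action indices.
S = size(P, 1);
Ppolicy = zeros(S, S);
Rpolicy = zeros(S, 1);
for s = 1:S
    a = policy(s);
    Ppolicy(s, :) = P(s, :, a);   % transition probabilities under the chosen action
    Rpolicy(s) = R(s, a);         % reward when following the policy in state s
end
V = V0;                           % V0 defaults to the all-zero vector
for k = 1:max_iter
    Vnext = Rpolicy + discount * (Ppolicy * V);   % Bellman operator for the fixed policy
    if max(abs(Vnext - V)) < epsilon              % stop when the variation is small enough
        V = Vnext;
        break
    end
    V = Vnext;
end
Vpolicy = V;

The toolbox's actual stopping rule and the computed bound on max_iter may differ in detail; this sketch only shows the structure of the iteration.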

Arguments

P can be a 3-dimensional array (SxSxA) or a cell array (1xA), where each cell contains a sparse (SxS) matrix.
R can be a 3-dimensional array (SxSxA), a cell array (1xA) whose cells contain sparse (SxS) matrices, or a 2-D array (SxA), possibly sparse.
discount is a real number in [0; 1[, that is, 0 <= discount < 1.
policy is an (Sx1) vector; each element is an integer corresponding to an action.
V0 is an (Sx1) vector giving an initial guess of the value function.
By default, V0 is the all-zero vector.
epsilon is a real number greater than 0.
By default, epsilon = 0.01.
max_iter is an integer greater than 0. If the value given as argument is greater than a computed bound, a warning indicates that the computed bound will be used instead.
By default, max_iter = 1000.
A call supplying all optional arguments is shown below.
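
For example, a call that supplies all optional arguments might look like this (the values are chosen purely for illustration; P, R and policy are assumed to be defined as in the Example section below):

>> V0 = zeros(2, 1);
>> Vpolicy = mdp_eval_policy_iterative(P, R, 0.8, policy, V0, 1e-4, 500);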

Evaluation

Vpolicy is a (Sx1) vector.

Example
The iteration trace printed between the call and the result is the verbose-mode display.

>> P(:,:,1) = [ 0.5 0.5;   0.8 0.2 ];
>> P(:,:,2) = [ 0 1;   0.1 0.9 ];
>> R = [ 5 10;   -1 2 ];
>> policy = [2;   1];

>> Vpolicy = mdp_eval_policy_iterative(P, R, 0.8, policy)
   Iteration   V_variation
     1      10
     2      6.24
     3      4.992
     4      3.2727
     5      2.6182
     6      1.7993
     7      1.4394
     8      1.0306
     9      0.82446
     10      0.61003
     11      0.48802
     12      0.37013
     13      0.2961
     14      0.22857
     15      0.18286
     16      0.14288
     17      0.1143
     18      0.090049
     19      0.072039
     20      0.05706
     21      0.045648
     22      0.036285
     23      0.029028
     24      0.023126
     25      0.018501
     26      0.014762
     27      0.011809
     28      0.0094313
     29      0.0075451
     30      0.0060295
     31      0.0048236
     32      0.0038562
     33      0.0030849
     34      0.0024668
     35      0.0019735
     36      0.0015783
     37      0.0012627
     38      0.0010099
     39      0.00080795
     40      0.00064629
     41      0.00051703
     42      0.00041359
     43      0.00033087
     44      0.00026469
     45      0.00021175
     46      0.00016939
     47      0.00013552
     48      0.00010841
     49      8.6728e-05
MDP Toolbox: iterations stopped, epsilon-optimal value function
Vpolicy =
   23.1704
   16.4631

In the above example, P can be a cell array containing sparse matrices:
>> P{1} = sparse([ 0.5 0.5;  0.8 0.2 ]);
>> P{2} = sparse([ 0 1;  0.1 0.9 ]);
The function call is unchanged.
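
As a sanity check (not part of the toolbox itself; Ppolicy, Rpolicy and Vexact are names introduced only for this illustration), the fixed point that the iteration approaches can be computed directly by solving the linear system V = Rpolicy + discount * Ppolicy * V for the policy used above:

>> Ppolicy = [ P(1,:,2); P(2,:,1) ];   % row s is the transition row for action policy(s)
>> Rpolicy = [ R(1,2); R(2,1) ];       % reward in each state under the chosen action
>> Vexact = (eye(2) - 0.8 * Ppolicy) \ Rpolicy;

This gives approximately (23.1707, 16.4634), which the iterative result above matches to within the default epsilon.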

