Chapter 4: Problem 77
In a Markov decision problem, another criterion often used, different from the expected average return per unit time, is that of the expected discounted return. In this criterion we choose a number \(\alpha\), \(0<\alpha<1\), and try to choose a policy so as to maximize \(E\left[\sum_{i=0}^{\infty} \alpha^{i} R\left(X_{i}, a_{i}\right)\right]\) (that is, rewards at time \(n\) are discounted at rate \(\alpha^{n}\)). Suppose that the initial state is chosen according to the probabilities \(b_{i}\). That is,
$$
P\{X_{0}=i\}=b_{i}, \quad i=1, \ldots, n
$$
For a given policy \(\beta\), let \(y_{ja}\) denote the expected discounted time that the process is in state \(j\) and action \(a\) is chosen. That is,
$$
y_{ja}=E_{\beta}\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_{n}=j,\, a_{n}=a\}}\right]
$$
where for any event \(A\) the indicator variable \(I_{A}\) is defined by
$$
I_{A}=\begin{cases} 1, & \text{if } A \text{ occurs} \\ 0, & \text{otherwise} \end{cases}
$$
(a) Show that
$$
\sum_{a} y_{ja}=E\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_{n}=j\}}\right]
$$
or, in other words, \(\sum_{a} y_{ja}\) is the expected discounted time in state \(j\) under \(\beta\).
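Because parts (a) and (b) are bookkeeping with indicator variables, the quantities involved are easy to check numerically. The sketch below uses an entirely invented example (a hypothetical two-state, two-action chain with a made-up stationary randomized policy and initial distribution; none of the numbers come from the problem). It accumulates \(y_{ja}=\sum_{n} \alpha^{n} P\{X_{n}=j, a_{n}=a\}\) directly, truncating the sum once \(\alpha^{n}\) is negligible.

```python
# Numerical illustration of the discounted occupancies y_{ja}.
# Everything numeric here (two states, two actions, transition matrices,
# policy, initial distribution) is invented for illustration only.

ALPHA = 0.9                       # discount factor, 0 < alpha < 1
b = [0.5, 0.5]                    # P{X_0 = i} = b_i
# P[a][i][j] = P_ij(a): probability of moving i -> j under action a
P = {
    0: [[0.7, 0.3], [0.4, 0.6]],
    1: [[0.2, 0.8], [0.9, 0.1]],
}
# beta[i][a]: a fixed stationary randomized policy
beta = [[0.6, 0.4], [0.3, 0.7]]

N = 2000                          # truncation point; alpha^N is negligible
p = b[:]                          # p[j] = P{X_n = j}, starting at n = 0
y = [[0.0, 0.0], [0.0, 0.0]]      # y[j][a] ~ sum_n alpha^n P{X_n=j, a_n=a}
disc = 1.0                        # holds alpha^n
for _ in range(N):
    for j in range(2):
        for a in range(2):
            y[j][a] += disc * p[j] * beta[j][a]
    # one step of the chain under the policy: P{X_{n+1} = j}
    p = [sum(p[i] * beta[i][a] * P[a][i][j] for i in range(2) for a in range(2))
         for j in range(2)]
    disc *= ALPHA

total = sum(sum(row) for row in y)
print(total)                      # part (b): close to 1/(1 - alpha) = 10
```

The printed total illustrates the first identity of part (b), and the balance equation \(\sum_{a} y_{ja}=b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a)\) can be verified on the computed `y` in the same way.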
(b) Show that
$$
\begin{aligned}
\sum_{j} \sum_{a} y_{ja} &= \frac{1}{1-\alpha}, \\
\sum_{a} y_{ja} &= b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a)
\end{aligned}
$$
Hint: For the second equation, use the identity
$$
I_{\{X_{n+1}=j\}}=\sum_{i} \sum_{a} I_{\{X_{n}=i,\, a_{n}=a\}} I_{\{X_{n+1}=j\}}
$$
Take expectations of the preceding to obtain
$$
E\left[I_{\{X_{n+1}=j\}}\right]=\sum_{i} \sum_{a} E\left[I_{\{X_{n}=i,\, a_{n}=a\}}\right] P_{ij}(a)
$$

(c) Let \(\{y_{ja}\}\) be a set of numbers satisfying
$$
\begin{aligned}
\sum_{j} \sum_{a} y_{ja} &= \frac{1}{1-\alpha}, \\
\sum_{a} y_{ja} &= b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a)
\end{aligned}
$$
Argue that \(y_{ja}\) can be interpreted as the expected discounted time that the process is in state \(j\) and action \(a\) is chosen when the initial state is chosen according to the probabilities \(b_{j}\) and the policy \(\beta\), given by
$$
\beta_{i}(a)=\frac{y_{ia}}{\sum_{a} y_{ia}}
$$
is employed. Hint: Derive a set of equations for the expected discounted times when policy \(\beta\) is used and show that they are equivalent to Equation (4.38).

(d) Argue that an optimal policy with respect to the expected discounted return criterion can be obtained by first solving the linear program
$$
\begin{array}{ll}
\operatorname{maximize} & \sum_{j} \sum_{a} y_{ja} R(j, a), \\
\text{such that} & \sum_{j} \sum_{a} y_{ja}=\frac{1}{1-\alpha}, \\
& \sum_{a} y_{ja}=b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a), \\
& y_{ja} \geqslant 0, \quad \text{all } j, a
\end{array}
$$
and then defining the policy \(\beta^{*}\) by
$$
\beta_{i}^{*}(a)=\frac{y_{ia}^{*}}{\sum_{a} y_{ia}^{*}}
$$
where the \(y_{ja}^{*}\) are the solutions of the linear program.
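In practice the linear program of part (d) would be handed to an LP solver (for example `scipy.optimize.linprog`, with the two balance constraints as equality rows). For a tiny example, though, one can instead exploit the standard fact that a discounted MDP always admits an optimal *deterministic* stationary policy and simply enumerate those policies. The sketch below does exactly that on an invented two-state, two-action example (all transition probabilities and rewards \(R(j,a)\) are made-up numbers), evaluating each deterministic policy by value iteration.

```python
# Brute-force search over deterministic stationary policies for a discounted
# MDP, as a stand-in for the linear program of part (d). All numbers are
# invented for illustration.
from itertools import product

ALPHA = 0.9                       # discount factor
b = [0.5, 0.5]                    # initial distribution P{X_0 = i}
# P[a][i][j] = P_ij(a): transition probabilities under action a
P = {
    0: [[0.7, 0.3], [0.4, 0.6]],
    1: [[0.2, 0.8], [0.9, 0.1]],
}
R = [[1.0, 0.0], [0.0, 2.0]]      # R[j][a]: one-step reward, invented

def discounted_return(policy):
    """Expected discounted return of a deterministic policy (tuple of actions),
    via value iteration on V_i = R(i, a_i) + alpha * sum_j P_ij(a_i) V_j."""
    V = [0.0, 0.0]
    for _ in range(2000):         # alpha^2000 is negligible, so V converges
        V = [R[i][policy[i]] +
             ALPHA * sum(P[policy[i]][i][j] * V[j] for j in range(2))
             for i in range(2)]
    return sum(b[i] * V[i] for i in range(2))

# Enumerate all deterministic policies (one action per state) and keep the best.
best = max(product(range(2), repeat=2), key=discounted_return)
print(best, discounted_return(best))
```

For this example the search selects the policy that takes the rewarding action in each state. The LP of part (d) would recover the same answer: at an optimal basic solution \(y^{*}_{ja}>0\) for only one action per state, so the formula \(\beta_{i}^{*}(a)=y_{ia}^{*}/\sum_{a} y_{ia}^{*}\) produces this same deterministic policy.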