Chapter 4: Problem 77
In a Markov decision problem, another criterion often used, different from the expected average return per unit time, is that of the expected discounted return. In this criterion we choose a number \(\alpha\), \(0<\alpha<1\), and try to choose a policy so as to maximize \(E\left[\sum_{i=0}^{\infty} \alpha^{i} R\left(X_{i}, a_{i}\right)\right]\) (that is, rewards at time \(n\) are discounted at rate \(\alpha^{n}\)). Suppose that the initial state is chosen according to the probabilities \(b_{i}\). That is,
$$
P\{X_{0}=i\}=b_{i}, \quad i=1, \ldots, n
$$
For a given policy \(\beta\), let \(y_{ja}\) denote the expected discounted time that the process is in state \(j\) and action \(a\) is chosen. That is,
$$
y_{ja}=E_{\beta}\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_{n}=j,\, a_{n}=a\}}\right]
$$
where for any event \(A\) the indicator variable \(I_{A}\) is defined by
$$
I_{A}=\begin{cases} 1, & \text{if } A \text{ occurs} \\ 0, & \text{otherwise} \end{cases}
$$
(a) Show that
$$
\sum_{a} y_{ja}=E\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_{n}=j\}}\right]
$$
or, in other words, \(\sum_{a} y_{ja}\) is the expected discounted time in state \(j\) under \(\beta\).
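Because parts (a) and (b) are bookkeeping with indicator variables, the quantities involved are easy to check numerically. The sketch below uses an entirely invented example (a hypothetical two-state, two-action chain with a made-up stationary randomized policy and initial distribution; none of the numbers come from the problem). It accumulates \(y_{ja}=\sum_{n} \alpha^{n} P\{X_{n}=j, a_{n}=a\}\) directly, truncating the sum once \(\alpha^{n}\) is negligible.

```python
# Numerical illustration of the discounted occupancies y_{ja}.
# Everything numeric here (two states, two actions, transition matrices,
# policy, initial distribution) is invented for illustration only.

ALPHA = 0.9                       # discount factor, 0 < alpha < 1
b = [0.5, 0.5]                    # P{X_0 = i} = b_i
# P[a][i][j] = P_ij(a): probability of moving i -> j under action a
P = {
    0: [[0.7, 0.3], [0.4, 0.6]],
    1: [[0.2, 0.8], [0.9, 0.1]],
}
# beta[i][a]: a fixed stationary randomized policy
beta = [[0.6, 0.4], [0.3, 0.7]]

N = 2000                          # truncation point; alpha^N is negligible
p = b[:]                          # p[j] = P{X_n = j}, starting at n = 0
y = [[0.0, 0.0], [0.0, 0.0]]      # y[j][a] ~ sum_n alpha^n P{X_n=j, a_n=a}
disc = 1.0                        # holds alpha^n
for _ in range(N):
    for j in range(2):
        for a in range(2):
            y[j][a] += disc * p[j] * beta[j][a]
    # one step of the chain under the policy: P{X_{n+1} = j}
    p = [sum(p[i] * beta[i][a] * P[a][i][j] for i in range(2) for a in range(2))
         for j in range(2)]
    disc *= ALPHA

total = sum(sum(row) for row in y)
print(total)                      # part (b): close to 1/(1 - alpha) = 10
```

The printed total illustrates the first identity of part (b), and the balance equation \(\sum_{a} y_{ja}=b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a)\) can be verified on the computed `y` in the same way.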
(b) Show that
$$
\begin{aligned}
\sum_{j} \sum_{a} y_{ja} &= \frac{1}{1-\alpha}, \\
\sum_{a} y_{ja} &= b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a)
\end{aligned}
$$
Hint: For the second equation, use the identity
$$
I_{\{X_{n+1}=j\}}=\sum_{i} \sum_{a} I_{\{X_{n}=i,\, a_{n}=a\}} I_{\{X_{n+1}=j\}}
$$
Take expectations of the preceding to obtain
$$
E\left[I_{\{X_{n+1}=j\}}\right]=\sum_{i} \sum_{a} E\left[I_{\{X_{n}=i,\, a_{n}=a\}}\right] P_{ij}(a)
$$

(c) Let \(\{y_{ja}\}\) be a set of numbers satisfying
$$
\begin{aligned}
\sum_{j} \sum_{a} y_{ja} &= \frac{1}{1-\alpha}, \\
\sum_{a} y_{ja} &= b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a)
\end{aligned}
$$
Argue that \(y_{ja}\) can be interpreted as the expected discounted time that the process is in state \(j\) and action \(a\) is chosen when the initial state is chosen according to the probabilities \(b_{j}\) and the policy \(\beta\), given by
$$
\beta_{i}(a)=\frac{y_{ia}}{\sum_{a} y_{ia}}
$$
is employed. Hint: Derive a set of equations for the expected discounted times when policy \(\beta\) is used and show that they are equivalent to Equation (4.38).

(d) Argue that an optimal policy with respect to the expected discounted return criterion can be obtained by first solving the linear program
$$
\begin{array}{ll}
\operatorname{maximize} & \sum_{j} \sum_{a} y_{ja} R(j, a), \\
\text{such that} & \sum_{j} \sum_{a} y_{ja}=\frac{1}{1-\alpha}, \\
& \sum_{a} y_{ja}=b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a), \\
& y_{ja} \geqslant 0, \quad \text{all } j, a
\end{array}
$$
and then defining the policy \(\beta^{*}\) by
$$
\beta_{i}^{*}(a)=\frac{y_{ia}^{*}}{\sum_{a} y_{ia}^{*}}
$$
where the \(y_{ja}^{*}\) are the solutions of the linear program.
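In practice the linear program of part (d) would be handed to an LP solver (for example `scipy.optimize.linprog`, with the two balance constraints as equality rows). For a tiny example, though, one can instead exploit the standard fact that a discounted MDP always admits an optimal *deterministic* stationary policy and simply enumerate those policies. The sketch below does exactly that on an invented two-state, two-action example (all transition probabilities and rewards \(R(j,a)\) are made-up numbers), evaluating each deterministic policy by value iteration.

```python
# Brute-force search over deterministic stationary policies for a discounted
# MDP, as a stand-in for the linear program of part (d). All numbers are
# invented for illustration.
from itertools import product

ALPHA = 0.9                       # discount factor
b = [0.5, 0.5]                    # initial distribution P{X_0 = i}
# P[a][i][j] = P_ij(a): transition probabilities under action a
P = {
    0: [[0.7, 0.3], [0.4, 0.6]],
    1: [[0.2, 0.8], [0.9, 0.1]],
}
R = [[1.0, 0.0], [0.0, 2.0]]      # R[j][a]: one-step reward, invented

def discounted_return(policy):
    """Expected discounted return of a deterministic policy (tuple of actions),
    via value iteration on V_i = R(i, a_i) + alpha * sum_j P_ij(a_i) V_j."""
    V = [0.0, 0.0]
    for _ in range(2000):         # alpha^2000 is negligible, so V converges
        V = [R[i][policy[i]] +
             ALPHA * sum(P[policy[i]][i][j] * V[j] for j in range(2))
             for i in range(2)]
    return sum(b[i] * V[i] for i in range(2))

# Enumerate all deterministic policies (one action per state) and keep the best.
best = max(product(range(2), repeat=2), key=discounted_return)
print(best, discounted_return(best))
```

For this example the search selects the policy that takes the rewarding action in each state. The LP of part (d) would recover the same answer: at an optimal basic solution \(y^{*}_{ja}>0\) for only one action per state, so the formula \(\beta_{i}^{*}(a)=y_{ia}^{*}/\sum_{a} y_{ia}^{*}\) produces this same deterministic policy.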