In a Markov decision problem, another criterion often used, different from the expected average return per unit time, is that of the expected discounted return. In this criterion we choose a number \(\alpha\), \(0<\alpha<1\), and try to choose a policy so as to maximize \(E\left[\sum_{i=0}^{\infty} \alpha^{i} R\left(X_{i}, a_{i}\right)\right]\) (that is, rewards at time \(n\) are discounted at rate \(\alpha^{n}\)). Suppose that the initial state is chosen according to the probabilities \(b_{i}\). That is,
$$
P\{X_{0}=i\}=b_{i}, \quad i=1, \ldots, n
$$
For a given policy \(\beta\), let \(y_{ja}\) denote the expected discounted time that the process is in state \(j\) and action \(a\) is chosen. That is,
$$
y_{ja}=E_{\beta}\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_{n}=j,\, a_{n}=a\}}\right]
$$
where for any event \(A\) the indicator variable \(I_{A}\) is defined by
$$
I_{A}=\begin{cases} 1, & \text{if } A \text{ occurs} \\ 0, & \text{otherwise} \end{cases}
$$
(a) Show that
$$
\sum_{a} y_{ja}=E\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_{n}=j\}}\right]
$$
or, in other words, \(\sum_{a} y_{ja}\) is the expected discounted time in state \(j\) under \(\beta\).
(b) Show that
$$
\begin{aligned}
\sum_{j} \sum_{a} y_{ja} &=\frac{1}{1-\alpha}, \\
\sum_{a} y_{ja} &=b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a)
\end{aligned}
$$
Hint: For the second equation, use the identity
$$
I_{\{X_{n+1}=j\}}=\sum_{i} \sum_{a} I_{\{X_{n}=i,\, a_{n}=a\}}\, I_{\{X_{n+1}=j\}}
$$
Take expectations of the preceding to obtain
$$
E\left[I_{\{X_{n+1}=j\}}\right]=\sum_{i} \sum_{a} E\left[I_{\{X_{n}=i,\, a_{n}=a\}}\right] P_{ij}(a)
$$
(c) Let \(\{y_{ja}\}\) be a set of numbers satisfying
$$
\begin{aligned}
\sum_{j} \sum_{a} y_{ja} &=\frac{1}{1-\alpha}, \\
\sum_{a} y_{ja} &=b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a)
\end{aligned}
$$
Argue that \(y_{ja}\) can be interpreted as the expected discounted time that the process is in state \(j\) and action \(a\) is chosen when the initial state is chosen according to the probabilities \(b_{j}\) and the policy \(\beta\), given by
$$
\beta_{i}(a)=\frac{y_{ia}}{\sum_{a} y_{ia}},
$$
is employed.
Hint: Derive a set of equations for the expected discounted times when policy \(\beta\) is used and show that they are equivalent to Equation (4.38).
(d) Argue that an optimal policy with respect to the expected discounted return criterion can be obtained by first solving the linear program
$$
\begin{array}{ll}
\operatorname{maximize} & \sum_{j} \sum_{a} y_{ja} R(j, a), \\
\text{subject to} & \sum_{j} \sum_{a} y_{ja}=\frac{1}{1-\alpha}, \\
& \sum_{a} y_{ja}=b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a), \\
& y_{ja} \geqslant 0, \quad \text{all } j, a,
\end{array}
$$
and then defining the policy \(\beta^{*}\) by
$$
\beta_{i}^{*}(a)=\frac{y_{ia}^{*}}{\sum_{a} y_{ia}^{*}},
$$
where the \(y_{ja}^{*}\) are the solutions of the linear program.

Short Answer

In this problem we show that \(\sum_a y_{ja}\) equals the expected discounted time in state \(j\) under policy \(\beta\), establish the two identities \(\sum_j\sum_a y_{ja}=\frac{1}{1-\alpha}\) and \(\sum_a y_{ja}=b_j+\alpha\sum_i\sum_a y_{ia}P_{ij}(a)\), and show that any nonnegative solution of these equations consists of the expected discounted state–action occupancy times under the randomized policy \(\beta_i(a)=y_{ia}/\sum_a y_{ia}\). Finally, we argue that an optimal policy for the expected discounted return criterion is obtained by solving a linear program in the \(y_{ja}\) and defining \(\beta^*_i(a)=y^*_{ia}/\sum_a y^*_{ia}\) from its optimal solution.

Step by step solution

01

(a) Expected Discounted Time in State \(j\)

Given that \(y_{ja} = E_{\beta}\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_n=j,\, a_n=a\}}\right]\), we aim to show that
$$
\sum_{a} y_{ja}=E\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_{n}=j\}}\right]
$$
Summing the definition over \(a\) and interchanging the sums with the expectation (all terms are nonnegative), we get
\[
\begin{aligned}
\sum_a y_{ja} &= \sum_a E_{\beta}\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_n=j,\, a_n=a\}}\right] \\
&= E_{\beta}\left[\sum_{n=0}^{\infty} \alpha^{n} \sum_a I_{\{X_n=j,\, a_n=a\}}\right] \\
&= E_{\beta}\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_n=j\}}\right],
\end{aligned}
\]
where the last equality holds because exactly one action is chosen at each time, so \(\sum_a I_{\{X_n=j,\, a_n=a\}} = I_{\{X_n=j\}}\). Thus \(\sum_a y_{ja}\) is the expected discounted time spent in state \(j\) under \(\beta\), as required.
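The identity in part (a) can also be spot-checked numerically. The sketch below is a minimal Monte Carlo estimate of the \(y_{ja}\) for a made-up two-state, two-action MDP under an arbitrary stationary randomized policy; the transition array `P`, the policy `beta`, and the initial distribution `b` are illustrative assumptions, not data from the exercise. Summing the estimated \(y_{ja}\) over actions approximates the expected discounted time in each state.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9                                  # discount factor
b = np.array([0.5, 0.5])                     # initial distribution b_i
P = np.array([[[0.7, 0.3], [0.4, 0.6]],      # P[a][i][j] = P_ij(a), toy numbers
              [[0.2, 0.8], [0.9, 0.1]]])
beta = np.array([[0.3, 0.7], [0.6, 0.4]])    # beta[i][a] = P(choose a | state i)

def estimate_y(n_paths=10_000, horizon=200):
    """Monte Carlo estimate of y_{ja} = E_beta[sum_n alpha^n 1{X_n=j, a_n=a}],
    truncating the infinite sum at `horizon` steps (alpha**200 is negligible)."""
    y = np.zeros((2, 2))
    for _ in range(n_paths):
        x = rng.choice(2, p=b)
        discount = 1.0
        for _ in range(horizon):
            a = rng.choice(2, p=beta[x])
            y[x, a] += discount
            x = rng.choice(2, p=P[a, x])
            discount *= alpha
    return y / n_paths

y_hat = estimate_y()
print(y_hat.sum(axis=1))   # approx. expected discounted time in each state j
```

The estimates can be compared against the exact computation given after part (b).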
02

(b) Proof of Equalities

We need to show that:
1. \(\sum_{j} \sum_{a} y_{ja} =\frac{1}{1-\alpha}\)
2. \(\sum_{a} y_{ja} =b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a)\)

For (1), sum the result of part (a) over \(j\):
$$
\begin{aligned}
\sum_{j} \sum_{a} y_{ja} &= \sum_{j} E\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_{n}=j\}}\right] \\
&= E\left[\sum_{n=0}^{\infty} \alpha^{n} \sum_{j} I_{\{X_{n}=j\}}\right] \\
&= E\left[\sum_{n=0}^{\infty} \alpha^{n}\right] \\
&= \frac{1}{1-\alpha},
\end{aligned}
$$
since the process is in exactly one state at each time, so \(\sum_{j} I_{\{X_{n}=j\}}=1\).

For (2), use the identity from the hint,
$$
I_{\{X_{n+1}=j\}}=\sum_{i} \sum_{a} I_{\{X_{n}=i,\, a_{n}=a\}}\, I_{\{X_{n+1}=j\}},
$$
and take expectations, conditioning on the state and action at time \(n\):
$$
E\left[I_{\{X_{n+1}=j\}}\right]=\sum_{i} \sum_{a} E\left[I_{\{X_{n}=i,\, a_{n}=a\}}\right] P_{ij}(a).
$$
Now split off the \(n=0\) term in the result of part (a):
$$
\begin{aligned}
\sum_{a} y_{ja} &= E\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_{n}=j\}}\right] \\
&= E\left[I_{\{X_{0}=j\}}\right] + \sum_{n=0}^{\infty} \alpha^{n+1} E\left[I_{\{X_{n+1}=j\}}\right] \\
&= b_{j} + \alpha \sum_{n=0}^{\infty} \alpha^{n} \sum_{i} \sum_{a} E\left[I_{\{X_{n}=i,\, a_{n}=a\}}\right] P_{ij}(a) \\
&= b_{j} + \alpha \sum_{i} \sum_{a} E\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_{n}=i,\, a_{n}=a\}}\right] P_{ij}(a) \\
&= b_{j} + \alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a).
\end{aligned}
$$
Thus both equalities are proved.
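For a stationary randomized policy the \(y_{ja}\) can also be computed exactly, since \(y_{ja}=\beta_j(a)\sum_{n}\alpha^{n}P(X_n=j)\) and the discounted state occupancies solve a small linear system. The sketch below, using the same made-up data as above (`P`, `beta`, and `b` are illustrative assumptions), computes them this way and checks both identities of part (b).

```python
import numpy as np

alpha, b = 0.9, np.array([0.5, 0.5])
P = np.array([[[0.7, 0.3], [0.4, 0.6]],       # P[a][i][j] = P_ij(a), toy numbers
              [[0.2, 0.8], [0.9, 0.1]]])
beta = np.array([[0.3, 0.7], [0.6, 0.4]])     # beta[i][a]

# Chain induced by beta: P_beta[i][j] = sum_a beta_i(a) P_ij(a)
P_beta = np.einsum('ia,aij->ij', beta, P)

# v_j = sum_n alpha^n P(X_n = j) solves v = b + alpha * P_beta^T v
v = np.linalg.solve(np.eye(2) - alpha * P_beta.T, b)
y = v[:, None] * beta                          # y[j, a] = v_j * beta_j(a)

# Identity 1: total discounted time is 1 / (1 - alpha)
print(np.isclose(y.sum(), 1 / (1 - alpha)))
# Identity 2: sum_a y_ja = b_j + alpha * sum_{i,a} y_ia P_ij(a)
print(np.allclose(y.sum(axis=1),
                  b + alpha * np.einsum('ia,aij->j', y, P)))
```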
03

(c) Interpretation of \(y_{ja}\)

Given a set of numbers \(\{y_{ja}\}\) satisfying
1. \(\sum_{j} \sum_{a} y_{ja} =\frac{1}{1-\alpha}\)
2. \(\sum_{a} y_{ja} =b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a)\)

define the randomized stationary policy
$$
\beta_{i}(a)=\frac{y_{ia}}{\sum_{a} y_{ia}}.
$$
Let \(v_{ja}\) denote the expected discounted time that the process is in state \(j\) with action \(a\) chosen when the initial state is chosen according to the \(b_j\) and policy \(\beta\) is used. By part (b), the \(v_{ja}\) satisfy the same equations (1) and (2). Moreover, under \(\beta\) the action chosen when in state \(j\) is \(a\) with probability \(\beta_j(a)\), so \(v_{ja}=\beta_j(a)\sum_a v_{ja}\), while by construction \(y_{ja}=\beta_j(a)\sum_a y_{ja}\). Writing \(v_j=\sum_a v_{ja}\) and \(y_j=\sum_a y_{ja}\), equation (2) becomes
$$
v_j = b_j + \alpha \sum_i v_i P^{\beta}_{ij}, \qquad y_j = b_j + \alpha \sum_i y_i P^{\beta}_{ij},
$$
where \(P^{\beta}_{ij}=\sum_a \beta_i(a) P_{ij}(a)\). Since \(0<\alpha<1\), this system has a unique solution, so \(v_j=y_j\) for all \(j\), and hence \(v_{ja}=\beta_j(a)v_j=\beta_j(a)y_j=y_{ja}\). Thus the \(y_{ja}\) are exactly the expected discounted state–action occupancy times under \(\beta\), which is the claimed interpretation.
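As a sanity check of this fixed-point argument, the sketch below (reusing the same made-up two-state, two-action data as above; none of the numbers come from the exercise) takes a feasible set of \(y_{ja}\), forms \(\beta_i(a)=y_{ia}/\sum_a y_{ia}\), recomputes the discounted occupancy times under that \(\beta\), and confirms they reproduce the \(y_{ja}\).

```python
import numpy as np

alpha, b = 0.9, np.array([0.5, 0.5])
P = np.array([[[0.7, 0.3], [0.4, 0.6]],       # P[a][i][j] = P_ij(a), toy numbers
              [[0.2, 0.8], [0.9, 0.1]]])

def occupancies(beta):
    """Exact y[j, a] = beta_j(a) * sum_n alpha^n P(X_n = j) under policy beta."""
    P_beta = np.einsum('ia,aij->ij', beta, P)
    v = np.linalg.solve(np.eye(2) - alpha * P_beta.T, b)
    return v[:, None] * beta

# Any feasible y (here: the occupancy times of an arbitrary policy) satisfies the equations.
y = occupancies(np.array([[0.3, 0.7], [0.6, 0.4]]))

# Recover a policy from y and check that it regenerates y itself.
beta_from_y = y / y.sum(axis=1, keepdims=True)
print(np.allclose(occupancies(beta_from_y), y))   # True
```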
04

(d) Optimal Policy with Linear Program

We argue that an optimal policy with respect to the expected discounted return criterion can be obtained by first solving the linear program
$$
\begin{array}{ll}
\operatorname{maximize} & \sum_{j} \sum_{a} y_{ja} R(j, a), \\
\text{subject to} & \sum_{j} \sum_{a} y_{ja}=\frac{1}{1-\alpha}, \\
& \sum_{a} y_{ja}=b_{j}+\alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a), \\
& y_{ja} \geqslant 0, \quad \text{all } j, a,
\end{array}
$$
and then defining the policy \(\beta^{*}\) by
$$
\beta_{i}^{*}(a)=\frac{y_{ia}^{*}}{\sum_{a} y_{ia}^{*}},
$$
where the \(y_{ja}^{*}\) are the solutions of the linear program. By parts (b) and (c), the feasible set of this linear program is precisely the set of expected discounted state–action occupancy times achievable by stationary (possibly randomized) policies. For any policy with occupancy times \(y_{ja}\), the expected discounted return is
$$
E\left[\sum_{n=0}^{\infty}\alpha^{n}R(X_n,a_n)\right]
=\sum_{j}\sum_{a} R(j,a)\, E\left[\sum_{n=0}^{\infty}\alpha^{n} I_{\{X_n=j,\, a_n=a\}}\right]
=\sum_{j}\sum_{a} y_{ja} R(j,a),
$$
which is exactly the objective function. Hence maximizing the objective over the feasible set maximizes the expected discounted return over policies, and the policy \(\beta^{*}\) built from an optimal solution \(y^{*}_{ja}\) attains this maximum; it is therefore an optimal policy.
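A minimal sketch of this linear program using `scipy.optimize.linprog`, on the same made-up two-state, two-action data as before (the rewards `R` are also illustrative assumptions). The normalization \(\sum_j\sum_a y_{ja}=\frac{1}{1-\alpha}\) is implied by summing the flow constraints over \(j\), so only the latter are passed explicitly.

```python
import numpy as np
from scipy.optimize import linprog

alpha = 0.9
b = np.array([0.5, 0.5])                        # initial distribution
P = np.array([[[0.7, 0.3], [0.4, 0.6]],         # P[a][i][j] = P_ij(a), toy numbers
              [[0.2, 0.8], [0.9, 0.1]]])
R = np.array([[1.0, 2.0], [0.5, 3.0]])          # R[j][a], made-up rewards
n_states, n_actions = 2, 2

# Variables: y_ja flattened as index j * n_actions + a.
# Flow constraints: sum_a y_ja - alpha * sum_{i,a} y_ia P_ij(a) = b_j for each j.
A_eq = np.zeros((n_states, n_states * n_actions))
for j in range(n_states):
    for a in range(n_actions):
        A_eq[j, j * n_actions + a] += 1.0
        for i in range(n_states):
            A_eq[j, i * n_actions + a] -= alpha * P[a, i, j]

# linprog minimizes, so negate the rewards to maximize sum_{j,a} y_ja R(j,a).
res = linprog(c=-R.ravel(), A_eq=A_eq, b_eq=b, bounds=(0, None))

y_star = res.x.reshape(n_states, n_actions)
beta_star = y_star / y_star.sum(axis=1, keepdims=True)   # beta*_i(a) = y*_ia / sum_a y*_ia
print(y_star.sum())      # equals 1 / (1 - alpha) even though not imposed explicitly
print(beta_star)         # resulting policy (deterministic at a vertex solution)
```

Since every \(b_j>0\) here, each row sum \(\sum_a y^*_{ja}\geqslant b_j>0\), so the division defining \(\beta^*\) is well posed.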

Key Concepts

These are the key concepts you need to understand to accurately answer the question.

Understanding Expected Discounted Return
The concept of expected discounted return is crucial in understanding how to evaluate policies in a Markov decision problem. It provides a way to measure the cumulative future rewards of a decision, taking into account the fact that the value of rewards typically decreases—or is discounted—over time.

In mathematical terms, the expected discounted return is the expected sum of rewards obtained by following a particular policy, where each reward is discounted according to the time it is received. This discounting is usually represented by a factor \(\alpha\), where \(0 < \alpha < 1\). The reward received at time \(n\) is therefore discounted as \(\alpha^n\), which represents its present value.

When dealing with the expected discounted return, we also incorporate the initial probabilities (represented by \(b_i\)) of starting in a particular state. The whole idea is to prefer policies that not only yield high rewards but also bring those rewards sooner rather than later. This preference is because a dollar today is worth more than a dollar tomorrow due to factors such as inflation and opportunity cost.
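For a quick numeric illustration (the value \(\alpha = 0.9\) is chosen here purely for concreteness): a reward of 1 received at time \(n=3\) contributes \(\alpha^{3}=0.9^{3}=0.729\) to the expected discounted return, while a reward of 1 received every period is worth
$$
\sum_{n=0}^{\infty} \alpha^{n}\cdot 1 = \frac{1}{1-\alpha} = \frac{1}{1-0.9} = 10
$$
in total. This is also where the constraint \(\sum_j\sum_a y_{ja}=\frac{1}{1-\alpha}\) in the exercise comes from.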
Policy Optimization in Markov Decision Problems
The goal of policy optimization in Markov decision problems is to find the best policy—the one that will maximize the expected discounted return over time. In the context of our exercise, this means finding the policy \(\beta\) that maximizes the expected sum of discounted rewards from the starting state, and continuing indefinitely.

We use the term 'policy' to refer to a rule that specifies which action to take, or the probabilities with which actions are chosen, in each state. A central quantity in this optimization is \( y_{ja} \), the expected discounted time that the process spends in state \(j\) with action \(a\) chosen; these numbers summarize how a policy distributes its discounted time across state–action pairs.

Policies can also be improved iteratively: evaluate the value function of the current policy, then improve the policy by acting greedily with respect to those values, and repeat until the policy stops changing. This dynamic-programming scheme is known as 'policy iteration'; a sketch of the loop is given below.
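The following is a minimal policy-iteration sketch for a finite discounted MDP, included only to make the evaluate-then-improve loop concrete. The transition array `P`, rewards `R`, and discount factor are made-up illustrative values, and this is not the method the exercise itself asks for (the exercise uses the linear-programming formulation discussed next).

```python
import numpy as np

def policy_iteration(P, R, alpha):
    """Minimal policy iteration for a finite discounted MDP.

    P[a][i][j] = P_ij(a), R[i][a] = one-step reward, 0 < alpha < 1.
    Returns a deterministic policy (one action per state) and its value function.
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve V = R_pi + alpha * P_pi V for the current policy.
        P_pi = P[policy, np.arange(n_states), :]
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - alpha * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the evaluated values.
        Q = R + alpha * np.einsum('aij,j->ia', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Example with the same toy data used earlier (all numbers illustrative).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.9, 0.1]]])
R = np.array([[1.0, 2.0], [0.5, 3.0]])
print(policy_iteration(P, R, alpha=0.9))
```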
Leveraging Linear Programming for Decision Models
When solving for an optimal policy in Markov decision problems, we can utilize linear programming techniques. This approach is especially useful because it converts the decision-making problem into a mathematical optimization problem with constraints.

Linear programming (LP) involves maximizing or minimizing a linear objective function subject to a set of linear equality or inequality constraints. In our Markov decision problem, the linear program maximizes \( \sum_j \sum_a y_{ja} R(j, a) \), the occupancy-weighted sum of rewards, which, as shown in part (d), equals the expected discounted return of the corresponding policy.

The constraints are derived from the system's dynamics—specifically, the probabilities of transitioning from one state to another under certain actions. By solving the linear programming problem, we find the \( y_{ja}^* \) values that represent the optimal expected discounted time to be in each state-action pair. These values are then used to craft the optimal policy \( \beta^* \), which dictates the best action to take in each state.

Most popular questions from this chapter

A DNA nucleotide has any of four values. A standard model for a mutational change of the nucleotide at a specific location is a Markov chain model that supposes that in going from period to period the nucleotide does not change with probability \(1-3 \alpha\), and if it does change then it is equally likely to change to any of the other three values, for some \(0<\alpha<\frac{1}{3}\). (a) Show that \(P_{1,1}^{n}=\frac{1}{4}+\frac{3}{4}(1-4 \alpha)^{n}\). (b) What is the long-run proportion of time the chain is in each state?

A flea moves around the vertices of a triangle in the following manner: Whenever it is at vertex \(i\) it moves to its clockwise neighbor vertex with probability \(p_{i}\) and to the counterclockwise neighbor with probability \(q_{i}=1-p_{i}, i=1,2,3\). (a) Find the proportion of time that the flea is at each of the vertices. (b) How often does the flea make a counterclockwise move that is then followed by five consecutive clockwise moves?

Let \(Y_{n}\) be the sum of \(n\) independent rolls of a fair die. Find \(\lim_{n \rightarrow \infty} P\{Y_{n} \text{ is a multiple of } 13\}\). Hint: Define an appropriate Markov chain and apply the results of Exercise 20.

Consider a population of individuals each of whom possesses two genes that can be either type \(A\) or type \(a\). Suppose that in outward appearance type \(A\) is dominant and type \(a\) is recessive. (That is, an individual will have only the outward characteristics of the recessive gene if its pair is aa.) Suppose that the population has stabilized, and the percentages of individuals having respective gene pairs \(A A, a a\), and \(A a\) are \(p, q\), and \(r .\) Call an individual dominant or recessive depending on the outward characteristics it exhibits. Let \(S_{11}\) denote the probability that an offspring of two dominant parents will be recessive; and let \(S_{10}\) denote the probability that the offspring of one dominant and one recessive parent will be recessive. Compute \(S_{11}\) and \(S_{10}\) to show that \(S_{11}=S_{10}^{2} .\) (The quantities \(S_{10}\) and \(S_{11}\) are known in the genetics literature as Snyder's ratios.)

Let \(\{X_{n}, n \geqslant 0\}\) denote an ergodic Markov chain with limiting probabilities \(\pi_{i}\). Define the process \(\{Y_{n}, n \geqslant 1\}\) by \(Y_{n}=(X_{n-1}, X_{n})\). That is, \(Y_{n}\) keeps track of the last two states of the original chain. Is \(\{Y_{n}, n \geqslant 1\}\) a Markov chain? If so, determine its transition probabilities and find $$ \lim_{n \rightarrow \infty} P\{Y_{n}=(i, j)\} $$
