CS 285: Lecture 9, Part 2
RAIL · 17 minute read
The lecture shows that the improvement of a new policy, J(θ') − J(θ), can be approximated by a surrogate objective Ā(θ') that ignores the distribution mismatch: expectations are taken under the old state marginal p_θ(s_t) rather than p_θ'(s_t), and the new policy is improved by maximizing Ā(θ'). This approximation holds only when the new policy stays close to the old one; that closeness bounds the total variation divergence between the state marginals, which in turn bounds the error in the resulting expected values and ultimately affects which RL algorithm is appropriate.
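For reference, the two quantities described above can be written out in the lecture's notation; this is a sketch, with the advantage A^{π_θ} always computed under the old policy π_θ:

```latex
J(\theta') - J(\theta)
  \;=\; \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}
        \Big[ \sum_t \gamma^t \, A^{\pi_\theta}(s_t, a_t) \Big]

% Ignoring the distribution mismatch, i.e. replacing p_{\theta'}(s_t) with p_\theta(s_t),
% gives the surrogate objective \bar{A}(\theta') that is actually maximized:
\bar{A}(\theta')
  \;=\; \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}
        \Big[ \mathbb{E}_{a_t \sim \pi_{\theta'}(a_t \mid s_t)}
              \big[ \gamma^t \, A^{\pi_\theta}(s_t, a_t) \big] \Big]
```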
Insights
- Ignoring the distribution mismatch by taking expectations under p_θ(s_t) instead of p_θ'(s_t) still approximates the return of the new policy accurately, provided the new and old policies are close to each other.
- When the policies are close, the total variation divergence between the state marginals is bounded by 2εt; this makes it possible to bound how much an expected value changes when the distribution is swapped, and how that error term scales with ε influences which RL algorithm to use (see the numerical sketch after this list).
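As a rough numerical illustration of the second insight, the sketch below rolls out two nearby policies on a small made-up tabular MDP and checks the state-marginal divergence against the 2εt bound; the MDP, the policies, and all variable names are hypothetical and only serve to exercise the inequality.

```python
import numpy as np

# A tiny hypothetical 3-state, 2-action MDP (made up for illustration).
n_s, n_a, T = 3, 2, 20
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))      # P[s, a, :] = p(s' | s, a)

# Old policy pi and a slightly perturbed new policy pi_prime.
pi = rng.dirichlet(np.ones(n_a), size=n_s)            # pi[s, :] = pi(a | s)
pi_prime = 0.95 * pi + 0.05 * rng.dirichlet(np.ones(n_a), size=n_s)
eps = 0.5 * np.abs(pi_prime - pi).sum(axis=1).max()   # max_s TV(pi'(.|s), pi(.|s))

def state_marginals(policy):
    """Return the state marginal p_t(s) for t = 0..T-1 under the given policy."""
    p = np.zeros((T, n_s))
    p[0, 0] = 1.0                                      # fixed initial state
    for t in range(1, T):
        # p_t(s') = sum_{s,a} p_{t-1}(s) * policy(a|s) * P(s'|s,a)
        p[t] = np.einsum("s,sa,saz->z", p[t - 1], policy, P)
    return p

p_old, p_new = state_marginals(pi), state_marginals(pi_prime)
for t in range(T):
    divergence = np.abs(p_new[t] - p_old[t]).sum()     # |p_theta'(s_t) - p_theta(s_t)|
    assert divergence <= 2 * eps * t + 1e-9            # the 2*eps*t bound from the lecture
    print(f"t={t:2d}  divergence={divergence:.4f}  bound={2 * eps * t:.4f}")
```

In practice the measured divergence sits far below the bound; the point is only that 2εt is a worst case that grows linearly in both the policy gap ε and the time step t.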
Recent questions
How can the return of a new policy be approximated accurately?
The improvement J(θ') − J(θ) can be approximated by the surrogate Ā(θ'), which ignores the distribution mismatch by taking expectations under p_θ(s_t) instead of p_θ'(s_t); maximizing Ā(θ') over θ' then improves the new policy.
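As a hedged sketch (not the lecture's code; the function and variable names are invented), a Monte Carlo estimate of the surrogate Ā(θ') from samples collected under the old policy π_θ could look like this:

```python
import numpy as np

def surrogate_objective(log_pi_new, log_pi_old, advantages, gammas):
    """Estimate A_bar(theta') from (s_t, a_t) samples drawn under pi_theta.
    The change of policy is handled by the importance weight
    pi_theta'(a_t|s_t) / pi_theta(a_t|s_t)."""
    weights = np.exp(log_pi_new - log_pi_old)       # importance weights
    return np.mean(gammas * weights * advantages)   # sample average of gamma^t * w * A

# Hypothetical per-sample quantities from rollouts of the old policy.
log_pi_old = np.log(np.array([0.50, 0.25, 0.80]))   # log pi_theta(a_t | s_t)
log_pi_new = np.log(np.array([0.55, 0.20, 0.85]))   # log pi_theta'(a_t | s_t)
advantages = np.array([1.2, -0.4, 0.7])             # A^{pi_theta}(s_t, a_t) estimates
gammas = 0.99 ** np.arange(3)                       # gamma^t for each sample
print(surrogate_objective(log_pi_new, log_pi_old, advantages, gammas))
```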
Why does closeness of the policies matter?
The approximation relies on p_θ(s_t) being close to p_θ'(s_t); the key claim is that if the policy π_θ' is close to π_θ, then the state marginals under θ and θ' are close as well, which justifies swapping the distributions.
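Stated precisely, the closeness condition used in the lecture is a per-state bound in total variation (sketched here in the lecture's notation):

```latex
\pi_{\theta'} \text{ close to } \pi_\theta
  \quad\Longleftrightarrow\quad
  \big| \pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t) \big| \le \epsilon
  \;\; \text{for all } s_t
```

Here |·| denotes the total variation divergence between the action distributions, and the claim is that this per-state condition on the policies is enough to keep the state marginals p_θ(s_t) and p_θ'(s_t) close as well.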
How is the total variation divergence between state marginals bounded?
When the new policy is within ε of the old one in total variation at every state, the total variation divergence between the state marginals at time t is bounded by 2εt; this bound on the marginals is what makes it possible to bound how much an expected value changes when it is computed under the other distribution.
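The argument behind the 2εt bound, reconstructed from the lecture (the first step is the standard coupling lemma for distributions with bounded total variation divergence), runs roughly as follows:

```latex
p_{\theta'}(s_t)
  \;=\; (1-\epsilon)^t \, p_\theta(s_t)
        \;+\; \big(1 - (1-\epsilon)^t\big) \, p_{\text{mistake}}(s_t)

\big| p_{\theta'}(s_t) - p_\theta(s_t) \big|
  \;=\; \big(1 - (1-\epsilon)^t\big)\,
        \big| p_{\text{mistake}}(s_t) - p_\theta(s_t) \big|
  \;\le\; 2\big(1 - (1-\epsilon)^t\big)
  \;\le\; 2\epsilon t
```

The first term is the probability that the new policy has made no "mistake" (chosen a different action) up to time t, in which case the state distribution is unchanged; everything else is lumped into p_mistake.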
What influences the choice of RL algorithm?
The error term in the bound scales with ε, the total variation divergence between the new and old policies. The RL objective can therefore be optimized by maximizing, under p_θ, the expected value of the importance-weighted advantage, provided ε stays small, and how tightly ε must be controlled influences which RL algorithm to use.
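Concretely, the bounded divergence between state marginals turns into a bound on expectations, which is where the ε-dependent error term comes from; sketching the lecture's inequality:

```latex
\mathbb{E}_{s_t \sim p_{\theta'}(s_t)}\big[ f(s_t) \big]
  \;\ge\; \mathbb{E}_{s_t \sim p_\theta(s_t)}\big[ f(s_t) \big]
          \;-\; 2\epsilon t \, \max_{s_t} f(s_t)
```

Summing over time steps, the correction term scales linearly with ε times a horizon-dependent constant (the maximum of f is on the order of T·r_max, or r_max/(1−γ) in the discounted case), which is why ε must be kept small for the bound, and hence the algorithm built on it, to be useful.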
How can improvement in the RL objective be ensured?
Improvement in the RL objective is guaranteed by maximizing the expected value, under p_θ, of the importance-weighted advantage, as long as the new policy π_θ' remains close to the original policy π_θ in total variation. Taking the derivative of this surrogate objective with respect to θ' and evaluating it at θ' = θ recovers the ordinary policy gradient, so improving the surrogate improves the RL objective whenever θ' stays near θ.
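The gradient claim can be checked numerically; the sketch below uses a hypothetical single-state softmax policy in PyTorch (the parameterization and all names are assumptions, not the lecture's implementation) and confirms that the gradient of the importance-weighted surrogate at θ' = θ matches the ordinary policy gradient:

```python
import torch

# Hypothetical tabular softmax policy over 4 actions in a single state.
theta = torch.randn(4, requires_grad=True)
theta_prime = theta.detach().clone().requires_grad_(True)   # evaluate at theta' = theta

actions = torch.tensor([0, 2, 1, 3, 2])                      # sampled under pi_theta
advantages = torch.tensor([0.5, -1.0, 0.3, 0.8, -0.2])       # A^{pi_theta}(s_t, a_t) estimates

log_pi_old = torch.log_softmax(theta, dim=0)[actions].detach()
log_pi_new = torch.log_softmax(theta_prime, dim=0)[actions]

# Surrogate: importance-weighted advantage, with samples from the old policy.
surrogate = (torch.exp(log_pi_new - log_pi_old) * advantages).mean()
grad_surrogate = torch.autograd.grad(surrogate, theta_prime)[0]

# Ordinary policy gradient: E[ grad log pi_theta(a|s) * A ].
log_pi = torch.log_softmax(theta, dim=0)[actions]
policy_gradient = torch.autograd.grad((log_pi * advantages).mean(), theta)[0]

print(torch.allclose(grad_surrogate, policy_gradient))       # True at theta' = theta
```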
Related videos
귓속말의 자막나라
[Korean subtitles] CS 285: Lecture 9, Part 2
Wolfram
A conversation between Nassim Nicholas Taleb and Stephen Wolfram at the Wolfram Summer School 2021
Harvard University
Lecture 4: Conditional Probability | Statistics 110
3Blue1Brown
Bayes theorem, the geometry of changing beliefs
Harvard University
Lecture 5: Conditioning Continued, Law of Total Probability | Statistics 110