CS 285: Lecture 9, Part 2

RAIL · 17 minute read

The lecture shows that the return of a new policy can be approximated accurately by taking expectations under the old state marginal p_θ(s_t) instead of p_θ'(s_t), i.e. by ignoring the distribution mismatch, and that maximizing the surrogate Ā(θ') improves the new policy because Ā(θ') approximates J(θ') − J(θ). The approximation is only valid when the policies, and hence their state marginals, are close: bounding the total variation divergence between π_θ' and π_θ bounds the error incurred when expectations are taken under the mismatched distribution, which in turn affects the choice of RL algorithm.
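
For reference, the objects named in this summary can be written out as follows. This is a sketch in the standard notation of the lecture (discounted advantages, importance sampling over actions), not a verbatim copy of the slides.

```latex
% The improvement of the new policy equals the expected advantage of the old
% policy, taken under the *new* policy's trajectory distribution:
J(\theta') - J(\theta)
  = \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)}\,
    \mathbb{E}_{a_t \sim \pi_{\theta'}(a_t \mid s_t)}
      \big[\gamma^t A^{\pi_\theta}(s_t, a_t)\big]

% Ignoring the distribution mismatch (p_theta in place of p_theta') and
% importance-sampling the action gives the tractable surrogate \bar{A}(\theta'):
\bar{A}(\theta')
  = \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)}\,
    \mathbb{E}_{a_t \sim \pi_{\theta}(a_t \mid s_t)}
      \Big[\tfrac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\,
           \gamma^t A^{\pi_\theta}(s_t, a_t)\Big]
```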

Insights

  • Using p_θ(s_t) in place of p_θ'(s_t), i.e. ignoring the distribution mismatch, still approximates the return of the new policy accurately, which matters most when the two policies are close to each other.
  • When the policies are close, the total variation divergence between the state marginals is bounded by 2εt; this bounds how much expected values can change between the two distributions, and how the resulting error term scales with ε influences the choice of RL algorithm (see the sketch after this list).
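
Spelled out, the bound referenced in the second insight is the following (a sketch in the lecture's notation; |·| here denotes total variation divergence):

```latex
% If the policies are epsilon-close in total variation at every state ...
\big|\pi_{\theta'}(a_t \mid s_t) - \pi_{\theta}(a_t \mid s_t)\big| \le \epsilon
  \quad \text{for all } s_t
% ... then the state marginals at time t are also close:
\quad\Longrightarrow\quad
\big|p_{\theta'}(s_t) - p_{\theta}(s_t)\big| \le 2\,\epsilon\, t
```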


Recent questions

  • How can the return of a new policy be approximated accurately?

    The return of a new policy can be approximated accurately by taking expectations under p_θ(s_t) instead of p_θ'(s_t), ignoring the distribution mismatch. Maximizing the resulting surrogate Ā(θ') then improves the new policy, because Ā(θ') approximates the improvement J(θ') − J(θ).

  • What is crucial when policies are close?

    The approximation relies on p_θ(s_t) being close to p_θ'(s_t). This holds whenever π_θ' is close to π_θ: if the policies are close, the state marginals they induce are close as well.

  • How is total variation divergence bounded?

    If the total variation divergence between π_θ' and π_θ is bounded by ε at every state, then the total variation divergence between the state marginals at time t is bounded by 2εt. This in turn bounds how much the expected value of a bounded function can change when it is computed under one state distribution instead of the other (a numerical sketch of this bound follows this list).

  • What influences the choice of RL algorithm?

    The error term in the bound scales with ε, the total variation divergence between the new and old policies. How large this error can become influences which RL algorithm to use when maximizing the expected value, under p_θ, of the importance-weighted advantage in order to optimize the RL objective.

  • How can improvement in the RL objective be ensured?

    Improvement in the RL objective can be ensured by maximizing the expected value, under p_θ, of the importance-weighted advantage, as long as the new policy π_θ' stays close to the original policy π_θ in total variation divergence. Taking the derivative of this surrogate with respect to θ' yields the policy gradient, so gradient steps on it improve the RL objective while θ' remains near θ.
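
As a quick sanity check of the bounded-expectation argument in the answers above, the snippet below numerically verifies that when two state distributions differ by at most some total variation distance, their expectations of a bounded function differ by at most that distance times the function's largest magnitude. The three-state distributions and the function values are made up purely for illustration; only NumPy is assumed.

```python
import numpy as np

# Hypothetical discrete state marginals at some time step t (values made up).
p_theta = np.array([0.7, 0.2, 0.1])          # p_theta(s_t)
p_theta_prime = np.array([0.6, 0.25, 0.15])  # p_theta'(s_t)

# Total variation distance, written as a sum of absolute differences; this is
# the quantity the lecture bounds by 2 * epsilon * t.
tv = np.abs(p_theta_prime - p_theta).sum()

# Any bounded function of state, e.g. an advantage estimate (values made up).
f = np.array([1.0, -2.0, 0.5])

lhs = float(p_theta_prime @ f)                    # E_{p_theta'}[f(s_t)]
rhs = float(p_theta @ f) - tv * np.abs(f).max()   # E_{p_theta}[f(s_t)] - TV * max|f|

# The expectation under p_theta' can be worse than under p_theta by at most TV * max|f|.
assert lhs >= rhs
print(f"TV = {tv:.3f},  E_p'[f] = {lhs:.3f} >= {rhs:.3f} = E_p[f] - TV * max|f|")
```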


Summary

00:00

Approximating Policy Returns with State Marginals

  • Using p_θ(s_t) instead of p_θ'(s_t), i.e. ignoring the distribution mismatch, still yields an accurate approximation of the new policy's return.
  • The objective is to maximize Ā(θ') to improve the new policy, since Ā(θ') approximates J(θ') − J(θ).
  • This requires p_θ(s_t) to be close to p_θ'(s_t), which holds when π_θ is close to π_θ'.
  • For a deterministic policy π_θ, the state marginals for θ and θ' are close whenever π_θ' is close to π_θ.
  • The total variation divergence between the state marginals is bounded by 2εt when the policies are close.
  • More generally, bounding the total variation divergence between π_θ' and π_θ by ε gives the same 2εt bound on the state marginals.
  • In other words, the state marginals differ by at most 2εt when the policies are close.
  • For distributions whose total variation divergence is bounded, expected values of bounded functions can be related across the two distributions, which connects the two objectives written in terms of advantage values.
  • The expected value under p_θ'(s_t) of such a function is bounded below by the expected value under p_θ(s_t) of the same function minus an error term (spelled out in the sketch after this list).
  • The error term is 2εt times a constant C, where C is the largest possible reward times the number of time steps.
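
Putting the last three bullets into one chain (a sketch in the lecture's notation; f is any function of state bounded by its largest value, and the finite-horizon/discounted forms of the constant follow the two cases discussed in the lecture):

```latex
% Bounded total variation between state marginals ...
\big|p_{\theta'}(s_t) - p_{\theta}(s_t)\big| \le 2\,\epsilon\, t
% ... bounds how much an expectation can degrade when distributions are swapped:
\quad\Longrightarrow\quad
\mathbb{E}_{p_{\theta'}(s_t)}\big[f(s_t)\big]
  \;\ge\;
\mathbb{E}_{p_{\theta}(s_t)}\big[f(s_t)\big]
  \;-\; 2\,\epsilon\, t \,\max_{s_t} f(s_t)

% Applied to the advantage terms and summed over time steps, the error is
\sum_t 2\,\epsilon\, t\, C,
\qquad
C \in O(T\, r_{\max}) \ \text{(finite horizon)}
\quad\text{or}\quad
C \in O\!\left(\tfrac{r_{\max}}{1-\gamma}\right) \ \text{(discounted, infinite horizon)}
```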

16:20

Optimizing RL Objective with Policy Gradient

  • The factor 1/(1 − γ) plays the role of the time horizon when maximizing a bound on the expectation under p_θ', which is equivalent to maximizing the RL objective. The error term in this bound scales with ε, the total variation divergence between the new and old policies, and that scaling influences which RL algorithm to use.
  • Maximizing the expected value, under p_θ, of the importance-weighted advantage is a reliable way to optimize the RL objective, as long as the new policy π_θ' stays close to the original policy π_θ in total variation divergence. Taking the derivative of this objective with respect to θ' yields the policy gradient, so improvement in the RL objective is guaranteed while θ' remains near θ (see the sketch below).
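
The sketch below illustrates the last point in code: a surrogate loss built from the importance-weighted advantage whose gradient, evaluated at θ' = θ, reduces to the ordinary policy gradient. This is a minimal PyTorch sketch assuming a discrete-action policy parameterized by logits and a batch of states, actions, and advantage estimates already collected under π_θ; the function name and shapes are made up for illustration, not taken from the lecture.

```python
import torch

def surrogate_loss(logits_new, logits_old, actions, advantages):
    """Negative importance-weighted advantage: -E_{p_theta}[ (pi_theta'/pi_theta) * A ]."""
    logp_new = torch.log_softmax(logits_new, dim=-1).gather(1, actions[:, None]).squeeze(1)
    logp_old = torch.log_softmax(logits_old, dim=-1).gather(1, actions[:, None]).squeeze(1)
    ratio = torch.exp(logp_new - logp_old.detach())  # pi_theta'(a|s) / pi_theta(a|s)
    return -(ratio * advantages).mean()              # maximize the surrogate = minimize this

# Toy usage with made-up shapes: 32 sampled transitions, 4 discrete actions.
logits_old = torch.randn(32, 4)                       # pi_theta's logits (constants)
logits_new = logits_old.clone().requires_grad_(True)  # start at theta' = theta
actions = torch.randint(0, 4, (32,))                  # actions sampled from pi_theta
advantages = torch.randn(32)                          # advantage estimates A^{pi_theta}
loss = surrogate_loss(logits_new, logits_old, actions, advantages)
loss.backward()
# At theta' = theta the ratio is 1 and its gradient is grad log pi_theta'(a|s), so
# logits_new.grad is the (negative) policy-gradient direction for this batch.
```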
