[Korean Subtitles] CS 285: Lecture 9, Part 2

귓속말의 자막나라 · 17-minute read

Replacing the new policy's state marginal p_θ'(s_t) with the old policy's p_θ(s_t) gives an accurate approximation of the new policy's return, provided the mismatch between the two state distributions is small enough to ignore. The resulting objective, the expected value under p_θ and π_θ of the importance-weighted advantage, is the central quantity to maximize in this part of the lecture.
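A sketch of the approximation in the lecture's notation (with A^{π_θ} denoting the advantage under the old policy and γ the discount factor):

\[
J(\theta') - J(\theta) = \sum_t E_{s_t \sim p_{\theta'}(s_t)} \Big[ E_{a_t \sim \pi_{\theta'}(a_t \mid s_t)} \big[ \gamma^t A^{\pi_\theta}(s_t, a_t) \big] \Big]
\approx \sum_t E_{s_t \sim p_{\theta}(s_t)} \Big[ E_{a_t \sim \pi_{\theta'}(a_t \mid s_t)} \big[ \gamma^t A^{\pi_\theta}(s_t, a_t) \big] \Big] = \bar{A}(\theta'),
\]

where the only change is substituting the old policy's state marginal p_θ(s_t) for the new policy's p_θ'(s_t).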

Insights

  • Replacing p_θ'(s_t) with p_θ(s_t) provides an accurate approximation of the new policy's return when the mismatch between the two state marginals is small enough to ignore.
  • The objective to maximize is the expected value under p_θ and π_θ of the importance-weighted advantage; the approximation is only trustworthy when the new policy π_θ' stays close to the original policy π_θ in total variation divergence (see the identity after this list).
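The importance-sampling step referenced above can be written as follows (a sketch; the expectation over actions under π_θ' is rewritten using samples from π_θ):

\[
E_{a_t \sim \pi_{\theta'}(a_t \mid s_t)} \big[ A^{\pi_\theta}(s_t, a_t) \big]
= E_{a_t \sim \pi_{\theta}(a_t \mid s_t)} \left[ \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)} A^{\pi_\theta}(s_t, a_t) \right],
\]

which is exact; the approximation error comes only from the state marginals, and it is controlled by requiring the total variation divergence between π_θ'(·|s_t) and π_θ(·|s_t) to be at most ε for every state.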


Recent questions

  • How can p_θ(s_t) be used to approximate the new policy's return?

    By using p_θ(s_t) in place of p_θ'(s_t), the return of the new policy can be accurately approximated as long as the distribution mismatch is ignored. The improvement J(θ') − J(θ) is approximated by Ā(θ'), so maximizing Ā(θ') with respect to θ' yields a better new policy.

  • What is the relationship between π_θ and π_θ'?

    For a deterministic policy π_θ, the state marginals for θ and θ' are close whenever π_θ' is close to π_θ. This closeness yields a bounded difference between p_θ(s_t) and p_θ'(s_t), where closeness means π_θ' takes a different action than π_θ with probability at most ε.

  • How is the total variation divergence bounded for stochastic policies?

    For stochastic policies, the total variation divergence between the state marginals at time step t is bounded by 2εt, and the bound shrinks as ε decreases. As a consequence, the expected value under p_θ'(s_t) of any function f(s_t) is bounded below by the expected value under p_θ(s_t) minus 2εt times the maximum value of f.

  • What is the significance of the constant C in the bound?

    The constant C is the number of time steps multiplied by the maximum reward r_max in the finite-horizon case. With infinite time steps and a discount factor γ, the sum of discounts equals 1/(1 − γ), so C becomes r_max/(1 − γ). With this bound in place, maximizing the expected value under p_θ and π_θ of the importance-weighted advantage is what guarantees improvement (see the bounds sketched after this list).

  • How does the policy gradient improve the reinforcement learning objective?

    The objective is the expected value under p_θ and π_θ of the importance-weighted advantage. Taking the derivative of this objective with respect to θ', which enters only through the importance weight, recovers the policy gradient. Maximizing the objective therefore improves the RL objective as long as θ' remains near θ, that is, as long as the new policy stays close to the original policy in total variation divergence.
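A sketch of the bounds behind the answers above, in the lecture's notation (f is any bounded function of the state, r_max the largest reward magnitude, γ the discount factor):

\[
E_{p_{\theta'}(s_t)} \big[ f(s_t) \big] \ge E_{p_{\theta}(s_t)} \big[ f(s_t) \big] - 2 \epsilon t \max_{s_t} f(s_t),
\qquad
C = O(T\, r_{\max}) \ \text{(finite horizon)}, \quad
C = O\!\left( \frac{r_{\max}}{1 - \gamma} \right) \ \text{(infinite horizon)},
\]

where the infinite-horizon constant uses the geometric sum \(\sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma}\).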


Summary

00:00

Approximating Policy Return with Total Variation Divergence

  • Using p_θ(s_t) instead of p_θ'(s_t) accurately approximates the return of a new policy if the distribution mismatch is ignored.
  • The goal is to maximize Ā(θ') to obtain a better new policy, since Ā(θ') approximates the improvement J(θ') − J(θ).
  • p_θ(s_t) is close to p_θ'(s_t) when π_θ is close to π_θ', which gives a bounded difference between the two state marginals.
  • For a deterministic policy π_θ, the state marginals for θ and θ' are close if π_θ' is close to π_θ, meaning π_θ' takes a different action than π_θ with probability at most ε.
  • In that case the total variation divergence between the state marginals at time step t is bounded by 2εt, which shrinks as ε decreases.
  • For arbitrary (stochastic) distributions π_θ, closeness means the total variation divergence between π_θ'(·|s_t) and π_θ(·|s_t) is bounded by ε for all states s_t.
  • A useful lemma states that if two distributions have total variation divergence ε, a joint distribution can be constructed under which they agree with probability 1 − ε.
  • Using this lemma, the total variation divergence between the state marginals is bounded by 2εt for stochastic policies as well.
  • Consequently, the expected value under p_θ'(s_t) of a function f(s_t) is bounded below by the expected value under p_θ(s_t) minus 2εt times the maximum value of f.
  • The expected value of the importance-sampled advantage estimator under p_θ'(s_t) can then be bounded below by the same quantity under p_θ(s_t) minus 2εt times a constant C, where C is the largest possible reward times the number of time steps (see the derivation sketched after this list).
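A sketch of the state-marginal bound for the deterministic case (the lemma above extends the same argument to stochastic policies), where ε is the probability that π_θ' takes a different action than π_θ:

\[
p_{\theta'}(s_t) = (1 - \epsilon)^t\, p_{\theta}(s_t) + \big( 1 - (1 - \epsilon)^t \big)\, p_{\text{mistake}}(s_t)
\;\Rightarrow\;
\sum_{s_t} \big| p_{\theta'}(s_t) - p_{\theta}(s_t) \big| \le 2 \big( 1 - (1 - \epsilon)^t \big) \le 2 \epsilon t,
\]

using the identity \((1 - \epsilon)^t \ge 1 - \epsilon t\) for \(\epsilon \in [0, 1]\).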

15:53

Optimizing Reinforcement Learning with Policy Gradient

  • The constant C is the number of time steps multiplied by the maximum reward r_max. With infinite time steps and a discount factor γ, the sum of discounts equals 1/(1 − γ). In the finite-horizon case C equals the total number of time steps times r_max, while in the infinite-horizon case it equals r_max/(1 − γ).
  • The resulting objective is to maximize the expected value under p_θ and π_θ of the importance-weighted advantage, while keeping the new policy π_θ' close to the original policy π_θ in total variation divergence. Taking the derivative of this objective with respect to θ', which enters only through the importance weight, recovers the policy gradient, so maximizing it improves the RL objective as long as θ' remains near θ (see the equations after this list).
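A sketch of the resulting constrained optimization and its connection to the policy gradient (evaluating the gradient of the importance weight at θ' = θ recovers the usual log-derivative form):

\[
\theta' \leftarrow \arg\max_{\theta'} \sum_t E_{s_t \sim p_{\theta}(s_t)} \left[ E_{a_t \sim \pi_{\theta}(a_t \mid s_t)} \left[ \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\, \gamma^t A^{\pi_\theta}(s_t, a_t) \right] \right]
\quad \text{such that} \quad \big| \pi_{\theta'}(\cdot \mid s_t) - \pi_{\theta}(\cdot \mid s_t) \big|_{TV} \le \epsilon,
\]

\[
\nabla_{\theta'} \left. \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)} \right|_{\theta' = \theta} A^{\pi_\theta}(s_t, a_t)
= \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t),
\]

which is why ascending this objective near θ' = θ coincides with the policy gradient while the constraint keeps the bound valid.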
