How is the policy gradient calculated in REINFORCE? Planned maintenance scheduled April 23, 2019 at 23:30 UTC (7:30pm US/Eastern) Announcing the arrival of Valued Associate #679: Cesar Manara Unicorn Meta Zoo #1: Why another podcast?Implementing experience replay in reinforcement learningWhy does the discount rate in the REINFORCE algorithm appears twice?Questions about n-step tree backup algorithmImportance Sampling Ratio ProbabilityWhat is the meaning of Model(s, a) in the prioritized sweeping algorithm?Impact of Varying Length Trajectories on Policy Gradient OptimizationWhy is the state value function sufficient to determine the policy if a model is available?Is REINFORCE the same as 'vanilla policy gradient'?Difficulty understanding Monte Carlo policy evaluation (state-value) for gridworldPolicy gradient loss for neural network training

Simple Line in LaTeX Help!

French equivalents of おしゃれは足元から (Every good outfit starts with the shoes)

NIntegrate on a solution of a matrix ODE

Noise in Eigenvalues plot

How to achieve cat-like agility?

.bashrc alias for a command with fixed second parameter

How to resize main filesystem

How do I say "this must not happen"?

How can I prevent/balance waiting and turtling as a response to cooldown mechanics

Can two people see the same photon?

First paper to introduce the "principal-agent problem"

malloc in main() or malloc in another function: allocating memory for a struct and its members

Does the universe have a fixed centre of mass?

Marquee sign letters

Is there night in Alpha Complex?

Understanding piped commands in GNU/Linux

What is a more techy Technical Writer job title that isn't cutesy or confusing?

Besides transaction validation, are there any other uses of the Script language in Bitcoin

Where did Ptolemy compare the Earth to the distance of fixed stars?

Dinosaur Word Search, Letter Solve, and Unscramble

Is the Mordenkainen's Sword spell underpowered?

Why not use the yoke to control yaw, as well as pitch and roll?

Why are current probes so expensive?

What should one know about term logic before studying propositional and predicate logic?



How is the policy gradient calculated in REINFORCE?



Planned maintenance scheduled April 23, 2019 at 23:30 UTC (7:30pm US/Eastern)
Announcing the arrival of Valued Associate #679: Cesar Manara
Unicorn Meta Zoo #1: Why another podcast?Implementing experience replay in reinforcement learningWhy does the discount rate in the REINFORCE algorithm appears twice?Questions about n-step tree backup algorithmImportance Sampling Ratio ProbabilityWhat is the meaning of Model(s, a) in the prioritized sweeping algorithm?Impact of Varying Length Trajectories on Policy Gradient OptimizationWhy is the state value function sufficient to determine the policy if a model is available?Is REINFORCE the same as 'vanilla policy gradient'?Difficulty understanding Monte Carlo policy evaluation (state-value) for gridworldPolicy gradient loss for neural network training



.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








4












$begingroup$


Reading Sutton and Barto, I see the following in describing policy gradients:



policy grad



How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the gradient is computed, since we need some loss function to compute the gradient.



I've seen a good PyTorch article, but I still don't understand the meaning of this gradient conceptually, and I don't know what I'm looking to implement. Any intuition that you could provide would be helpful.










share|improve this question











$endgroup$











  • $begingroup$
    Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
    $endgroup$
    – nbro
    9 hours ago











  • $begingroup$
    @nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
    $endgroup$
    – Hanzy
    9 hours ago










  • $begingroup$
    @nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
    $endgroup$
    – Hanzy
    8 hours ago






  • 1




    $begingroup$
    I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
    $endgroup$
    – nbro
    8 hours ago











  • $begingroup$
    @nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
    $endgroup$
    – Hanzy
    4 hours ago

















4












$begingroup$


Reading Sutton and Barto, I see the following in describing policy gradients:



policy grad



How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the gradient is computed, since we need some loss function to compute the gradient.



I've seen a good PyTorch article, but I still don't understand the meaning of this gradient conceptually, and I don't know what I'm looking to implement. Any intuition that you could provide would be helpful.










share|improve this question











$endgroup$











  • $begingroup$
    Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
    $endgroup$
    – nbro
    9 hours ago











  • $begingroup$
    @nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
    $endgroup$
    – Hanzy
    9 hours ago










  • $begingroup$
    @nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
    $endgroup$
    – Hanzy
    8 hours ago






  • 1




    $begingroup$
    I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
    $endgroup$
    – nbro
    8 hours ago











  • $begingroup$
    @nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
    $endgroup$
    – Hanzy
    4 hours ago













4












4








4


1



$begingroup$


Reading Sutton and Barto, I see the following in describing policy gradients:



policy grad



How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the gradient is computed, since we need some loss function to compute the gradient.



I've seen a good PyTorch article, but I still don't understand the meaning of this gradient conceptually, and I don't know what I'm looking to implement. Any intuition that you could provide would be helpful.










share|improve this question











$endgroup$




Reading Sutton and Barto, I see the following in describing policy gradients:



policy grad



How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the gradient is computed, since we need some loss function to compute the gradient.



I've seen a good PyTorch article, but I still don't understand the meaning of this gradient conceptually, and I don't know what I'm looking to implement. Any intuition that you could provide would be helpful.







reinforcement-learning policy-gradients rl-an-introduction notation reinforce






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited 1 hour ago









Philip Raeisghasem

1,169121




1,169121










asked 10 hours ago









HanzyHanzy

1516




1516











  • $begingroup$
    Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
    $endgroup$
    – nbro
    9 hours ago











  • $begingroup$
    @nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
    $endgroup$
    – Hanzy
    9 hours ago










  • $begingroup$
    @nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
    $endgroup$
    – Hanzy
    8 hours ago






  • 1




    $begingroup$
    I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
    $endgroup$
    – nbro
    8 hours ago











  • $begingroup$
    @nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
    $endgroup$
    – Hanzy
    4 hours ago
















  • $begingroup$
    Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
    $endgroup$
    – nbro
    9 hours ago











  • $begingroup$
    @nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
    $endgroup$
    – Hanzy
    9 hours ago










  • $begingroup$
    @nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
    $endgroup$
    – Hanzy
    8 hours ago






  • 1




    $begingroup$
    I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
    $endgroup$
    – nbro
    8 hours ago











  • $begingroup$
    @nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
    $endgroup$
    – Hanzy
    4 hours ago















$begingroup$
Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
$endgroup$
– nbro
9 hours ago





$begingroup$
Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
$endgroup$
– nbro
9 hours ago













$begingroup$
@nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
$endgroup$
– Hanzy
9 hours ago




$begingroup$
@nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
$endgroup$
– Hanzy
9 hours ago












$begingroup$
@nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
$endgroup$
– Hanzy
8 hours ago




$begingroup$
@nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
$endgroup$
– Hanzy
8 hours ago




1




1




$begingroup$
I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
$endgroup$
– nbro
8 hours ago





$begingroup$
I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
$endgroup$
– nbro
8 hours ago













$begingroup$
@nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
$endgroup$
– Hanzy
4 hours ago




$begingroup$
@nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
$endgroup$
– Hanzy
4 hours ago










1 Answer
1






active

oldest

votes


















2












$begingroup$

The first part of this answer is a little background that might bolster your intuition for what's going on. The second part is the more practical and direct answer to your question.




The gradient is just the generalization of the derivative to multivariable functions. The gradient of a function at a certain point is a vector that points in the direction of the steepest increase of that function.



Usually, we take a derivative/gradient of some loss function $mathcalL$ because we want to minimize that loss. So we update our parameters in the direction opposite the direction of the gradient.



$$theta_t+1 = theta_t - alphanabla_theta_t mathcalL tag1$$



In policy gradient methods, we're not trying to minimize a loss function. Actually, we're trying to maximize some measure $J$ of the performance of our agent. So now we want to update parameters in the same direction as the gradient.



$$theta_t+1 = theta_t + alphanabla_theta_t J tag2$$



In the episodic case, $J$ is the value of the starting state. In the continuing case, $J$ is the average reward. It just so happens that a nice theorem called the Policy Gradient Theorem applies to both cases. This theorem states that



$$beginalign
nabla_theta_tJ(theta_t) &propto sum_s mu(s)sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)\
&=mathbbE_mu left[ sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)right].
endaligntag3
$$



The rest of the derivation is in your question, so let's skip to the end.



$$beginalign
theta_t+1 &= theta_t + alpha G_t fracnabla_theta_tpi(A_tpi(A_t\
&= theta_t + alpha G_t nabla_theta_t ln pi(A_t|S_t,theta_t)
endaligntag4$$



Remember, $(4)$ says exactly the same thing as $(2)$, so REINFORCE just updates parameters in the direction that will most increase $J$. (Because we sample from an expectation in the derivation, the parameter step in REINFORCE is actually an unbiased estimate of the maximizing step.)




Alright, but how do we actually get this gradient? Well, you use the chain rule of derivatives (backpropagation). Practically, though, both Tensorflow and PyTorch can take all the derivatives for you.



Tensorflow, for example, has a minimize() method in its Optimizer class that takes a loss function as an input. Given a function of the parameters of the network, it will do the calculus for you to determine which way to update the parameters in order to minimize that function. But we don't want to minimize. We want to maximize! So just include a negative sign.



In our case, the function we want to minimize is
$$-G_tln pi(A_t|S_t,theta_t).$$



This corresponds to stochastic gradient descent ($G_t$ is not a function of $theta_t$).



You might want to do minibatch gradient descent on each episode of experience in order to get a better (lower variance) estimate of $nabla_theta_t J$. If so, you would instead minimize
$$-sum_t G_tln pi(A_t|S_t,theta_t),$$
where $theta_t$ would be constant for different values of $t$ within the same episode. Technically, minibatch gradient descent updates parameters in the average estimated maximizing direction, but the scaling factor $1/N$ can be absorbed into the learning rate.






share|improve this answer











$endgroup$












  • $begingroup$
    Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
    $endgroup$
    – Hanzy
    2 hours ago











Your Answer








StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "658"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fai.stackexchange.com%2fquestions%2f11929%2fhow-is-the-policy-gradient-calculated-in-reinforce%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









2












$begingroup$

The first part of this answer is a little background that might bolster your intuition for what's going on. The second part is the more practical and direct answer to your question.




The gradient is just the generalization of the derivative to multivariable functions. The gradient of a function at a certain point is a vector that points in the direction of the steepest increase of that function.



Usually, we take a derivative/gradient of some loss function $mathcalL$ because we want to minimize that loss. So we update our parameters in the direction opposite the direction of the gradient.



$$theta_t+1 = theta_t - alphanabla_theta_t mathcalL tag1$$



In policy gradient methods, we're not trying to minimize a loss function. Actually, we're trying to maximize some measure $J$ of the performance of our agent. So now we want to update parameters in the same direction as the gradient.



$$theta_t+1 = theta_t + alphanabla_theta_t J tag2$$



In the episodic case, $J$ is the value of the starting state. In the continuing case, $J$ is the average reward. It just so happens that a nice theorem called the Policy Gradient Theorem applies to both cases. This theorem states that



$$beginalign
nabla_theta_tJ(theta_t) &propto sum_s mu(s)sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)\
&=mathbbE_mu left[ sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)right].
endaligntag3
$$



The rest of the derivation is in your question, so let's skip to the end.



$$beginalign
theta_t+1 &= theta_t + alpha G_t fracnabla_theta_tpi(A_tpi(A_t\
&= theta_t + alpha G_t nabla_theta_t ln pi(A_t|S_t,theta_t)
endaligntag4$$



Remember, $(4)$ says exactly the same thing as $(2)$, so REINFORCE just updates parameters in the direction that will most increase $J$. (Because we sample from an expectation in the derivation, the parameter step in REINFORCE is actually an unbiased estimate of the maximizing step.)




Alright, but how do we actually get this gradient? Well, you use the chain rule of derivatives (backpropagation). Practically, though, both Tensorflow and PyTorch can take all the derivatives for you.



Tensorflow, for example, has a minimize() method in its Optimizer class that takes a loss function as an input. Given a function of the parameters of the network, it will do the calculus for you to determine which way to update the parameters in order to minimize that function. But we don't want to minimize. We want to maximize! So just include a negative sign.



In our case, the function we want to minimize is
$$-G_tln pi(A_t|S_t,theta_t).$$



This corresponds to stochastic gradient descent ($G_t$ is not a function of $theta_t$).



You might want to do minibatch gradient descent on each episode of experience in order to get a better (lower variance) estimate of $nabla_theta_t J$. If so, you would instead minimize
$$-sum_t G_tln pi(A_t|S_t,theta_t),$$
where $theta_t$ would be constant for different values of $t$ within the same episode. Technically, minibatch gradient descent updates parameters in the average estimated maximizing direction, but the scaling factor $1/N$ can be absorbed into the learning rate.






share|improve this answer











$endgroup$












  • $begingroup$
    Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
    $endgroup$
    – Hanzy
    2 hours ago















2












$begingroup$

The first part of this answer is a little background that might bolster your intuition for what's going on. The second part is the more practical and direct answer to your question.




The gradient is just the generalization of the derivative to multivariable functions. The gradient of a function at a certain point is a vector that points in the direction of the steepest increase of that function.



Usually, we take a derivative/gradient of some loss function $mathcalL$ because we want to minimize that loss. So we update our parameters in the direction opposite the direction of the gradient.



$$theta_t+1 = theta_t - alphanabla_theta_t mathcalL tag1$$



In policy gradient methods, we're not trying to minimize a loss function. Actually, we're trying to maximize some measure $J$ of the performance of our agent. So now we want to update parameters in the same direction as the gradient.



$$theta_t+1 = theta_t + alphanabla_theta_t J tag2$$



In the episodic case, $J$ is the value of the starting state. In the continuing case, $J$ is the average reward. It just so happens that a nice theorem called the Policy Gradient Theorem applies to both cases. This theorem states that



$$beginalign
nabla_theta_tJ(theta_t) &propto sum_s mu(s)sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)\
&=mathbbE_mu left[ sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)right].
endaligntag3
$$



The rest of the derivation is in your question, so let's skip to the end.



$$beginalign
theta_t+1 &= theta_t + alpha G_t fracnabla_theta_tpi(A_tpi(A_t\
&= theta_t + alpha G_t nabla_theta_t ln pi(A_t|S_t,theta_t)
endaligntag4$$



Remember, $(4)$ says exactly the same thing as $(2)$, so REINFORCE just updates parameters in the direction that will most increase $J$. (Because we sample from an expectation in the derivation, the parameter step in REINFORCE is actually an unbiased estimate of the maximizing step.)




Alright, but how do we actually get this gradient? Well, you use the chain rule of derivatives (backpropagation). Practically, though, both Tensorflow and PyTorch can take all the derivatives for you.



Tensorflow, for example, has a minimize() method in its Optimizer class that takes a loss function as an input. Given a function of the parameters of the network, it will do the calculus for you to determine which way to update the parameters in order to minimize that function. But we don't want to minimize. We want to maximize! So just include a negative sign.



In our case, the function we want to minimize is
$$-G_tln pi(A_t|S_t,theta_t).$$



This corresponds to stochastic gradient descent ($G_t$ is not a function of $theta_t$).



You might want to do minibatch gradient descent on each episode of experience in order to get a better (lower variance) estimate of $nabla_theta_t J$. If so, you would instead minimize
$$-sum_t G_tln pi(A_t|S_t,theta_t),$$
where $theta_t$ would be constant for different values of $t$ within the same episode. Technically, minibatch gradient descent updates parameters in the average estimated maximizing direction, but the scaling factor $1/N$ can be absorbed into the learning rate.






share|improve this answer











$endgroup$












  • $begingroup$
    Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
    $endgroup$
    – Hanzy
    2 hours ago













2












2








2





$begingroup$

The first part of this answer is a little background that might bolster your intuition for what's going on. The second part is the more practical and direct answer to your question.




The gradient is just the generalization of the derivative to multivariable functions. The gradient of a function at a certain point is a vector that points in the direction of the steepest increase of that function.



Usually, we take a derivative/gradient of some loss function $mathcalL$ because we want to minimize that loss. So we update our parameters in the direction opposite the direction of the gradient.



$$theta_t+1 = theta_t - alphanabla_theta_t mathcalL tag1$$



In policy gradient methods, we're not trying to minimize a loss function. Actually, we're trying to maximize some measure $J$ of the performance of our agent. So now we want to update parameters in the same direction as the gradient.



$$theta_t+1 = theta_t + alphanabla_theta_t J tag2$$



In the episodic case, $J$ is the value of the starting state. In the continuing case, $J$ is the average reward. It just so happens that a nice theorem called the Policy Gradient Theorem applies to both cases. This theorem states that



$$beginalign
nabla_theta_tJ(theta_t) &propto sum_s mu(s)sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)\
&=mathbbE_mu left[ sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)right].
endaligntag3
$$



The rest of the derivation is in your question, so let's skip to the end.



$$beginalign
theta_t+1 &= theta_t + alpha G_t fracnabla_theta_tpi(A_tpi(A_t\
&= theta_t + alpha G_t nabla_theta_t ln pi(A_t|S_t,theta_t)
endaligntag4$$



Remember, $(4)$ says exactly the same thing as $(2)$, so REINFORCE just updates parameters in the direction that will most increase $J$. (Because we sample from an expectation in the derivation, the parameter step in REINFORCE is actually an unbiased estimate of the maximizing step.)




Alright, but how do we actually get this gradient? Well, you use the chain rule of derivatives (backpropagation). Practically, though, both Tensorflow and PyTorch can take all the derivatives for you.



Tensorflow, for example, has a minimize() method in its Optimizer class that takes a loss function as an input. Given a function of the parameters of the network, it will do the calculus for you to determine which way to update the parameters in order to minimize that function. But we don't want to minimize. We want to maximize! So just include a negative sign.



In our case, the function we want to minimize is
$$-G_tln pi(A_t|S_t,theta_t).$$



This corresponds to stochastic gradient descent ($G_t$ is not a function of $theta_t$).



You might want to do minibatch gradient descent on each episode of experience in order to get a better (lower variance) estimate of $nabla_theta_t J$. If so, you would instead minimize
$$-sum_t G_tln pi(A_t|S_t,theta_t),$$
where $theta_t$ would be constant for different values of $t$ within the same episode. Technically, minibatch gradient descent updates parameters in the average estimated maximizing direction, but the scaling factor $1/N$ can be absorbed into the learning rate.






share|improve this answer











$endgroup$



The first part of this answer is a little background that might bolster your intuition for what's going on. The second part is the more practical and direct answer to your question.




The gradient is just the generalization of the derivative to multivariable functions. The gradient of a function at a certain point is a vector that points in the direction of the steepest increase of that function.



Usually, we take a derivative/gradient of some loss function $mathcalL$ because we want to minimize that loss. So we update our parameters in the direction opposite the direction of the gradient.



$$theta_t+1 = theta_t - alphanabla_theta_t mathcalL tag1$$



In policy gradient methods, we're not trying to minimize a loss function. Actually, we're trying to maximize some measure $J$ of the performance of our agent. So now we want to update parameters in the same direction as the gradient.



$$theta_t+1 = theta_t + alphanabla_theta_t J tag2$$



In the episodic case, $J$ is the value of the starting state. In the continuing case, $J$ is the average reward. It just so happens that a nice theorem called the Policy Gradient Theorem applies to both cases. This theorem states that



$$beginalign
nabla_theta_tJ(theta_t) &propto sum_s mu(s)sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)\
&=mathbbE_mu left[ sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)right].
endaligntag3
$$



The rest of the derivation is in your question, so let's skip to the end.



$$beginalign
theta_t+1 &= theta_t + alpha G_t fracnabla_theta_tpi(A_tpi(A_t\
&= theta_t + alpha G_t nabla_theta_t ln pi(A_t|S_t,theta_t)
endaligntag4$$



Remember, $(4)$ says exactly the same thing as $(2)$, so REINFORCE just updates parameters in the direction that will most increase $J$. (Because we sample from an expectation in the derivation, the parameter step in REINFORCE is actually an unbiased estimate of the maximizing step.)




Alright, but how do we actually get this gradient? Well, you use the chain rule of derivatives (backpropagation). Practically, though, both Tensorflow and PyTorch can take all the derivatives for you.



Tensorflow, for example, has a minimize() method in its Optimizer class that takes a loss function as an input. Given a function of the parameters of the network, it will do the calculus for you to determine which way to update the parameters in order to minimize that function. But we don't want to minimize. We want to maximize! So just include a negative sign.



In our case, the function we want to minimize is
$$-G_tln pi(A_t|S_t,theta_t).$$



This corresponds to stochastic gradient descent ($G_t$ is not a function of $theta_t$).



You might want to do minibatch gradient descent on each episode of experience in order to get a better (lower variance) estimate of $nabla_theta_t J$. If so, you would instead minimize
$$-sum_t G_tln pi(A_t|S_t,theta_t),$$
where $theta_t$ would be constant for different values of $t$ within the same episode. Technically, minibatch gradient descent updates parameters in the average estimated maximizing direction, but the scaling factor $1/N$ can be absorbed into the learning rate.







share|improve this answer














share|improve this answer



share|improve this answer








edited 17 mins ago

























answered 2 hours ago









Philip RaeisghasemPhilip Raeisghasem

1,169121




1,169121











  • $begingroup$
    Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
    $endgroup$
    – Hanzy
    2 hours ago
















  • $begingroup$
    Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
    $endgroup$
    – Hanzy
    2 hours ago















$begingroup$
Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
$endgroup$
– Hanzy
2 hours ago




$begingroup$
Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
$endgroup$
– Hanzy
2 hours ago

















draft saved

draft discarded
















































Thanks for contributing an answer to Artificial Intelligence Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fai.stackexchange.com%2fquestions%2f11929%2fhow-is-the-policy-gradient-calculated-in-reinforce%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Möglingen Índice Localización Historia Demografía Referencias Enlaces externos Menú de navegación48°53′18″N 9°07′45″E / 48.888333333333, 9.129166666666748°53′18″N 9°07′45″E / 48.888333333333, 9.1291666666667Sitio web oficial Mapa de Möglingen«Gemeinden in Deutschland nach Fläche, Bevölkerung und Postleitzahl am 30.09.2016»Möglingen

Virtualbox - Configuration error: Querying “UUID” failed (VERR_CFGM_VALUE_NOT_FOUND)“VERR_SUPLIB_WORLD_WRITABLE” error when trying to installing OS in virtualboxVirtual Box Kernel errorFailed to open a seesion for the virtual machineFailed to open a session for the virtual machineUbuntu 14.04 LTS Virtualbox errorcan't use VM VirtualBoxusing virtualboxI can't run Linux-64 Bit on VirtualBoxUnable to insert the virtual optical disk (VBoxguestaddition) in virtual machine for ubuntu server in win 10VirtuaBox in Ubuntu 18.04 Issues with Win10.ISO Installation

Antonio De Lisio Carrera Referencias Menú de navegación«Caracas: evolución relacional multipleja»«Cuando los gobiernos subestiman a las localidades: L a Iniciativa para la Integración de la Infraestructura Regional Suramericana (IIRSA) en la frontera Colombo-Venezolana»«Maestría en Planificación Integral del Ambiente»«La Metrópoli Caraqueña: Expansión Simplificadora o Articulación Diversificante»«La Metrópoli Caraqueña: Expansión Simplificadora o Articulación Diversificante»«Conózcanos»«Caracas: evolución relacional multipleja»«La Metrópoli Caraqueña: Expansión Simplificadora o Articulación Diversificante»