Post History

#2: Post edited by Derek Elkins · 2021-09-13T13:09:59Z
Talk about the actual total derivative

Traditional mathematical notation for calculus (both integral and differential) is rather incoherent. I don't think there exists a write-up providing systematic rules that would allow you to correctly and unambiguously parse this kind of notation, i.e. the kind of notation used in a typical undergrad multivariable calculus textbook. By "systematic", I mean you could write a program to do it (and, for simplicity, I'll say the input comes in the form of a subset of MathJax; I'm not asking for optical character recognition). By "correctly and unambiguously", I mean that the program produces one result and that result is the one intended by the author (modulo typos). I'm also not just talking about *inconsistent* notation between authors/books, though that doesn't help either. I'm imagining a scenario where you randomly select a multivariable calculus textbook and build a system that only needs to (correctly) handle notation in the style of that book. Certainly nothing like this for a "traditional" notation is common knowledge. If you try to build a formal system for this notation, you quickly find that it is, at best, non-obvious and unusual.

The fact that traditional notation is unclear doesn't mean that there don't exist *other* notations that *are* clear. As an extreme example, the notation used in the book [Structure and Interpretation of Classical Mechanics](https://groups.csail.mit.edu/mac/users/gjs/6946/sicm-html/book.html) is undoubtedly unambiguous, as it's literally executable code. Of course, it's also quite far from traditional notation. I recommend the [preface](https://groups.csail.mit.edu/mac/users/gjs/6946/sicm-html/book-Z-H-4.html#%_chap_Temp_2) and the [footnote](https://groups.csail.mit.edu/mac/users/gjs/6946/sicm-html/book-Z-H-4.html#call_footnote_Temp_4) quoting Spivak's *Calculus on Manifolds* for more specific and authoritative critiques of traditional notation. (The footnote discusses exactly this chain rule example.)

The starting point for most approaches to clearer notation is the fact that, semantically, differentiation acts on *functions*. Before continuing, note another common conflation: a function, $f$, versus an open expression, $f(x)$. To say that differentiation acts on functions means (syntactically) that it should act on $f$, *not* $f(x)$. This will be illustrated below. The simplest example is differentiating a real-valued real function. We might write the derivative of such a function, $f$, as $Df$. The result is also a function. So, for example, if $f(x) = x^2$ then $(Df)(x) = 2x$. The differential operator $D$ takes highest precedence, so $(Df)(x) = Df(x)$. This should be easy to remember because it doesn't make sense to apply $D$ to a real number $f(x)$. Note that since $Df$ is a function, we need to apply it to an argument. That argument could be anything standing for a real number; in particular, $Df(y) = 2y$ and $Df(3) = 6$. To really make this notation usable, it helps to have a notation for anonymous functions, e.g. $D(x \mapsto x^2) = x \mapsto 2x$.

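To make this concrete, here is a minimal sketch in Python with SymPy (the `D` helper is an illustrative name of my own, not part of SymPy's API) of a derivative operator that acts on a *function* and returns a function:

```python
import sympy as sp

x, y = sp.symbols("x y")

def D(f):
    """Illustrative derivative operator: takes a function, returns a function.

    `f` is a callable representing a real-valued real function; the result
    is again a callable, so it still has to be applied to an argument.
    """
    t = sp.Symbol("_t")
    df = sp.diff(f(t), t)              # differentiate the open expression f(_t)
    return lambda arg: df.subs(t, arg)

f = lambda u: u**2                     # f(x) = x^2

print(D(f)(x))  # 2*x -- (Df)(x)
print(D(f)(y))  # 2*y -- the argument's name is irrelevant
print(D(f)(3))  # 6   -- Df applied to a number
# By contrast, D(f(3)) would be meaningless: f(3) is the number 9, not a function.
```

The anonymous-function notation $x \mapsto x^2$ corresponds directly to the `lambda` here.
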
However, this one-dimensional case is too simple and, as we'll see, a bit misleading. One approach to handling multiparameter differentiation, i.e. partial derivatives, is to have the differential operator indicate which parameter it's operating on; e.g., if $f$ took two arguments, then you might write $\partial_1 f$ to indicate (partial) differentiation with respect to the first argument and $\partial_2 f$ with respect to the second. However, I think it's cleaner and more effective to talk about directional derivatives. In fact, I think directional derivatives make a powerful yet approachable basis for differential calculus. There are other rather elegant foundations too, such as taking the vector derivative as primitive, but that takes a bit more setup, and often you end up working with directional derivatives anyway.

So, a (potentially vector-valued) function $f$ taking $n$ real arguments can instead be viewed as a function taking a single argument that is an $n$-dimensional vector. That is, we can think of $f(x, y, z)$ as being $f(x\mathbf e_x + y\mathbf e_y + z\mathbf e_z)$ where $\mathbf e_x$, $\mathbf e_y$, $\mathbf e_z$ are orthonormal basis vectors of $\mathbb R^3$, in this case. Given a function $f : \mathbb R^n \to \mathbb R^m$ and an $n$-dimensional vector $\mathbf v$, we can define the directional derivative of $f$ in the direction $\mathbf v$ as $$\partial_{\mathbf v}f(\mathbf x) = \lim_{\epsilon \to 0}\frac{f(\mathbf x + \epsilon\mathbf v) - f(\mathbf x)}{\epsilon}$$ Now $\partial_i f$ can be identified with $\partial_{\mathbf e_i} f$, where we (arbitrarily) label the basis vectors $\mathbf e_1, \dots, \mathbf e_n$. The $D$ operator from above is also recovered as the $n=1$ case of $\partial_{\mathbf e_1}$.

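As a sanity check of the definition (a sketch; the particular `f`, point, and direction are arbitrary choices of mine), the limit can be approximated numerically with a small $\epsilon$:

```python
import numpy as np

def directional_derivative(f, x, v, eps=1e-6):
    """Approximate the directional derivative of f at x in the direction v
    via the difference quotient (f(x + eps*v) - f(x)) / eps."""
    return (f(x + eps * v) - f(x)) / eps

# Example: f(x, y) = x^2 + 3y, written as a function of one vector argument.
f = lambda p: p[0]**2 + 3 * p[1]

x = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])

# Analytically, grad f = (2x, 3), so the answer is 2*1.0*0.5 + 3*(-1.0) = -2.
print(directional_derivative(f, x, v))  # ~ -2.0
```
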
For your specific case, we have the functions $\mathbf x : \mathbb R \to \mathbb R^2$ and $z : \mathbb R^2 \to \mathbb R$ where $\mathbf x(t) = x(t)\mathbf e_1 + y(t)\mathbf e_2$. Since $z \circ \mathbf x : \mathbb R \to \mathbb R$, we can apply $D$ to it, i.e. $D(z \circ \mathbf x)$ makes sense. The chain rule then says $$D(z \circ \mathbf x)(t) = \partial_1 z(\mathbf x(t))(\mathbf e_1 \cdot (D\mathbf x)(t)) + \partial_2 z(\mathbf x(t))(\mathbf e_2 \cdot (D\mathbf x)(t))$$ This notation makes your error harder to make and easier to understand. Namely, $\partial z/\partial t$ doesn't make sense, since it means differentiating $z$, a function defined on $\mathbb R^2$, in the direction of a vector in $\mathbb R^1$. This doesn't make sense: vectors in $\mathbb R^1$ aren't vectors in $\mathbb R^2$.^[This distinction is even clearer when we consider the more general case of functions between arbitrary manifolds. In that case, we need to consider vectors in the tangent spaces of those manifolds, and those could be totally different.] (They can certainly be embedded into $\mathbb R^2$, but in many distinct ways.)

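To see the displayed chain rule in action, here is a small numerical check (again a sketch; the particular $z$ and trajectory are arbitrary examples of mine):

```python
import numpy as np

z  = lambda p: p[0]**2 * p[1]                    # z(x, y) = x^2 y
xv = lambda t: np.array([np.cos(t), np.sin(t)])  # x(t) = (cos t, sin t)

def deriv(g, t, eps=1e-6):
    """Central-difference approximation of a derivative at t."""
    return (g(t + eps) - g(t - eps)) / (2 * eps)

t = 0.7

# Left side: D(z ∘ x)(t), differentiating the composite directly.
lhs = deriv(lambda s: z(xv(s)), t)

# Right side: ∂₁z(x(t))·(e₁ · Dx(t)) + ∂₂z(x(t))·(e₂ · Dx(t)).
p   = xv(t)
Dxt = deriv(xv, t)        # Dx(t), a vector in R^2 (componentwise difference)
d1z = 2 * p[0] * p[1]     # ∂₁z(x, y) = 2xy
d2z = p[0]**2             # ∂₂z(x, y) = x^2
rhs = d1z * Dxt[0] + d2z * Dxt[1]

print(lhs, rhs)  # the two agree up to floating-point error
```
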
One thing you might have noticed is that I didn't define a "total derivative"^[We can define what the total derivative actually means, and it only further illustrates the ambiguity and hidden complexity of traditional notation. Here's a simple example. Let $f(x, y) = x^2 + y^2$. The idea is that $x$ and $y$ will represent components of a trajectory. The total derivative of $f$ is then a function $g(x, y, u, v) = 2xu + 2yv$. The idea here is that $$\frac{df(x(t), y(t))}{dt} = g\left(x(t), y(t), \frac{dx}{dt}(t), \frac{dy}{dt}(t)\right)$$ To really describe what's happening in general leads to the notion of a [jet bundle](https://en.wikipedia.org/wiki/Jet_bundle#Vector_fields), which is usually not discussed, even in simplified form, unless you go deep into certain fields. For the purposes of the discussion here, if we actually want the total derivative of $x^2 + y^2$, which would traditionally be written something like $2x\frac{dx}{dt} + 2y\frac{dy}{dt}$, and not just the resulting function of $t$, then $\frac{dx}{dt}$ and $\frac{dy}{dt}$ are effectively new parameters. Technically, they are constrained to be values of some derivative of some trajectory that goes through $(x, y)$, but, for the Euclidean plane, that's no constraint at all. With the notation in this answer, the total derivative of $f$ is $(\mathbf x, \mathbf v) \mapsto \partial_{\mathbf v}f(\mathbf x)$. That is, it's the function which takes the point at which to evaluate the directional derivative of $f$ *and also the direction* (upon which it depends linearly). This notation makes it clear that there are extra parameters involved and what they mean.]. Here the "total" derivative serves to disambiguate whether we mean differentiation of $z$ or of $z \circ \mathbf x$. (This is more ambiguous when $z$ is itself explicitly a function of $t$, so that partial differentiation of $z$ by $t$ also makes sense.) The problem is caused by the common conflation of $f$ with $f(x)$ that I mentioned before. It becomes ambiguous whether $z$ means $z(x, y)$ or $z(x(t), y(t))$ (or $z(x(t), y)$ or $z(x, y(t))$, for that matter).

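Continuing the footnote's example in the same sketch style (`total_derivative` is my own illustrative name): for $f(x, y) = x^2 + y^2$, the total derivative is a function of a point *and* a direction, matching the closed form $g(x, y, u, v) = 2xu + 2yv$:

```python
import numpy as np

f = lambda p: p[0]**2 + p[1]**2       # f(x, y) = x^2 + y^2

def total_derivative(f, p, v, eps=1e-6):
    """(x, v) ↦ ∂_v f(x): the directional derivative of f at p in direction v."""
    return (f(p + eps * v) - f(p)) / eps

def g(x, y, u, v):
    return 2 * x * u + 2 * y * v      # the closed form 2xu + 2yv

# A trajectory (x(t), y(t)) = (t, t^2) supplies the point and the direction.
t = 0.3
xt, yt   = t, t**2                    # the point (x(t), y(t))
dxt, dyt = 1.0, 2 * t                 # its velocity (dx/dt, dy/dt)

print(total_derivative(f, np.array([xt, yt]), np.array([dxt, dyt])))  # ~0.708
print(g(xt, yt, dxt, dyt))                                            # 0.708
```
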
All this said, I'm not saying you should never use traditional notation or that you should only use this notation. Instead, when traditional notation seems confusing, falling back to a function-oriented approach and the directional derivative can help clear things up. Also, it's important to understand that a huge amount of relevant information is left implicit in traditional notation.

#1: Initial revision by Derek Elkins · 2021-09-01T05:02:55Z