📝 Notes: Matrix Calculus Made Simple: Numerator Layout vs. Denominator Layout

Posted on 2022-03-01 Edited on 2022-08-30 In Knowledge Base

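When differentiating with respect to matrices or vectors, the numerator and denominator layout conventions are a frequent source of confusion, for example when to transpose and when not to. This post gives a concise overview of commonly used matrix and vector differentiation techniques.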

Simple Examples

First, differentiate \(\mathbf{a}^{\top} \mathbf{x}\) with respect to the vector \(\mathbf{x}\). As a concrete example, let \(\mathbf{a} = \left[\begin{array}{l} 1 \\ 2 \end{array}\right]\) and \(\mathbf{x} = \left[\begin{array}{l} x_1 \\ x_2 \end{array}\right]\), so that \(\mathbf{a}^{\top} \mathbf{x} =x_1+ 2x_2\).

Then

\[ \begin{aligned} \frac{\partial \mathbf{a}^{\top}\mathbf{x}}{\partial \mathbf{x}} = & {\left[\begin{array}{l} \frac{\partial (x_1+ 2x_2)}{\partial x_{1}} \\ \frac{\partial (x_1+ 2x_2)}{\partial x_{2}} \end{array}\right] } \\ =& {\left[\begin{array}{l} 1 \\ 2 \end{array}\right] } \\ =& \mathbf{a} \quad \text{(denominator layout)} \end{aligned} \]

Note that the result above is arranged in the denominator layout; see the general form in the next section for details.
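The result can be sanity-checked numerically. The following is a minimal sketch using only the Python standard library; the helper `numerical_gradient`, the step size `eps`, and the evaluation point `x0` are illustrative assumptions, not part of the original derivation. It approximates each partial derivative with a central difference and collects them in the order of the elements of \(\mathbf{x}\), which corresponds to the denominator layout.

```python
def f(x, a):
    """Scalar function f(x) = a^T x."""
    return sum(ai * xi for ai, xi in zip(a, x))


def numerical_gradient(func, x, eps=1e-6):
    """Central-difference gradient; the list stands for a column vector."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((func(x_plus) - func(x_minus)) / (2 * eps))
    return grad


a = [1.0, 2.0]
x0 = [0.3, -0.7]  # arbitrary evaluation point (assumption)
grad = numerical_gradient(lambda x: f(x, a), x0)
print(grad)  # should agree with a up to floating-point error
```

Because \(\mathbf{a}^{\top}\mathbf{x}\) is linear in \(\mathbf{x}\), the gradient does not depend on the chosen point.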

Next, differentiate \(\mathbf{A} \mathbf{x}\) with respect to the vector \(\mathbf{x}\). A concrete example makes the derivation explicit.

Let \(\mathbf{A} = \left[\begin{array}{ll} 1 & 2 \\ 3 & 4 \end{array}\right]\) and \(\mathbf{x} = \left[\begin{array}{l} x_1 \\ x_2 \end{array}\right]\). Then:

\[ \begin{aligned} \mathbf{A x} =& {\left[\begin{array}{ll} 1 & 2 \\ 3 & 4 \end{array}\right]\left[\begin{array}{l} x_{1} \\ x_{2} \end{array}\right] } \\ =& {\left[\begin{array}{l} x_{1}+2 x_{2} \\ 3 x_{1}+4 x_{2} \end{array}\right] } \\ =&\left[\begin{array}{c} f_{1} \\ f_{2} \end{array}\right] \end{aligned} \]

So,

\[ \begin{aligned} \frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = & {\left[\begin{array}{ll} \frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{1}} \\ \frac{\partial f_{1}}{\partial x_{2}} & \frac{\partial f_{2}}{\partial x_{2}} \end{array}\right] } \\ =& {\left[\begin{array}{ll} 1 & 3 \\ 2 & 4 \end{array}\right] } \\ =& \mathbf{A}^{\top} \quad \text{(denominator layout)} \end{aligned} \]

or,

\[ \begin{aligned} \frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = & {\left[\begin{array}{ll} \frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{1}}{\partial x_{2}} \\ \frac{\partial f_{2}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{2}} \end{array}\right] } \\ =& {\left[\begin{array}{ll} 1 & 2 \\ 3 & 4 \end{array}\right] } \\ =& \mathbf{A} \quad \text{(numerator layout)} \end{aligned} \]

These two arrangements of the same partial derivatives illustrate the denominator layout and the numerator layout, respectively.
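The two arrangements can be reproduced numerically. The sketch below assumes a hypothetical helper `jacobian_numerator` that fills entry `[i][j]` with a central-difference estimate of \(\partial f_i / \partial x_j\); transposing that matrix gives the denominator-layout arrangement. Only the standard library is used, and the evaluation point `x0` is arbitrary.

```python
def matvec(A, x):
    """Matrix-vector product A x with plain lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def jacobian_numerator(func, x, eps=1e-6):
    """m x n matrix with entry [i][j] = d f_i / d x_j (numerator layout)."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


def transpose(M):
    return [list(col) for col in zip(*M)]


A = [[1.0, 2.0], [3.0, 4.0]]
x0 = [0.5, -1.5]  # arbitrary evaluation point (assumption)

J_num = jacobian_numerator(lambda x: matvec(A, x), x0)  # approximately A
J_den = transpose(J_num)                                # approximately A^T
print(J_num)
print(J_den)
```

The two matrices hold the same four numbers; only the placement of \(\partial f_i / \partial x_j\) differs.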

General Form

A vector can be regarded as a matrix with a single column; by default, vectors are arranged as columns. Consider the derivative of a vector with respect to a vector, \(\partial \mathbf{y} / \partial \mathbf{x}\), where \(\mathbf{y}=\left[\begin{array}{lll}y_{1} & \cdots & y_{m}\end{array}\right]^{\top}\) and \(\mathbf{x}=\left[\begin{array}{lll}x_{1} & \cdots & x_{n}\end{array}\right]^{\top}\).

Then \(\partial \mathbf{y} / \partial \mathbf{x}\) is a matrix with \(m \times n\) elements. How should this matrix be organized? There are two conventions: the numerator layout and the denominator layout.

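Numerator Layout

In one sentence: arrange the result the way the numerator is arranged. However the numerator is laid out, the derivative is laid out the same way, for example: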

\[ \begin{gathered} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right] \\ \equiv \frac{\partial \mathbf{y}}{\partial \mathbf{x}^{\top}} \end{gathered} \]

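In the result above, the elements of the numerator \(\mathbf{y}\), indexed \(1, \dots, m\), run down the column just as they do in \(\mathbf{y}\) itself, so \(\frac{\partial \mathbf{y}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}\). This form is also called the Jacobian matrix³.

When \(y\) is a scalar and \(\mathbf{x}\) is a vector: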

\[ \frac{\partial y}{\partial \mathbf{x}}=\left[\begin{array}{lll} \frac{\partial y}{\partial x_{1}} & \cdots & \frac{\partial y}{\partial x_{n}} \end{array}\right] \equiv \frac{\partial y}{\partial \mathbf{x}^{\top}} \]

The numerator layout is not commonly used for arranging the derivative of a scalar with respect to a vector.

Denominator Layout

\[ \frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right] \]

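In the result above, the elements of the denominator \(\mathbf{x}\), indexed \(1, \dots, n\), run down the column just as they do in \(\mathbf{x}\) itself, so \(\frac{\partial \mathbf{y}}{\partial \mathbf{x}} \in \mathbb{R}^{n \times m}\). This is the transpose of the numerator-layout result.

When \(y\) is a scalar and \(\mathbf{x}\) is a vector: \[ \frac{\partial y}{\partial \mathbf{x}}=\left[\begin{array}{c} \frac{\partial y}{\partial x_{1}} \\ \vdots \\ \frac{\partial y}{\partial x_{n}} \end{array}\right] \]

This scalar-by-vector case is very common, and its result is usually arranged in the denominator layout.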
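As a small worked example of the two shapes (the function below is an illustrative choice, not taken from the original), let \(m = 2\), \(n = 3\) and \(\mathbf{y} = \left[\begin{array}{cc} x_1 x_2 & x_2 + x_3^2 \end{array}\right]^{\top}\). Then:

\[ \begin{aligned} \frac{\partial \mathbf{y}}{\partial \mathbf{x}} &= \left[\begin{array}{ccc} x_2 & x_1 & 0 \\ 0 & 1 & 2 x_3 \end{array}\right] \in \mathbb{R}^{2 \times 3} \quad \text{(numerator layout)} \\ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} &= \left[\begin{array}{cc} x_2 & 0 \\ x_1 & 1 \\ 0 & 2 x_3 \end{array}\right] \in \mathbb{R}^{3 \times 2} \quad \text{(denominator layout)} \end{aligned} \]

Each row of the numerator-layout matrix belongs to one component of \(\mathbf{y}\), while each row of the denominator-layout matrix belongs to one component of \(\mathbf{x}\).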

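The way the two layouts arrange the result of a vector derivative is illustrated in the figure below¹:

[Figure: arrangement of the derivative in the numerator layout and in the denominator layout]

The two forms are easy to confuse (one typically wavers over whether a transpose is needed), so always state which layout you are using! In practice, however, papers rarely say which layout they use, and you have to infer it from the context of the formulas. If the author does not say and you would rather not work it out, you can assume a "mixed layout": \(\frac{\partial \mathbf{y}}{\partial {x}}\) (vector by scalar) follows the numerator layout, and \(\frac{\partial {y}}{\partial \mathbf{x}}\) (scalar by vector) follows the denominator layout³.

Taking the denominator layout as an example, commonly used matrix derivative formulas include: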

\[ \begin{aligned} \frac{\partial \mathbf{x}^{\top} \mathbf{a}}{\partial \mathbf{x}}&=\mathbf{a} \\ \frac{\partial \mathbf{A} \mathbf{x}}{\partial \mathbf{x}}&=\mathbf{A}^{\top} \\ \frac{\partial \mathbf{x}^{\top} \mathbf{A} \mathbf{x}}{\partial \mathbf{x}}&=\left(\mathbf{A}+\mathbf{A}^{\top}\right) \mathbf{x} \\ \frac{\partial \mathbf{u}^{\top}}{\partial \mathbf{x}}&= \left( \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \right)^{\top} \\ \end{aligned} \]

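A small trick helps here: the denominator layout introduces a transpose. Why? The denominator layout arranges the result according to the denominator (usually a column), so the numerator is "forced" to be transposed, and this shows up as a transpose in the derivative.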
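The quadratic-form identity \(\partial(\mathbf{x}^{\top}\mathbf{A}\mathbf{x})/\partial\mathbf{x} = (\mathbf{A}+\mathbf{A}^{\top})\mathbf{x}\) is easy to misremember as \(2\mathbf{A}\mathbf{x}\), which holds only for symmetric \(\mathbf{A}\). A minimal standard-library sketch can check it by central differences; the non-symmetric matrix, the point `x0`, and the helper names are assumptions made for illustration.

```python
def quad_form(A, x):
    """Scalar x^T A x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def numerical_gradient(func, x, eps=1e-6):
    """Central-difference gradient; the list stands for a column vector."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((func(x_plus) - func(x_minus)) / (2 * eps))
    return grad


A = [[1.0, 2.0], [3.0, 4.0]]  # deliberately non-symmetric
x0 = [0.5, -1.5]              # arbitrary evaluation point (assumption)
n = len(x0)

numeric = numerical_gradient(lambda x: quad_form(A, x), x0)
# Closed form in the denominator layout: (A + A^T) x
closed = [sum((A[i][j] + A[j][i]) * x0[j] for j in range(n)) for i in range(n)]
# What one would get by wrongly assuming symmetry: 2 A x
wrong = [2.0 * sum(A[i][j] * x0[j] for j in range(n)) for i in range(n)]

print(numeric, closed, wrong)
```

For this choice of `A` and `x0`, `closed` matches the numerical gradient while `wrong` does not.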

When \(\mathbf{W}\) is a symmetric matrix, the following formulas hold²: \[ \begin{aligned} \frac{\partial}{\partial \mathbf{s}}(\mathbf{x}-\mathbf{A} \mathbf{s})^{\top} \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) &=-2 \mathbf{A}^{\top} \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) \\ \frac{\partial}{\partial \mathbf{x}}(\mathbf{x}-\mathbf{s})^{\top} \mathbf{W}(\mathbf{x}-\mathbf{s}) &=2 \mathbf{W}(\mathbf{x}-\mathbf{s}) \\ \frac{\partial}{\partial \mathbf{s}}(\mathbf{x}-\mathbf{s})^{\top} \mathbf{W}(\mathbf{x}-\mathbf{s}) &=-2 \mathbf{W}(\mathbf{x}-\mathbf{s}) \\ \frac{\partial}{\partial \mathbf{x}}(\mathbf{x}-\mathbf{A} \mathbf{s})^{\top} \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) &=2 \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) \\ \frac{\partial}{\partial \mathbf{A}}(\mathbf{x}-\mathbf{A} \mathbf{s})^{\top} \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) &=-2 \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) \mathbf{s}^{\top} \end{aligned} \]
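The first of these identities, the derivative with respect to \(\mathbf{s}\) of the weighted residual, can be spot-checked in the same way. This is a sketch under stated assumptions: the sizes (\(\mathbf{x} \in \mathbb{R}^3\), \(\mathbf{s} \in \mathbb{R}^2\)), all numeric values, and the helper names are made up for illustration, and `W` is chosen symmetric as the formula requires.

```python
def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def numerical_gradient(func, x, eps=1e-6):
    """Central-difference gradient; the list stands for a column vector."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((func(x_plus) - func(x_minus)) / (2 * eps))
    return grad


def residual(s, x, A):
    """r = x - A s."""
    return [xi - yi for xi, yi in zip(x, matvec(A, s))]


def objective(s, x, A, W):
    """Scalar (x - A s)^T W (x - A s)."""
    r = residual(s, x, A)
    return sum(ri * wi for ri, wi in zip(r, matvec(W, r)))


x = [1.0, 2.0, 3.0]
A = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
W = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 4.0]]  # symmetric
s0 = [0.2, -0.4]  # arbitrary evaluation point (assumption)

numeric = numerical_gradient(lambda s: objective(s, x, A, W), s0)
# Closed form: -2 A^T W (x - A s)
closed = [-2.0 * v for v in matvec(transpose(A), matvec(W, residual(s0, x, A)))]
print(numeric, closed)
```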

Summary³

Vector-by-vector derivatives

Scalar-by-vector derivatives

Pay particular attention to the following: \[ \begin{aligned} \frac{\partial \mathbf{u}^{\top} \mathbf{v}}{\partial \mathbf{x}} &= \mathbf{u}^{\top} \frac{\partial \mathbf{v}}{\partial \mathbf{x}}+\mathbf{v}^{\top} \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \quad \text{(numerator layout)} \\ \frac{\partial \mathbf{u}^{\top} \mathbf{v}}{\partial \mathbf{x}} &= \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \mathbf{v}+\frac{\partial \mathbf{v}}{\partial \mathbf{x}} \mathbf{u} \quad \text{(denominator layout)} \end{aligned} \]

where \(\mathbf{u} = \mathbf{u(x)}\), \(\mathbf{v} = \mathbf{v(x)}\), and \(\mathbf{u}^{\top}\mathbf{v}\) is a scalar.
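As a worked instance of the denominator-layout product rule, take \(\mathbf{u} = \mathbf{x}\) and \(\mathbf{v} = \mathbf{A}\mathbf{x}\). In the denominator layout \(\partial \mathbf{x} / \partial \mathbf{x} = \mathbf{I}\) and \(\partial (\mathbf{A}\mathbf{x}) / \partial \mathbf{x} = \mathbf{A}^{\top}\), so the rule recovers the quadratic-form formula:

\[ \begin{aligned} \frac{\partial \mathbf{x}^{\top} \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} &= \frac{\partial \mathbf{x}}{\partial \mathbf{x}} \mathbf{A}\mathbf{x} + \frac{\partial (\mathbf{A}\mathbf{x})}{\partial \mathbf{x}} \mathbf{x} \\ &= \mathbf{I}\,\mathbf{A}\mathbf{x} + \mathbf{A}^{\top}\mathbf{x} \\ &= \left(\mathbf{A}+\mathbf{A}^{\top}\right)\mathbf{x} \end{aligned} \]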

Application

Minimize the error \(E\):

\[ E=\sum_{i=1}^{n}\left(\mathbf{a}_{i}^{\top} \mathbf{x}-b_{i}\right)^{2}=\|\mathbf{A} \mathbf{x}-\mathbf{b}\|^{2} \]

The derivation proceeds as follows:

\[ \begin{aligned} E=\|\mathbf{A x}-\mathbf{b}\|^{2} &=(\mathbf{A x}-\mathbf{b})^{\top}(\mathbf{A x}-\mathbf{b}) \\ &=\left(\mathbf{x}^{\top} \mathbf{A}^{\top}-\mathbf{b}^{\top}\right)(\mathbf{A x}-\mathbf{b}) \\ &=\mathbf{x}^{\top} \mathbf{A}^{\top} \mathbf{A x}-\mathbf{x}^{\top} \mathbf{A}^{\top} \mathbf{b}-\mathbf{b}^{\top} \mathbf{A x}+\mathbf{b}^{\top} \mathbf{b} \end{aligned} \]

Differentiating each term (in the denominator layout):

\[ \begin{aligned} \frac{ \partial{ \mathbf{x}^{\top} \mathbf{A}^{\top} \mathbf{A x}}} {\partial \mathbf{x}} =(\mathbf{A}^{\top} \mathbf{A} + \mathbf{A}^{\top} \mathbf{A}) \mathbf{x} &= 2\mathbf{A}^{\top} \mathbf{A}\mathbf{x} \\ \frac{\partial{ \mathbf{x}^{\top} \mathbf{A}^{\top} \mathbf{b}}} {\partial \mathbf{x}} &=\mathbf{A}^{\top} \mathbf{b} \\ \frac{\partial{ \mathbf{b}^{\top} \mathbf{A x}}} {\partial \mathbf{x}} &=(\mathbf{b}^{\top} \mathbf{A})^{\top} = \mathbf{A^{\top}b} \\ \frac{\partial{ \mathbf{b}^{\top} \mathbf{b}}} {\partial \mathbf{x}} &= \mathbf{0} \quad \text{(column vector)} \end{aligned} \]

Therefore:

\[ \begin{aligned} \frac{\partial{ E}} {\partial \mathbf{x}} &= 2\mathbf{A}^{\top} \mathbf{A}\mathbf{x} - \mathbf{A}^{\top} \mathbf{b} - \mathbf{A^{\top}b} + \mathbf{0} \\ &= 2\mathbf{A}^{\top} \mathbf{A}\mathbf{x} - 2\mathbf{A}^{\top} \mathbf{b} \end{aligned} \]

Setting \(\frac{\partial{ E}}{\partial \mathbf{x}} = \mathbf{0}\), we have:

\[ \begin{aligned} 2\mathbf{A}^{\top} \mathbf{A}\mathbf{x} - 2\mathbf{A}^{\top} \mathbf{b} &= \mathbf{0}\\ \mathbf{A}^{\top} \mathbf{A}\mathbf{x} &= \mathbf{A}^{\top} \mathbf{b} \\ \mathbf{x} &= ( \mathbf{A}^{\top} \mathbf{A})^{-1}\mathbf{A}^{\top} \mathbf{b} \end{aligned} \]

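\(\mathbf{x} = ( \mathbf{A}^{\top} \mathbf{A})^{-1}\mathbf{A}^{\top} \mathbf{b}\) is the solution of the linear least-squares problem above, provided that \(\mathbf{A}^{\top}\mathbf{A}\) is invertible (that is, \(\mathbf{A}\) has full column rank).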
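To make the normal equations \(\mathbf{A}^{\top}\mathbf{A}\mathbf{x} = \mathbf{A}^{\top}\mathbf{b}\) concrete, here is a minimal standard-library sketch that fits a straight line to four made-up data points. The data, the two-parameter model, and the helper `solve_2x2` are illustrative assumptions; the explicit 2x2 inverse is used only to keep the example self-contained.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def solve_2x2(M, rhs):
    """Solve M z = rhs for a 2x2 matrix M via its explicit inverse."""
    (m00, m01), (m10, m11) = M
    det = m00 * m11 - m01 * m10
    if abs(det) < 1e-12:
        raise ValueError("A^T A is singular; A needs full column rank")
    return [(m11 * rhs[0] - m01 * rhs[1]) / det,
            (m00 * rhs[1] - m10 * rhs[0]) / det]


# Made-up data: b is roughly 1 + 2 t with small perturbations.
t = [0.0, 1.0, 2.0, 3.0]
b = [1.1, 2.9, 5.2, 6.8]
A = [[1.0, ti] for ti in t]  # each row is a_i^T = [1, t_i]

At = transpose(A)
AtA = matmul(At, A)          # A^T A
Atb = matvec(At, b)          # A^T b
x_hat = solve_2x2(AtA, Atb)  # x = (A^T A)^{-1} A^T b

# Gradient 2 A^T A x - 2 A^T b should vanish at the solution.
grad = [2.0 * g - 2.0 * h for g, h in zip(matvec(AtA, x_hat), Atb)]
print(x_hat, grad)
```

For larger or ill-conditioned problems, forming \((\mathbf{A}^{\top}\mathbf{A})^{-1}\) explicitly is usually avoided in favor of a dedicated least-squares routine such as NumPy's `numpy.linalg.lstsq`.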

References

1. Matrix Differentiation, in Lecture CS5240 Theoretical Foundations in Multimedia, https://www.comp.nus.edu.sg/~cs5240/lecture/matrix-diff.pdf
2. Matrix Cookbook, http://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
3. Matrix calculus, https://en.jinzhao.wiki/wiki/Matrix_calculus
4. Vector/Matrix Calculus: More notes on matrix differentiation.
5. Matrix Differentiation (and some other stuff), Randal J. Barnes, Department of Civil Engineering, University of Minnesota.
6. Matrix Calculus (an online matrix derivative calculator), http://www.matrixcalculus.org/
