📝 Notes: Concise Matrix Differentiation, Numerator Layout and Denominator Layout

When differentiating with respect to matrices or vectors, it is easy to get confused by the numerator/denominator layout conventions, for example when to transpose and when not to. This note gives a concise introduction to commonly used matrix/vector differentiation techniques.

Simple Examples

  1. Differentiating \(\mathbf{a}^{\top} \mathbf{x}\) with respect to the vector \(\mathbf{x}\). As a concrete example, let \(\mathbf{a} = \left[\begin{array}{l} 1 \\ 2 \end{array}\right]\) and \(\mathbf{x} = \left[\begin{array}{l} x_1 \\ x_2 \end{array}\right]\); then \(\mathbf{a}^{\top} \mathbf{x} = x_1 + 2x_2\).

Hence,

\[ \begin{aligned} \frac{\partial \mathbf{a}^{\top}\mathbf{x}}{\partial \mathbf{x}} = & {\left[\begin{array}{l} \frac{\partial (x_1+ 2x_2)}{\partial x_{1}} \\ \frac{\partial (x_1+ 2x_2)}{\partial x_{2}} \end{array}\right] } \\ =& {\left[\begin{array}{l} 1 \\ 2 \end{array}\right] } \\ =& \ \mathbf{a} \quad (\text{denominator layout}) \end{aligned} \]

Note that the result above is arranged in denominator layout; see the general form in the next section.
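As a quick numerical sanity check (a numpy sketch of my own, not part of the original note), a central-difference gradient of \(\mathbf{a}^{\top}\mathbf{x}\) indeed recovers \(\mathbf{a}\) when stacked as a column (denominator layout):

```python
import numpy as np

# Example 1: f(x) = a^T x with a = [1, 2]^T.
a = np.array([1.0, 2.0])
f = lambda x: a @ x

# Central-difference gradient, stacked as a column (denominator layout).
x0 = np.array([0.3, -0.7])
eps = 1e-6
grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(2)])

print(grad)                   # ≈ [1. 2.]
print(np.allclose(grad, a))   # True: d(a^T x)/dx = a in denominator layout
```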

  2. Differentiating \(\mathbf{A} \mathbf{x}\) with respect to the vector \(\mathbf{x}\). Again, the derivation can be worked through with a concrete example.

\(\mathbf{A} = \left[\begin{array}{ll} 1 & 2 \\ 3 & 4 \end{array}\right]\)\(\mathbf{x} = \left[\begin{array}{ll} x_1 \\ x_2 \end{array}\right]\) 则:

\[ \begin{aligned} \mathbf{A x} =& {\left[\begin{array}{ll} 1 & 2 \\ 3 & 4 \end{array}\right]\left[\begin{array}{l} x_{1} \\ x_{2} \end{array}\right] } \\ =& {\left[\begin{array}{l} x_{1}+2 x_{2} \\ 3 x_{1}+4 x_{2} \end{array}\right] } \\ =&\left[\begin{array}{c} f_{1} \\ f_{2} \end{array}\right] \end{aligned} \]

Therefore,

\[ \begin{aligned} \frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = & {\left[\begin{array}{ll} \frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{1}} \\ \frac{\partial f_{1}}{\partial x_{2}} & \frac{\partial f_{2}}{\partial x_{2}} \end{array}\right] } \\ =& {\left[\begin{array}{ll} 1 & 3 \\ 2 & 4 \end{array}\right] } \\ =& \ \mathbf{A}^{\top} \quad (\text{denominator layout}) \end{aligned} \]

Or, equivalently,

\[ \begin{aligned} \frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = & {\left[\begin{array}{ll} \frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{1}}{\partial x_{2}} \\ \frac{\partial f_{2}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{2}} \end{array}\right] } \\ =& {\left[\begin{array}{ll} 1 & 2 \\ 3 & 4 \end{array}\right] } \\ =& \ \mathbf{A} \quad (\text{numerator layout}) \end{aligned} \]

The two arrangements of the result above illustrate the denominator layout and the numerator layout, respectively.
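The second example can be checked the same way. The sketch below (again my own illustration, using the concrete \(\mathbf{A}\) above) assembles the numerical Jacobian of \(f(\mathbf{x}) = \mathbf{A}\mathbf{x}\) and confirms that the numerator layout gives \(\mathbf{A}\), while its transpose is the denominator-layout result \(\mathbf{A}^{\top}\):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
f = lambda x: A @ x

# Numerator-layout Jacobian: row i = f_i, column j = x_j.
x0 = np.array([0.5, -1.0])
eps = 1e-6
J_numerator = np.column_stack([
    (f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(2)
])
J_denominator = J_numerator.T   # denominator layout is the transpose

print(np.allclose(J_numerator, A))        # True: numerator layout gives A
print(np.allclose(J_denominator, A.T))    # True: denominator layout gives A^T
```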

General Form

A vector can be viewed as a one-dimensional matrix and is arranged as a column by default. Consider a vector-by-vector derivative \(\partial \mathbf{y} / \partial \mathbf{x}\), where \(\mathbf{y}=\left[\begin{array}{lll}y_{1} & \cdots & y_{m}\end{array}\right]^{\top}\) and \(\mathbf{x}=\left[\begin{array}{lll}x_{1} & \cdots & x_{n}\end{array}\right]^{\top}\).

Then \(\partial \mathbf{y} / \partial \mathbf{x}\) is a matrix with \(m \times n\) elements. How should this matrix be organized? There are two conventions: the numerator layout and the denominator layout.

Numerator Layout

In one sentence: the result is arranged the same way the numerator is arranged; however the numerator was laid out before differentiation, so is the result. For example:

\[ \begin{gathered} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right] \\ \equiv \frac{\partial \mathbf{y}}{\partial \mathbf{x}^{\top}} \end{gathered} \]

In the result above, the elements of the numerator \(\mathbf{y}\) are stacked vertically by index \(1, \ldots, m\), just as in \(\mathbf{y}\) itself, so \(\frac{\partial \mathbf{y}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}\). This form is also known as the Jacobian matrix [3].
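To make the general definition concrete, here is a small helper (a hypothetical numerical_jacobian of my own, not taken from the references) that assembles the \(m \times n\) numerator-layout Jacobian entry by entry for an arbitrary vector-valued function:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Numerator-layout Jacobian: entry (i, j) is df_i / dx_j."""
    x = np.asarray(x, dtype=float)
    m = np.atleast_1d(f(x)).shape[0]
    J = np.zeros((m, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)   # central difference in x_j
    return J

# A nonlinear example: y = [x1 * x2, sin(x1)], so m = 2, n = 2.
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x0 = np.array([1.0, 2.0])
J_analytic = np.array([[x0[1], x0[0]],
                       [np.cos(x0[0]), 0.0]])
print(np.allclose(numerical_jacobian(f, x0), J_analytic, atol=1e-6))  # True
```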

\(y\)是标量,\(\mathbf{x}\) 是向量时:

\[ \frac{\partial y}{\partial \mathbf{x}}=\left[\begin{array}{lll} \frac{\partial y}{\partial x_{1}} & \cdots & \frac{\partial y}{\partial x_{n}} \end{array}\right] \equiv \frac{\partial y}{\partial \mathbf{x}^{\top}} \]

This numerator-layout arrangement is not commonly used for scalar-by-vector derivatives.

Denominator Layout

\[ \frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right] \]

In the result above, the elements of the denominator \(\mathbf{x}\) are stacked vertically by index \(1, \ldots, n\), so \(\frac{\partial \mathbf{y}}{\partial \mathbf{x}} \in \mathbb{R}^{n \times m}\).

\(y\)是标量,\(\mathbb{x}\)是向量时: \[ \frac{\partial y}{\partial \mathbf{x}}=\left[\begin{array}{c} \frac{\partial y}{\partial x_{1}} \\ \vdots \\ \frac{\partial y}{\partial x_{n}} \end{array}\right] \]

This scalar-by-vector case is very common, and the result is usually arranged in denominator layout.
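As a small illustration (my own, using \(y = \mathbf{x}^{\top}\mathbf{x}\) as an assumed example), the same gradient entries can be arranged either way; the denominator layout is the usual column vector:

```python
import numpy as np

# Scalar y = x^T x; its gradient with respect to x is 2x.
x0 = np.array([1.0, -2.0, 3.0])
g = 2 * x0

print(g.reshape(-1, 1))   # denominator layout: n x 1 column (the common convention)
print(g.reshape(1, -1))   # numerator layout: 1 x n row, the column's transpose
```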

The two arrangements for vector-by-vector derivatives are illustrated in the figure below [1].

(Figure: illustration of the numerator and denominator layouts.)

The two layouts are easy to mix up (the usual confusion is whether or not to transpose), so whenever you use one, state explicitly which layout you are using! In practice, papers rarely say which layout they use, and you have to infer it from context. If the author does not say and you do not want to work it out, a reasonable default is to assume a "mixed" layout: \(\frac{\partial \mathbf{y}}{\partial {x}}\) follows the numerator layout, while \(\frac{\partial {y}}{\partial \mathbf{x}}\) follows the denominator layout [3].

Taking the denominator layout as an example, some commonly used matrix differentiation identities are:

\[ \begin{aligned} \frac{\partial \mathbf{x}^{\top} \mathbf{a}}{\partial \mathbf{x}}&=\mathbf{a} \\ \frac{\partial \mathbf{A} \mathbf{x}}{\partial \mathbf{x}}&=\mathbf{A}^{\top} \\ \frac{\partial \mathbf{x}^{\top} \mathbf{A} \mathbf{x}}{\partial \mathbf{x}}&=\left(\mathbf{A}+\mathbf{A}^{\top}\right) \mathbf{x} \\ \frac{\partial \mathbf{u}^{\top}}{\partial \mathbf{x}}&= \left( \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \right)^{\top} \\ \end{aligned} \]

Here is a small mnemonic: in denominator layout a transpose tends to appear. Why? Because the denominator layout requires the result to be organized according to the denominator's arrangement (normally a column), so the numerator is "forced" to be transposed, which shows up as a transpose in the result.
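These identities are easy to verify numerically. The sketch below (mine, with a random \(\mathbf{A}\) and \(\mathbf{x}\)) checks the quadratic-form rule \(\partial(\mathbf{x}^{\top}\mathbf{A}\mathbf{x})/\partial\mathbf{x} = (\mathbf{A}+\mathbf{A}^{\top})\mathbf{x}\) in denominator layout:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x0 = rng.standard_normal(n)

f = lambda x: x @ A @ x   # scalar quadratic form x^T A x

# Central-difference gradient, stacked as a column (denominator layout).
eps = 1e-6
grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(n)])

print(np.allclose(grad, (A + A.T) @ x0))  # True, matching the identity above
```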

\(\mathbf{W}\)为对称矩阵时,我们有如下公式2\[ \begin{aligned} \frac{\partial}{\partial \mathbf{s}}(\mathbf{x}-\mathbf{A} \mathbf{s})^{T} \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) &=-2 \mathbf{A}^{T} \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) \\ \frac{\partial}{\partial \mathbf{x}}(\mathbf{x}-\mathbf{s})^{T} \mathbf{W}(\mathbf{x}-\mathbf{s}) &=2 \mathbf{W}(\mathbf{x}-\mathbf{s}) \\ \frac{\partial}{\partial \mathbf{s}}(\mathbf{x}-\mathbf{s})^{T} \mathbf{W}(\mathbf{x}-\mathbf{s}) &=-2 \mathbf{W}(\mathbf{x}-\mathbf{s}) \\ \frac{\partial}{\partial \mathbf{x}}(\mathbf{x}-\mathbf{A} \mathbf{s})^{T} \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) &=2 \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) \\ \frac{\partial}{\partial \mathbf{A}}(\mathbf{x}-\mathbf{A} \mathbf{s})^{T} \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) &=-2 \mathbf{W}(\mathbf{x}-\mathbf{A} \mathbf{s}) \mathbf{s}^{T} \end{aligned} \]

Summary [3]

  • Vector-by-vector derivatives: numerator layout yields the \(m \times n\) Jacobian; denominator layout yields its \(n \times m\) transpose.

  • Scalar-by-vector derivatives: usually arranged in denominator layout as an \(n \times 1\) column.

Pay particular attention to the product rule: \[ \begin{aligned} \frac{\partial \mathbf{u}^{\top} \mathbf{v}}{\partial \mathbf{x}} &= \mathbf{u}^{\top} \frac{\partial \mathbf{v}}{\partial \mathbf{x}}+\mathbf{v}^{\top} \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \quad (\text{numerator layout}) \\ \frac{\partial \mathbf{u}^{\top} \mathbf{v}}{\partial \mathbf{x}} &= \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \mathbf{v}+\frac{\partial \mathbf{v}}{\partial \mathbf{x}} \mathbf{u} \quad (\text{denominator layout}) \end{aligned} \]

where \(\mathbf{u} = \mathbf{u}(\mathbf{x})\), \(\mathbf{v} = \mathbf{v}(\mathbf{x})\), and \(\mathbf{u}^{\top}\mathbf{v}\) is a scalar.
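A quick numerical check of the denominator-layout product rule (my own example, with \(\mathbf{u}(\mathbf{x}) = \mathbf{B}\mathbf{x}\) and \(\mathbf{v}(\mathbf{x}) = \mathbf{x} \odot \mathbf{x}\), the elementwise square):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
x0 = rng.standard_normal(n)

u = lambda x: B @ x          # u(x) = Bx
v = lambda x: x * x          # v(x) = x ⊙ x (elementwise square)
f = lambda x: u(x) @ v(x)    # scalar u^T v

# Denominator-layout Jacobians: du/dx = B^T, dv/dx = diag(2x).
du_dx = B.T
dv_dx = np.diag(2 * x0)
product_rule = du_dx @ v(x0) + dv_dx @ u(x0)

# Central-difference gradient of the scalar f, as a column.
eps = 1e-6
grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
print(np.allclose(grad, product_rule))  # True
```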

Application

Minimize the error \(E\):

\[ E=\sum_{i=1}^{n}\left(\mathbf{a}_{i}^{\top} \mathbf{x}-b_{i}\right)^{2}=\|\mathbf{A} \mathbf{x}-\mathbf{b}\|^{2} \]

The derivation is as follows:

\[ \begin{aligned} E=\|\mathbf{A x}-\mathbf{b}\|^{2} &=(\mathbf{A x}-\mathbf{b})^{\top}(\mathbf{A x}-\mathbf{b}) \\ &=\left(\mathbf{x}^{\top} \mathbf{A}^{\top}-\mathbf{b}^{\top}\right)(\mathbf{A x}-\mathbf{b}) \\ &=\mathbf{x}^{\top} \mathbf{A}^{\top} \mathbf{A x}-\mathbf{x}^{\top} \mathbf{A}^{\top} \mathbf{b}-\mathbf{b}^{\top} \mathbf{A x}+\mathbf{b}^{\top} \mathbf{b} \end{aligned} \]

We differentiate each term (in denominator layout):

\[ \begin{aligned} \frac{ \partial{ \mathbf{x}^{\top} \mathbf{A}^{\top} \mathbf{A x}}} {\partial \mathbf{x}} &=(\mathbf{A}^{\top} \mathbf{A} + \mathbf{A}^{\top} \mathbf{A}) \mathbf{x} = 2\mathbf{A}^{\top} \mathbf{A}\mathbf{x} \\ \frac{\partial{ \mathbf{x}^{\top} \mathbf{A}^{\top} \mathbf{b}}} {\partial \mathbf{x}} &=\mathbf{A}^{\top} \mathbf{b} \\ \frac{\partial{ \mathbf{b}^{\top} \mathbf{A x}}} {\partial \mathbf{x}} &=(\mathbf{b}^{\top} \mathbf{A})^{\top} = \mathbf{A^{\top}b} \\ \frac{\partial{ \mathbf{b}^{\top} \mathbf{b}}} {\partial \mathbf{x}} &= \mathbf{0} \quad (\text{a zero column vector}) \end{aligned} \]

Therefore:

\[ \begin{aligned} \frac{\partial{ E}} {\partial \mathbf{x}} &= 2\mathbf{A}^{\top} \mathbf{A}\mathbf{x} - \mathbf{A}^{\top} \mathbf{b} - \mathbf{A^{\top}b} + \mathbf{0} \\ &= 2\mathbf{A}^{\top} \mathbf{A}\mathbf{x} - 2\mathbf{A}^{\top} \mathbf{b} \end{aligned} \]

\(\frac{\partial{ E}}{\partial \mathbf{x}} = 0\),我们有:

\[ \begin{aligned} 2\mathbf{A}^{\top} \mathbf{A}\mathbf{x} - 2\mathbf{A}^{\top} \mathbf{b} &= \mathbf{0}\\ \mathbf{A}^{\top} \mathbf{A}\mathbf{x} &= \mathbf{A}^{\top} \mathbf{b} \\ \mathbf{x} &= ( \mathbf{A}^{\top} \mathbf{A})^{-1}\mathbf{A}^{\top} \mathbf{b} \end{aligned} \]

Thus \(\mathbf{x} = ( \mathbf{A}^{\top} \mathbf{A})^{-1}\mathbf{A}^{\top} \mathbf{b}\) is the solution of the linear least-squares problem above.
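The closed form above is easy to reproduce with numpy; the sketch below (mine, on random data) solves the normal equations \(\mathbf{A}^{\top}\mathbf{A}\mathbf{x} = \mathbf{A}^{\top}\mathbf{b}\) directly and compares the result against np.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 4))   # overdetermined system: 20 equations, 4 unknowns
b = rng.standard_normal(20)

# Normal equations obtained from setting dE/dx = 0: (A^T A) x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from numpy's least-squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))  # True
```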

References


  1. Matrix Differentiation, Lecture CS5240 Theoretical Foundations in Multimedia, https://www.comp.nus.edu.sg/~cs5240/lecture/matrix-diff.pdf
  2. Matrix Cookbook, http://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
  3. Matrix calculus, https://en.jinzhao.wiki/wiki/Matrix_calculus
  4. Vector/Matrix Calculus, additional notes on matrix differentiation.
  5. Matrix Differentiation (and some other stuff), Randal J. Barnes, Department of Civil Engineering, University of Minnesota.
  6. Matrix Calculus (an online matrix calculus calculator), http://www.matrixcalculus.org/