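Neural network Gaussian process

A neural network Gaussian process (NNGP) is a Gaussian process obtained as the limit of a certain type of sequence of artificial neural networks. Specifically, for a wide variety of network architectures, the distribution over functions computed by the network converges to a Gaussian process as the width of the architecture tends to infinity.^[1]^[2]^[3]^[4]^[5]^[6]^[7]^[8]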

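Background

A Bayesian network is a modeling tool that assigns probabilities to events and thereby quantifies the uncertainty in a model's predictions. Deep learning and artificial neural networks are the dominant approaches in machine learning for building computational models that learn from training examples. Bayesian neural networks merge the two: they are neural networks whose parameters and predictions are both probabilistic.^[9]^[10] Standard neural networks often assign high confidence even to incorrect predictions,^[11] whereas Bayesian neural networks can more accurately evaluate how likely their predictions are to be correct.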

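Figure. Left: a Bayesian neural network with two hidden layers, transforming a three-dimensional input (bottom) into a two-dimensional output $(y_{1},y_{2})$ (top). Right: the output probability density function $p(y_{1},y_{2})$ induced by the random weights of the network. Video: as the width of the network increases, the output distribution simplifies, ultimately converging to a multivariate normal distribution in the infinite-width limit.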

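Computation in an artificial neural network is usually organized into sequential layers of artificial neurons, and the number of neurons in a layer is called the layer width. Consider a sequence of Bayesian neural networks (see figure) in which the width of every layer keeps increasing: the distribution over functions computed by these networks eventually converges to a neural network Gaussian process. This infinite-width limit is of practical interest because wider networks usually perform better in practice.^[12]^[4]^[13] The limit also provides a closed-form way to evaluate network performance.

Besides arising as the limit of Bayesian neural networks, neural network Gaussian processes appear in several other contexts. An NNGP describes the distribution over functions computed by a wide non-Bayesian artificial neural network after its parameters are randomly initialized and before training. It appears as a key term in the prediction equations of the neural tangent kernel. It is used in deep information propagation to characterize whether hyperparameters and architectures are trainable.^[14] It is also related to other large-width limits of neural networks.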

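Scope

The first correspondence between neural networks and Gaussian processes goes back to a result in Radford M. Neal's 1995 PhD thesis,^[15] which was supervised by Geoffrey Hinton at the University of Toronto. Neal cites David J. C. MacKay, who worked on Bayesian learning, as his inspiration.

Today the correspondence has been proven for many architectures: single-hidden-layer Bayesian neural networks;^[15] deep fully connected networks as the layer widths tend to infinity;^[2]^[3] convolutional neural networks as the number of channels tends to infinity;^[4]^[5]^[6] Transformer networks as the number of attention heads tends to infinity;^[16] and recurrent networks as the number of units tends to infinity.^[8] In fact, the correspondence holds for almost any neural network architecture: as long as an architecture can be expressed solely through matrix multiplication and coordinatewise nonlinearities, it has an infinite-width Gaussian process limit.^[8] This covers all feedforward or recurrent neural networks composed of multilayer perceptrons, recurrent neural networks (such as LSTM and GRU), convolution (in any number of dimensions, or on graphs), pooling, skip connections, attention, batch normalization, and layer normalization.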

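Illustration

Each setting of a neural network's parameters $\theta$ corresponds to a specific function computed by the network. Specifying a prior distribution $p(\theta )$ over the parameters is therefore equivalent to specifying a prior distribution over the functions the network can compute. For many architectures, this distribution over functions converges to a Gaussian process as the width tends to infinity.

The figure illustrates this idea. For a neural network with one-dimensional output $z^{L}(\cdot ;\theta )$, the axes of the plot show the network's outputs for two different inputs $x$ and $x^{*}$. Each black dot is one random sample: a set of parameters is drawn from $p(\theta )$, and the pair of outputs $z^{L}(x;\theta )$ and $z^{L}(x^{*};\theta )$ is computed for the two inputs. The red lines depict the joint probability distribution of the output pair induced by $p(\theta )$. This is the distribution in function space that corresponds to the distribution $p(\theta )$ in parameter space. For an infinitely wide network, the distribution over functions is a Gaussian process, so the joint distribution of the outputs for any finite set of inputs is a multivariate Gaussian.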
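The sampling procedure described above can be sketched numerically. The following is a minimal illustration, not taken from the cited papers: it assumes a single hidden layer, a tanh nonlinearity, and arbitrary values for the standard deviations; the function names `sample_output_pairs` and `excess_kurtosis` are hypothetical. It draws many parameter settings from the prior and records the output pair for two inputs. The excess kurtosis (zero for a Gaussian) is used as a rough indicator of how close the marginal output distribution is to Gaussian at a given width.

```python
import numpy as np

def sample_output_pairs(x, x_star, width, num_draws,
                        sigma_w=1.5, sigma_b=0.1, rng=None):
    """Draw `num_draws` parameter settings from the prior of a
    one-hidden-layer tanh network and return the output pairs
    (z(x), z(x_star)) as an array of shape (num_draws, 2)."""
    rng = np.random.default_rng() if rng is None else rng
    inputs = np.stack([x, x_star])                  # (2, n0)
    n0 = inputs.shape[1]
    # Weight variance is inversely proportional to the input width of the layer.
    W0 = rng.normal(0.0, sigma_w / np.sqrt(n0), size=(num_draws, width, n0))
    b0 = rng.normal(0.0, sigma_b, size=(num_draws, width, 1))
    W1 = rng.normal(0.0, sigma_w / np.sqrt(width), size=(num_draws, 1, width))
    b1 = rng.normal(0.0, sigma_b, size=(num_draws, 1, 1))
    hidden = np.tanh(W0 @ inputs.T + b0)            # (num_draws, width, 2)
    out = W1 @ hidden + b1                          # (num_draws, 1, 2)
    return out[:, 0, :]

def excess_kurtosis(v):
    """Sample excess kurtosis; close to 0 for Gaussian data."""
    c = v - v.mean()
    return np.mean(c**4) / np.mean(c**2) ** 2 - 3.0

rng = np.random.default_rng(0)
x = np.array([1.0, 1.0, 1.0])
x_star = np.array([1.0, -1.0, 0.5])
narrow = sample_output_pairs(x, x_star, width=1, num_draws=5000, rng=rng)
wide = sample_output_pairs(x, x_star, width=256, num_draws=5000, rng=rng)
print("excess kurtosis, width 1:  ", excess_kurtosis(narrow[:, 0]))
print("excess kurtosis, width 256:", excess_kurtosis(wide[:, 0]))
```

Each row of the returned array corresponds to one black dot in the figure. Under these assumptions the width-1 network should show a clearly heavier-tailed output distribution than the width-256 network.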

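Discussion

Infinitely wide fully connected network

This section discusses the correspondence between infinitely wide neural networks and Gaussian processes for the specific case of a fully connected architecture. It gives a proof sketch that outlines why the correspondence holds and introduces the specific functional form of the NNGP for fully connected networks. The proof sketch closely follows the approach of Novak et al.^[4]

Network architecture

Consider a fully connected artificial neural network with input $x$, parameters $\theta$ consisting of the weights $W^{l}$ and biases $b^{l}$ of each layer $l$, pre-activations (before the nonlinearity) $z^{l}$ and activations (after the nonlinearity) $y^{l}$, pointwise nonlinearity $\phi (\cdot )$, and layer widths $n^{l}$. For simplicity, the width $n^{L+1}$ of the output vector $z^{L}$ is taken to be 1. The parameters of the network have a prior distribution $p(\theta )$ in which every weight and bias is drawn independently from an isotropic Gaussian, with the variance of the weights inversely proportional to the layer width. The network is illustrated in the figure to the right and described by the following set of equations: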

{\begin{aligned}x&\equiv {\text{input}}\\y^{l}(x)&=\left\{{\begin{array}{lcl}x&&l=0\\\phi \left(z^{l-1}(x)\right)&&l>0\end{array}}\right.\\z_{i}^{l}(x)&=\sum _{j}W_{ij}^{l}y_{j}^{l}(x)+b_{i}^{l}\\W_{ij}^{l}&\sim {\mathcal {N}}\left(0,{\frac {\sigma _{w}^{2}}{n^{l}}}\right)\\b_{i}^{l}&\sim {\mathcal {N}}\left(0,\sigma _{b}^{2}\right)\\\phi (\cdot )&\equiv {\text{nonlinearity}}\\y^{l}(x),z^{l-1}(x)&\in \mathbb {R} ^{n^{l}\times 1}\\n^{L+1}&=1\\\theta &=\left\{W^{0},b^{0},\dots ,W^{L},b^{L}\right\}\end{aligned}}
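The equations above translate almost line by line into code. The sketch below is an illustrative reading of them rather than a reference implementation: the function name `sample_network_output`, the default tanh nonlinearity, and the default values of `sigma_w` and `sigma_b` are assumptions made for the example. It draws one parameter setting $\theta$ from the prior and evaluates $z^{L}$ on a batch of inputs.

```python
import numpy as np

def sample_network_output(x, hidden_widths, sigma_w=1.5, sigma_b=0.1,
                          phi=np.tanh, rng=None):
    """Draw theta ~ p(theta) and return z^L(x) for each row of x.

    x:             array of shape (num_inputs, n^0)
    hidden_widths: [n^1, ..., n^L]; the output width n^{L+1} is fixed to 1
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(x, dtype=float)             # y^0 = x
    out_widths = list(hidden_widths) + [1]     # widths of z^0, ..., z^L
    for l, n_out in enumerate(out_widths):
        n_in = y.shape[1]                      # n^l, the width of y^l
        # W^l_ij ~ N(0, sigma_w^2 / n^l),  b^l_i ~ N(0, sigma_b^2)
        W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0.0, sigma_b, size=n_out)
        z = y @ W.T + b                        # z^l = W^l y^l + b^l
        if l < len(out_widths) - 1:
            y = phi(z)                         # y^{l+1} = phi(z^l)
    return z[:, 0]

# Usage: two three-dimensional inputs, two hidden layers of width 64.
inputs = np.array([[1.0, 0.0, -1.0],
                   [0.5, 0.5, 0.5]])
outputs = sample_network_output(inputs, hidden_widths=[64, 64],
                                rng=np.random.default_rng(0))
print(outputs.shape)
```

Calling the function repeatedly with fresh random draws produces samples from the prior over functions evaluated at the given inputs.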

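Gaussian process $z^{l}|y^{l}$

We first observe that the pre-activations $z^{l}$ are described by a Gaussian process conditioned on the preceding activations $y^{l}$. This result holds even at finite width. Each pre-activation $z_{i}^{l}$ is a weighted sum of Gaussian random variables, namely the weights $W_{ij}^{l}$ and the bias $b_{i}^{l}$, where the coefficient of each weight is the preceding activation $y_{j}^{l}$. Because it is a weighted sum of zero-mean Gaussians, $z_{i}^{l}$ is itself a zero-mean Gaussian once the coefficients $y_{j}^{l}$ are fixed. Since the $z^{l}$ are jointly Gaussian for any set of $y^{l}$, they are described by a Gaussian process conditioned on $y^{l}$. The covariance, or kernel, of this Gaussian process depends on the weight variance $\sigma _{w}^{2}$, the bias variance $\sigma _{b}^{2}$, and the second moment matrix $K^{l}$ of the activations $y^{l}$: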

{\begin{aligned}z_{i}^{l}\mid y^{l}&\sim {\mathcal {GP}}\left(0,\sigma _{w}^{2}K^{l}+\sigma _{b}^{2}\right)\\K^{l}(x,x')&={\frac {1}{n^{l}}}\sum _{i}y_{i}^{l}(x)y_{i}^{l}(x')\end{aligned}}

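The weight variance $\sigma _{w}^{2}$ rescales the contribution of $K^{l}$ to the covariance matrix. The bias is shared across all inputs, so $\sigma _{b}^{2}$ makes the $z_{i}^{l}$ for different data points more similar and makes the covariance matrix more like a constant matrix.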
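The finite-width statement above can be checked numerically. The following sketch holds an arbitrary activation matrix $y^{l}$ fixed, redraws the weights and bias of a single unit many times, and compares the empirical covariance of $z_{i}^{l}$ across two inputs with $\sigma _{w}^{2}K^{l}+\sigma _{b}^{2}$. The helper name `conditional_kernel`, the width of 8, and the parameter values are assumptions chosen for the example.

```python
import numpy as np

def conditional_kernel(y, sigma_w, sigma_b):
    """Covariance of z_i^l given activations y (shape: num_inputs x n^l)."""
    K = y @ y.T / y.shape[1]              # second moment matrix K^l
    return sigma_w**2 * K + sigma_b**2

rng = np.random.default_rng(0)
n, sigma_w, sigma_b = 8, 1.0, 0.5
y = rng.normal(size=(2, n))               # fixed activations for two inputs

# Redraw the weights and bias of one unit many times, keeping y fixed.
num_draws = 200_000
w = rng.normal(0.0, sigma_w / np.sqrt(n), size=(num_draws, n))
b = rng.normal(0.0, sigma_b, size=num_draws)
z = w @ y.T + b[:, None]                  # (num_draws, 2): z_i(x), z_i(x')

empirical = z.T @ z / num_draws           # zero-mean, so no centering needed
predicted = conditional_kernel(y, sigma_w, sigma_b)
print(np.round(empirical, 3))
print(np.round(predicted, 3))
```

The two printed matrices should agree up to Monte Carlo error. Note that increasing `sigma_b` adds the same constant to every entry, which is the sense in which the bias pushes the covariance matrix toward a constant matrix.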

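Gaussian process $z^{l}|K^{l}$

The pre-activations $z^{l}$ depend on $y^{l}$ only through its second moment matrix $K^{l}$. We can therefore say that $z^{l}$ is a Gaussian process conditioned on $K^{l}$, rather than on the full activation vector $y^{l}$: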

{\begin{aligned}z_{i}^{l}\mid K^{l}&\sim {\mathcal {GP}}\left(0,\sigma _{w}^{2}K^{l}+\sigma _{b}^{2}\right).\end{aligned}}

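Determinism of $K^{l}\mid K^{l-1}$ as the layer width tends to infinity

As defined earlier, $K^{l}$ is the second moment matrix of the activations $y^{l}$. Since $y^{l}$ is the result of applying the nonlinearity $\phi$ to the pre-activations $z^{l-1}$, it can be replaced by $\phi \left(z^{l-1}\right)$, which gives a modified expression for $K^{l}$ with $l>0$: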

{\begin{aligned}K^{l}(x,x')&={\frac {1}{n^{l}}}\sum _{i}\phi \left(z_{i}^{l-1}(x)\right)\phi \left(z_{i}^{l-1}(x')\right).\end{aligned}}

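We have already established that $z^{l-1}|K^{l-1}$ is a Gaussian process. This means that the sum defining $K^{l}$ is in fact an average over $n^{l}$ samples drawn from a Gaussian process that is a function of $K^{l-1}$: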

{\begin{aligned}\left\{z_{i}^{l-1}(x),z_{i}^{l-1}(x')\right\}&\sim {\mathcal {GP}}\left(0,\sigma _{w}^{2}K^{l-1}+\sigma _{b}^{2}\right).\end{aligned}}

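As the layer width $n^{l}$ tends to infinity, this average over $n^{l}$ samples from the Gaussian process converges to an integral over the Gaussian process: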

{\begin{aligned}\lim _{n^{l}\rightarrow \infty }K^{l}(x,x')&=\int dz\,dz'\,\phi (z)\,\phi (z')\,{\mathcal {N}}\left(\left[{\begin{array}{c}z\\z'\end{array}}\right];0,\sigma _{w}^{2}\left[{\begin{array}{cc}K^{l-1}(x,x)&K^{l-1}(x,x')\\K^{l-1}(x',x)&K^{l-1}(x',x')\end{array}}\right]+\sigma _{b}^{2}\right)\end{aligned}}

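Thus, in the infinite-width limit, the second moment matrix $K^{l}$ for each pair of inputs $x$ and $x'$ can be computed as an integral over a two-dimensional Gaussian. For many common choices of the activation function $\phi (\cdot )$, such as ReLU,^[17] ELU, GELU,^[18] or the error function,^[1] this integral has an analytic solution. Even when it cannot be solved analytically, it is only a two-dimensional integral and can usually be computed efficiently by numerical methods.^[2] Because the integral is deterministic, $K^{l}|K^{l-1}$ is deterministic as well.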
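As an illustration of such an analytic solution, the sketch below evaluates the integral for the ReLU nonlinearity using the well-known arc-cosine form, $\tfrac{1}{2\pi}\sqrt{\Sigma_{xx}\Sigma_{x'x'}}\left(\sin\theta+(\pi-\theta)\cos\theta\right)$ with $\cos\theta=\Sigma_{xx'}/\sqrt{\Sigma_{xx}\Sigma_{x'x'}}$, and compares it with a plain Monte Carlo estimate of the same expectation. Here $\Sigma$ stands for the $2\times 2$ covariance $\sigma _{w}^{2}K^{l-1}+\sigma _{b}^{2}$; the function names and the sample covariance values are hypothetical choices for the example.

```python
import numpy as np

def relu_second_moment(s_xx, s_xy, s_yy):
    """E[relu(z) relu(z')] for (z, z') ~ N(0, [[s_xx, s_xy], [s_xy, s_yy]]).

    Closed form (arc-cosine kernel of degree 1); requires s_xx, s_yy > 0.
    """
    norm = np.sqrt(s_xx * s_yy)
    cos_theta = np.clip(s_xy / norm, -1.0, 1.0)   # guard against rounding
    theta = np.arccos(cos_theta)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)

def monte_carlo_second_moment(s_xx, s_xy, s_yy, num_samples=400_000, seed=0):
    """Estimate the same expectation by sampling the 2D Gaussian."""
    rng = np.random.default_rng(seed)
    cov = np.array([[s_xx, s_xy], [s_xy, s_yy]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=num_samples)
    a = np.maximum(z, 0.0)                         # ReLU
    return float(np.mean(a[:, 0] * a[:, 1]))

exact = relu_second_moment(1.0, 0.3, 2.0)
approx = monte_carlo_second_moment(1.0, 0.3, 2.0)
print(exact, approx)
```

Two sanity checks follow directly from the formula: for perfectly correlated unit-variance inputs it gives $1/2$, the second moment of a rectified standard normal, and for uncorrelated unit-variance inputs it gives $1/(2\pi)$, the square of the mean of a rectified standard normal.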

For shorthand, we define a functional $F$ that represents the computation of this integral and maps the previous layer's $K^{l-1}$ to the current layer's $K^{l}$:

{\begin{aligned}\lim _{n^{l}\rightarrow \infty }K^{l}&=F\left(K^{l-1}\right).\end{aligned}}

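Neural network Gaussian process $z^{L}\mid x$

The previous section showed that $K^{l}\mid K^{l-1}$ is deterministic as $n^{l}\rightarrow \infty$. Applying this observation recursively, the last layer's $K^{L}$ can be written as a deterministic function of the input layer's $K^{0}$: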

{\begin{aligned}\lim _{\min \left(n^{1},\dots ,n^{L}\right)\rightarrow \infty }K^{L}&=F\circ F\cdots \left(K^{0}\right)=F^{L}\left(K^{0}\right),\end{aligned}}

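where $F^{L}$ denotes applying the functional $F$ $L$ times in succession. The input-layer second moment matrix $K^{0}(x,x')={\tfrac {1}{n^{0}}}\sum _{i}x_{i}x'_{i}$ is itself a deterministic function of the input $x$. Combining this with the fact that $z^{L}|K^{L}$ is a Gaussian process, the output of the neural network can be expressed as a Gaussian process in terms of its input: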

{\begin{aligned}z_{i}^{L}(x)&\sim {\mathcal {GP}}\left(0,\sigma _{w}^{2}F^{L}\left(K^{0}\right)+\sigma _{b}^{2}\right).\end{aligned}}
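The recursion can be written out compactly for a concrete nonlinearity. The sketch below assumes ReLU, for which the integral has the arc-cosine closed form, and uses hypothetical names (`relu_F`, `nngp_kernel`) and default parameter values chosen for illustration. It computes $\sigma _{w}^{2}F^{L}\left(K^{0}\right)+\sigma _{b}^{2}$ for a batch of inputs, assuming every input has a strictly positive pre-activation variance.

```python
import numpy as np

def relu_F(k_prev, sigma_w, sigma_b):
    """F for ReLU: map K^{l-1} to the infinite-width K^l (matrix form)."""
    sigma = sigma_w**2 * k_prev + sigma_b**2   # covariance of z^{l-1}
    d = np.sqrt(np.diag(sigma))                # per-input standard deviations
    norm = np.outer(d, d)
    cos_theta = np.clip(sigma / norm, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)

def nngp_kernel(x, depth, sigma_w=np.sqrt(2.0), sigma_b=0.0):
    """Covariance of z^L over the rows of x, with L = depth."""
    x = np.asarray(x, dtype=float)
    k = x @ x.T / x.shape[1]                   # K^0
    for _ in range(depth):
        k = relu_F(k, sigma_w, sigma_b)        # K^l = F(K^{l-1})
    return sigma_w**2 * k + sigma_b**2

# Usage: kernel matrix of a depth-3 ReLU NNGP on three inputs.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
print(np.round(nngp_kernel(X, depth=3), 4))
```

The resulting matrix can be used like any other Gaussian process covariance, for example in standard Gaussian process regression formulas. With `depth=0` the function reduces to the linear kernel $\sigma _{w}^{2}K^{0}+\sigma _{b}^{2}$.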

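Software libraries

Neural Tangents is a free and open-source Python library developed by Google for computing and doing inference with the neural network Gaussian process and neural tangent kernel corresponding to various common neural network architectures.^[19]

References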

[1] Williams, Christopher K. I. Computing with infinite networks. Neural Information Processing Systems. 1997.
[2] Lee, Jaehoon; Bahri, Yasaman; Novak, Roman; Schoenholz, Samuel S.; Pennington, Jeffrey; Sohl-Dickstein, Jascha. Deep Neural Networks as Gaussian Processes. International Conference on Learning Representations. 2017. Bibcode:2017arXiv171100165L. arXiv:1711.00165.
[3] G. de G. Matthews, Alexander; Rowland, Mark; Hron, Jiri; Turner, Richard E.; Ghahramani, Zoubin. Gaussian Process Behaviour in Wide Deep Neural Networks. International Conference on Learning Representations. 2017. Bibcode:2018arXiv180411271M. arXiv:1804.11271.
[4] Novak, Roman; Xiao, Lechao; Lee, Jaehoon; Bahri, Yasaman; Yang, Greg; Abolafia, Dan; Pennington, Jeffrey; Sohl-Dickstein, Jascha. Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes. International Conference on Learning Representations. 2018. Bibcode:2018arXiv181005148N. arXiv:1810.05148.
[5] Garriga-Alonso, Adrià; Aitchison, Laurence; Rasmussen, Carl Edward. Deep Convolutional Networks as shallow Gaussian Processes. International Conference on Learning Representations. 2018. Bibcode:2018arXiv180805587G. arXiv:1808.05587.
[6] Borovykh, Anastasia. A Gaussian Process perspective on Convolutional Neural Networks. 2018. arXiv:1810.10798 [stat.ML].
[7] Tsuchida, Russell; Pearce, Tim. Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks. 2020. arXiv:2002.08517 [cs.LG].
[8] Yang, Greg. Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes (PDF). Advances in Neural Information Processing Systems. 2019. Bibcode:2019arXiv191012478Y. arXiv:1910.12478.
[9] MacKay, David J. C. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation. 1992, 4 (3): 448–472. ISSN 0899-7667. S2CID 16543854. doi:10.1162/neco.1992.4.3.448.
[10] Neal, Radford M. Bayesian Learning for Neural Networks. Springer Science and Business Media. 2012.
[11] Guo, Chuan; Pleiss, Geoff; Sun, Yu; Weinberger, Kilian Q. On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning-Volume 70. 2017. arXiv:1706.04599.
[12] Novak, Roman; Bahri, Yasaman; Abolafia, Daniel A.; Pennington, Jeffrey; Sohl-Dickstein, Jascha. Sensitivity and Generalization in Neural Networks: an Empirical Study. International Conference on Learning Representations. 2018-02-15. Bibcode:2018arXiv180208760N. arXiv:1802.08760.
[13] Neyshabur, Behnam; Li, Zhiyuan; Bhojanapalli, Srinadh; LeCun, Yann; Srebro, Nathan. Towards understanding the role of over-parametrization in generalization of neural networks. International Conference on Learning Representations. 2019. Bibcode:2018arXiv180512076N. arXiv:1805.12076.
[14] Schoenholz, Samuel S.; Gilmer, Justin; Ganguli, Surya; Sohl-Dickstein, Jascha. Deep information propagation. International Conference on Learning Representations. 2016. arXiv:1611.01232.
[15] Neal, Radford M., Priors for Infinite Networks, Bayesian Learning for Neural Networks, Lecture Notes in Statistics 118, Springer New York: 29–53, 1996, ISBN 978-0-387-94724-2, doi:10.1007/978-1-4612-0745-0_2
[16] Hron, Jiri; Bahri, Yasaman; Sohl-Dickstein, Jascha; Novak, Roman. Infinite attention: NNGP and NTK for deep attention networks. International Conference on Machine Learning. 2020-06-18, 2020. Bibcode:2020arXiv200610540H. arXiv:2006.10540.
[17] Cho, Youngmin; Saul, Lawrence K. Kernel Methods for Deep Learning. Neural Information Processing Systems. 2009, 22: 342–350.
[18] Tsuchida, Russell; Pearce, Tim. Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks. 2020. arXiv:2002.08517 [cs.LG].
[19] Novak, Roman; Xiao, Lechao; Hron, Jiri; Lee, Jaehoon; Alemi, Alexander A.; Sohl-Dickstein, Jascha; Schoenholz, Samuel S., Neural Tangents: Fast and Easy Infinite Neural Networks in Python, International Conference on Learning Representations (ICLR), 2019-12-05, 2020, Bibcode:2019arXiv191202803N, arXiv:1912.02803
