A guide to convolution arithmetic for deep learning

Vincent Dumoulin, Francesco Visin
MILA, Université de Montréal
AIRLab, Politecnico di Milano

The Université de Montréal (UdeM) is a French-language public research university in Montreal, Quebec, Canada.
Université de Montréal (UdeM): University of Montreal
Montreal Institute for Learning Algorithms (MILA)
Politecnico di Milano (POLIMI): Polytechnic University of Milan
Artificial Intelligence and Robotics Lab (AIRLab)


All models are wrong, but some are useful. - George E. P. Box

Acknowledgements
The authors of this guide would like to thank David Warde-Farley, Guillaume Alain and Caglar Gulcehre for their valuable feedback. We are likewise grateful to all those who helped improve this tutorial with helpful comments, constructive criticisms and code contributions. Keep them coming!
Special thanks to Ethan Schoonover, creator of the Solarized color scheme,1 whose colors were used for the figures.

Feedback
Your feedback is welcome! We did our best to be as precise, informative and to the point as possible, but should there be anything you feel might be an error or could be rephrased to be more precise or comprehensible, please do not refrain from contacting us. Likewise, drop us a line if you think there is something that might fit this technical report and you would like us to discuss - we will make our best effort to update this document.

refrain [rɪ'freɪn]: vi. to hold back, to abstain; n. a repeated line in a song or poem
drop us a line: write to us, get in touch with us


Source code and animations
The code used to generate this guide along with its figures is available on GitHub.2 There the reader can also find an animated version of the figures.

Contents

1 Introduction
1.1 Discrete convolutions
1.2 Pooling
2 Convolution arithmetic
2.1 No zero padding, unit strides
2.2 Zero padding, unit strides
2.3 No zero padding, non-unit strides
2.4 Zero padding, non-unit strides
3 Pooling arithmetic
4 Transposed convolution arithmetic
4.1 Convolution as a matrix operation
4.2 Transposed convolution
4.3 No zero padding, unit strides, transposed
4.4 Zero padding, unit strides, transposed
4.5 No zero padding, non-unit strides, transposed
4.6 Zero padding, non-unit strides, transposed
5 Miscellaneous convolutions
5.1 Dilated convolutions

zero padding: padding the input with a border of zeros
dilated convolution: also known as atrous convolution
cross-correlation: the sliding dot-product operation that most "convolution" layers actually compute
deconvolution: the mathematical inverse of convolution
A transposed 2-D convolution layer upsamples feature maps. This layer is sometimes incorrectly known as a deconvolution or deconv layer. This layer is the transpose of convolution and does not perform deconvolution.
transposed convolution
transpose [træns'pəʊz]: vt. to exchange, to reorder; n. the transpose (of a matrix)


Chapter 1 Introduction

Deep convolutional neural networks (CNNs) have been at the heart of spectacular advances in deep learning. Although CNNs have been used as early as the nineties to solve character recognition tasks (Le Cun et al., 1997), their current widespread application is due to much more recent work, when a deep CNN was used to beat the state of the art in the ImageNet image classification challenge (Krizhevsky et al., 2012).

at the heart of: at the center of, central to


Convolutional neural networks therefore constitute a very useful tool for machine learning practitioners. However, learning to use CNNs for the first time is generally an intimidating experience. A convolutional layer’s output shape is affected by the shape of its input as well as the choice of kernel shape, zero padding and strides, and the relationship between these properties is not trivial to infer. This contrasts with fully-connected layers, whose output size is independent of the input size. Additionally, CNNs also usually feature a pooling stage, adding yet another level of complexity with respect to fully-connected networks. Finally, so-called transposed convolutional layers (also known as fractionally strided convolutional layers) have been employed in more and more work as of late (Zeiler et al., 2011; Zeiler and Fergus, 2014; Long et al., 2015; Radford et al., 2015; Visin et al., 2015; Im et al., 2016), and their relationship with convolutional layers has been explained with various degrees of clarity.

practitioner [præk'tɪʃ(ə)nə]: n. a person who practices a profession or activity
clarity ['klærɪtɪ]: n. clearness, lucidity


This guide’s objective is twofold:

1. Explain the relationship between convolutional layers and transposed convolutional layers.
2. Provide an intuitive understanding of the relationship between input shape, kernel shape, zero padding, strides and output shape in convolutional, pooling and transposed convolutional layers.
twofold ['tuːfəʊld]: adj. double, having two parts; adv. doubly


In order to remain broadly applicable, the results shown in this guide are independent of implementation details and apply to all commonly used machine learning frameworks, such as Theano (Bergstra et al., 2010; Bastien et al., 2012), Torch (Collobert et al., 2011), Tensorflow (Abadi et al., 2015) and Caffe (Jia et al., 2014).

This chapter briefly reviews the main building blocks of CNNs, namely discrete convolutions and pooling. For an in-depth treatment of the subject, see Chapter 9 of the Deep Learning textbook (Goodfellow et al., 2016).

in-depth [ɪn depθ]: adj. thorough, comprehensive


1.1 Discrete convolutions

The bread and butter of neural networks is affine transformations: a vector is received as input and is multiplied with a matrix to produce an output (to which a bias vector is usually added before passing the result through a nonlinearity). This is applicable to any type of input, be it an image, a sound clip or an unordered collection of features: whatever their dimensionality, their representation can always be flattened into a vector before the transformation.

bread and butter: one's basic means of living; (figuratively) the fundamental part of something
affine transformation: a linear transformation followed by a translation


Images, sound clips and many other similar kinds of data have an intrinsic structure. More formally, they share these important properties:

• They are stored as multi-dimensional arrays.
• They feature one or more axes for which ordering matters (e.g., width and height axes for an image, time axis for a sound clip).
• One axis, called the channel axis, is used to access different views of the data (e.g., the red, green and blue channels of a color image, or the left and right channels of a stereo audio track).
stereo [ˈsterɪəʊ]: n. stereo sound, a stereo system; adj. stereophonic
audio track: a single channel of recorded sound


These properties are not exploited when an affine transformation is applied; in fact, all the axes are treated in the same way and the topological information is not taken into account. Still, taking advantage of the implicit structure of the data may prove very handy in solving some tasks, like computer vision and speech recognition, and in these cases it would be best to preserve it. This is where discrete convolutions come into play.

topological [,tɒpə'lɒdʒɪkl]: adj. relating to topology, the study of spatial structure


A discrete convolution is a linear transformation that preserves this notion of ordering. It is sparse (only a few input units contribute to a given output unit) and reuses parameters (the same weights are applied to multiple locations in the input).

Figure 1.1 provides an example of a discrete convolution. The light blue grid is called the input feature map. To keep the drawing simple, a single input feature map is represented, but it is not uncommon to have multiple feature maps stacked one onto another.1 A kernel (shaded area) slides across the input feature map. At each location, the product between each element of the kernel and the input element it overlaps is computed and the results are summed up to obtain the output in the current location. The procedure can be repeated using different kernels to form as many output feature maps as desired (Figure 1.3). The final outputs of this procedure are called output feature maps.2 If there are multiple input feature maps, the kernel will have to be 3-dimensional - or, equivalently, each one of the feature maps will be convolved with a distinct kernel - and the resulting feature maps will be summed up elementwise to produce the output feature map.

1An example of this is what was referred to earlier as channels for images and sound clips.
2While there is a distinction between convolution and cross-correlation from a signal processing perspective, the two become interchangeable when the kernel is learned. For the sake of simplicity and to stay consistent with most of the machine learning literature, the term convolution will be used in this guide.

for the sake of: for the purpose or benefit of


Figure 1.1: Computing the output values of a discrete convolution.
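The sliding-window computation described above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not code from the guide itself; the function name and signature are our own, and a square input and kernel are assumed:

```python
import numpy as np

def conv2d(inp, kernel, stride=1, pad=0):
    """Naive 2-D convolution (cross-correlation): at each kernel
    placement, multiply elementwise with the input patch and sum."""
    inp = np.pad(inp, pad)                # zero padding p on every side
    k = kernel.shape[0]                   # square k x k kernel assumed
    o = (inp.shape[0] - k) // stride + 1  # output size along each axis
    out = np.empty((o, o))
    for r in range(o):
        for c in range(o):
            patch = inp[r * stride:r * stride + k, c * stride:c * stride + k]
            out[r, c] = np.sum(patch * kernel)
    return out
```

Repeating the loop with a different kernel would produce an additional output feature map, as in Figure 1.3.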

Figure 1.2: Computing the output values of a discrete convolution for N = 2, i_1 = i_2 = 5, k_1 = k_2 = 3, s_1 = s_2 = 2, and p_1 = p_2 = 1.

The convolution depicted in Figure 1.1 is an instance of a 2-D convolution, but it can be generalized to N-D convolutions. For instance, in a 3-D convolution, the kernel would be a cuboid and would slide across the height, width and depth of the input feature map.

cuboid ['kjuːbɒɪd]: adj. cube-shaped; n. a rectangular solid


The collection of kernels defining a discrete convolution has a shape corresponding to some permutation of (n, m, k_1, …, k_N), where
n ≡ number of output feature maps,
m ≡ number of input feature maps,
k_j ≡ kernel size along axis j.

The following properties affect the output size o_j of a convolutional layer along axis j:

• i_j: input size along axis j,
• k_j: kernel size along axis j,
• s_j: stride (distance between two consecutive positions of the kernel) along axis j,
• p_j: zero padding (number of zeros concatenated at the beginning and at the end of an axis) along axis j.
permutation [pɜːmjʊ'teɪʃ(ə)n]: n. an ordered arrangement, a rearrangement


For instance, Figure 1.2 shows a 3 × 3 kernel applied to a 5 × 5 input padded with a 1 × 1 border of zeros using 2 × 2 strides.

Note that strides constitute a form of subsampling. As an alternative to being interpreted as a measure of how much the kernel is translated, strides can also be viewed as how much of the output is retained. For instance, moving the kernel by hops of two is equivalent to moving the kernel by hops of one but retaining only odd output elements (Figure 1.4).

retain [rɪ'teɪn]: vt. to keep, to hold on to, to remember
hop [hɒp]: v. to jump, to make short jumps; n. a short jump


Figure 1.3: A convolution mapping from two input feature maps to three output feature maps using a 3 × 2 × 3 × 3 collection of kernels w. In the left pathway, input feature map 1 is convolved with kernel w_{1,1} and input feature map 2 is convolved with kernel w_{1,2}, and the results are summed together elementwise to form the first output feature map. The same is repeated for the middle and right pathways to form the second and third feature maps, and all three output feature maps are grouped together to form the output.

Figure 1.4: An alternative way of viewing strides. Instead of translating the 3 × 3 kernel by increments of s = 2 (left), the kernel is translated by increments of 1 and only one in s = 2 output elements is retained (right).
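The equivalence illustrated in Figure 1.4 can be checked numerically with a small 1-D sketch (the input and kernel values below are chosen arbitrarily for illustration):

```python
import numpy as np

x = np.arange(7.0)             # 1-D input, i = 7
w = np.array([1.0, 2.0, 1.0])  # 1-D kernel, k = 3

# Stride 1: every possible placement of the kernel.
full = np.array([np.dot(x[j:j + 3], w) for j in range(len(x) - 2)])
# Stride 2: the kernel hops by two positions at a time.
strided = np.array([np.dot(x[j:j + 3], w) for j in range(0, len(x) - 2, 2)])

# Moving by hops of two == moving by hops of one and keeping every other output.
assert np.allclose(strided, full[::2])
```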

1.2 Pooling

In addition to discrete convolutions themselves, pooling operations make up another important building block in CNNs. Pooling operations reduce the size of feature maps by using some function to summarize subregions, such as taking the average or the maximum value.

Pooling works by sliding a window across the input and feeding the content of the window to a pooling function. In some sense, pooling works very much like a discrete convolution, but replaces the linear combination described by the kernel with some other function. Figure 1.5 provides an example for average pooling, and Figure 1.6 does the same for max pooling.

The following properties affect the output size o_j of a pooling layer along axis j:
i_j: input size along axis j,
k_j: pooling window size along axis j,
s_j: stride (distance between two consecutive positions of the pooling window) along axis j.

Figure 1.5: Computing the output values of a 3 × 3 average pooling operation on a 5 × 5 input using 1 × 1 strides.

Figure 1.6: Computing the output values of a 3 × 3 max pooling operation on a 5 × 5 input using 1 × 1 strides.
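The window mechanics of Figures 1.5 and 1.6 can be sketched the same way as a convolution, swapping the weighted sum for an arbitrary summary function. The function name and signature below are our own, and a square input and window are assumed:

```python
import numpy as np

def pool2d(inp, k, stride=1, func=np.max):
    """Slide a k x k window over the input and summarize each patch
    with `func` (np.max -> max pooling, np.mean -> average pooling)."""
    o = (inp.shape[0] - k) // stride + 1  # output size along each axis
    out = np.empty((o, o))
    for r in range(o):
        for c in range(o):
            out[r, c] = func(inp[r * stride:r * stride + k,
                                 c * stride:c * stride + k])
    return out
```

With a 5 × 5 input, k = 3 and unit strides, this reproduces the 3 × 3 outputs of Figures 1.5 and 1.6.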

Chapter 2 Convolution arithmetic

The analysis of the relationship between convolutional layer properties is eased by the fact that they don’t interact across axes, i.e., the choice of kernel size, stride and zero padding along axis j only affects the output size of axis j. Because of that, this chapter will focus on the following simplified setting:

• 2-D discrete convolutions (N = 2),
• square inputs (i_1 = i_2 = i),
• square kernel size (k_1 = k_2 = k),
• same strides along both axes (s_1 = s_2 = s),
• same zero padding along both axes (p_1 = p_2 = p).

This facilitates the analysis and the visualization, but keep in mind that the results outlined here also generalize to the N-D and non-square cases.

facilitate [fə'sɪlɪteɪt]: vt. to make easier, to help bring about
outline ['aʊtlaɪn]: n. a summary, a sketch; vt. to summarize, to sketch


2.1 No zero padding, unit strides

The simplest case to analyze is when the kernel just slides across every position of the input (i.e., s = 1 and p = 0). Figure 2.1 provides an example for i = 4 and k = 3.

Figure 2.1: (No padding, unit strides) Convolving a 3 × 3 kernel over a 4 × 4 input using unit strides (i.e., i = 4, k = 3, s = 1 and p = 0).

One way of defining the output size in this case is by the number of possible placements of the kernel on the input. Let’s consider the width axis: the kernel starts on the leftmost part of the input feature map and slides by steps of one until it touches the right side of the input. The size of the output will be equal to the number of steps made, plus one, accounting for the initial position of the kernel (Figure 2.8a). The same logic applies for the height axis.

More formally, the following relationship can be inferred:
Relationship 1. For any i and k, and for s = 1 and p = 0,
o = (i − k) + 1.
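Relationship 1 can be sanity-checked in one dimension with NumPy's 'valid' convolution mode, whose output length is exactly (i − k) + 1:

```python
import numpy as np

i, k = 4, 3
out = np.convolve(np.ones(i), np.ones(k), mode='valid')
assert len(out) == (i - k) + 1  # o = 2, as in Figure 2.1
```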

2.2 Zero padding, unit strides

To factor in zero padding (i.e., only restricting to s = 1), let's consider its effect on the effective input size: padding with p zeros changes the effective input size from i to i + 2p. In the general case, Relationship 1 can then be used to infer the following relationship:
Relationship 2. For any i, k and p, and for s = 1,
o = (i − k) + 2p + 1.

Figure 2.2 provides an example for i = 5, k = 4 and p = 2.
In practice, two specific instances of zero padding are used quite extensively because of their respective properties. Let’s discuss them in more detail.

Figure 2.2: (Arbitrary padding, unit strides) Convolving a 4 × 4 kernel over a 5 × 5 input padded with a 2 × 2 border of zeros using unit strides (i.e., i = 5, k = 4, s = 1 and p = 2).

Having the output size be the same as the input size (i.e., o = i) can be a desirable property:
Relationship 3. For any i and for k odd (k = 2n + 1, n ∈ ℕ), s = 1 and p = ⌊k/2⌋ = n,
o = (i − k) + 2⌊k/2⌋ + 1
  = i + 2⌊k/2⌋ − (k − 1)
  = i + 2n − 2n
  = i.

This is sometimes referred to as half (or same) padding. Figure 2.3 provides an example for i = 5, k = 3 and (therefore) p = 1.

Figure 2.3: (Half padding, unit strides) Convolving a 3 × 3 kernel over a 5 × 5 input using half padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 1).

While convolving a kernel generally decreases the output size with respect to the input size, sometimes the opposite is required. This can be achieved with proper zero padding:
Relationship 4. For any i and k, and for p = k − 1 and s = 1,
o = (i − k) + 2(k − 1) + 1
  = i + 2(k − 1) − (k − 1)
  = i + (k − 1).

This is sometimes referred to as full padding, because in this setting every possible partial or complete superimposition of the kernel on the input feature map is taken into account. Figure 2.4 provides an example for i = 5, k = 3 and (therefore) p = 2.

Figure 2.4: (Full padding, unit strides) Convolving a 3 × 3 kernel over a 5 × 5 input using full padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 2).
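Both special cases follow from Relationship 2; a quick sketch (function name ours) checks them over a range of sizes:

```python
def output_size(i, k, p):
    """Relationship 2 (s = 1): o = (i - k) + 2p + 1."""
    return (i - k) + 2 * p + 1

for k in (1, 3, 5, 7):               # odd kernel sizes
    for i in range(k, 12):
        assert output_size(i, k, p=k // 2) == i            # half padding: o = i
        assert output_size(i, k, p=k - 1) == i + (k - 1)   # full padding
```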

superimposition [,sʊpɚ,ɪmpə'zɪʃn]: n. the placing of one thing over another, overlap


2.3 No zero padding, non-unit strides

All relationships derived so far only apply for unit-strided convolutions. Incorporating non-unitary strides requires another inference leap. To facilitate the analysis, let's momentarily ignore zero padding (i.e., s > 1 and p = 0). Figure 2.5 provides an example for i = 5, k = 3 and s = 2.

momentarily ['məʊm(ə)nt(ə)rɪlɪ]: adv. for a moment, temporarily
leap [liːp]: v. to jump, to jump over; n. a jump, a big step forward


Figure 2.5: (No zero padding, arbitrary strides) Convolving a 3 × 3 kernel over a 5 × 5 input using 2 × 2 strides (i.e., i = 5, k = 3, s = 2 and p = 0).

Once again, the output size can be defined in terms of the number of possible placements of the kernel on the input. Let’s consider the width axis: the kernel starts as usual on the leftmost part of the input, but this time it slides by steps of size s s until it touches the right side of the input. The size of the output is again equal to the number of steps made, plus one, accounting for the initial position of the kernel (Figure 2.8b). The same logic applies for the height axis.

From this, the following relationship can be inferred:
Relationship 5. For any i, k and s, and for p = 0,
o = ⌊(i − k) / s⌋ + 1.

The floor function accounts for the fact that sometimes the last possible step does not coincide with the kernel reaching the end of the input, i.e., some input units are left out (see Figure 2.7 for an example of such a case).

left out: excluded, ignored, not taken into account


2.4 Zero padding, non-unit strides

The most general case (convolving over a zero padded input using non-unit strides) can be derived by applying Relationship 5 on an effective input of size i + 2p, in analogy to what was done for Relationship 2:

Relationship 6. For any i, k, p and s,
o = ⌊(i + 2p − k) / s⌋ + 1.

As before, the floor function means that in some cases a convolution will produce the same output size for multiple input sizes. More specifically, if i + 2p − k is a multiple of s, then any input size j = i + a, a ∈ {0, …, s − 1} will produce the same output size. Note that this ambiguity applies only for s > 1.

Figure 2.6 shows an example with i = 5, k = 3, s = 2 and p = 1, while Figure 2.7 provides an example for i = 6, k = 3, s = 2 and p = 1. Interestingly, despite having different input sizes these convolutions share the same output size. While this doesn't affect the analysis for convolutions, this will complicate the analysis in the case of transposed convolutions.
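Relationship 6 subsumes Relationships 1, 2 and 5, so a one-line sketch (function name ours) reproduces every figure in this chapter, including the ambiguity just mentioned:

```python
def conv_output_size(i, k, s=1, p=0):
    """Relationship 6: o = floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

# Figures 2.1-2.5 revisited.
assert conv_output_size(4, 3) == 2          # no padding, unit strides
assert conv_output_size(5, 4, p=2) == 6     # arbitrary padding
assert conv_output_size(5, 3, p=1) == 5     # half padding
assert conv_output_size(5, 3, p=2) == 7     # full padding
assert conv_output_size(5, 3, s=2) == 2     # no padding, non-unit strides
# Figures 2.6 and 2.7: different input sizes, same output size.
assert conv_output_size(5, 3, s=2, p=1) == conv_output_size(6, 3, s=2, p=1) == 3
```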

complicate ['kɒmplɪkeɪt]: vt. to make more difficult or complex


Figure 2.6: (Arbitrary padding and strides) Convolving a 3 × 3 kernel over a 5 × 5 input padded with a 1 × 1 border of zeros using 2 × 2 strides (i.e., i = 5, k = 3, s = 2 and p = 1).

Figure 2.7: (Arbitrary padding and strides) Convolving a 3 × 3 kernel over a 6 × 6 input padded with a 1 × 1 border of zeros using 2 × 2 strides (i.e., i = 6, k = 3, s = 2 and p = 1). In this case, the bottom row and right column of the zero padded input are not covered by the kernel.

Figure 2.8: Counting kernel positions.

Chapter 3 Pooling arithmetic

In a neural network, pooling layers provide invariance to small translations of the input. The most common kind of pooling is max pooling, which consists in splitting the input in (usually non-overlapping) patches and outputting the maximum value of each patch. Other kinds of pooling exist, e.g., mean or average pooling, which all share the same idea of aggregating the input locally by applying a non-linearity to the content of some patches (Boureau et al., 2010a,b, 2011; Saxe et al., 2011).

Some readers may have noticed that the treatment of convolution arithmetic only relies on the assumption that some function is repeatedly applied onto subsets of the input. This means that the relationships derived in the previous chapter can be reused in the case of pooling arithmetic. Since pooling does not involve zero padding, the relationship describing the general case is as follows:
Relationship 7. For any i, k and s,
o = ⌊(i − k) / s⌋ + 1.

This relationship holds for any type of pooling.
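Since Relationship 7 is just Relationship 5 applied to the pooling window, a minimal check (function name ours) suffices:

```python
def pool_output_size(i, k, s):
    """Relationship 7: o = floor((i - k) / s) + 1, for any pooling type."""
    return (i - k) // s + 1

assert pool_output_size(5, 3, 1) == 3  # Figures 1.5 and 1.6
assert pool_output_size(4, 2, 2) == 2  # non-overlapping 2x2 pooling
```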

invariance [ɪn'vɛrɪəns]: n. the property of remaining unchanged; an invariant
aggregate ['ægrɪgət]: v. to gather, to combine into a whole; n. a total, a collection; adj. combined, total

