.. _sec_padding:
Padding and Stride
==================
Recall the example of a convolution in :numref:`fig_correlation`. The
input had both a height and width of 3 and the convolution kernel had
both a height and width of 2, yielding an output representation with
dimension :math:`2\times2`. Assuming that the input shape is
:math:`n_\textrm{h}\times n_\textrm{w}` and the convolution kernel shape
is :math:`k_\textrm{h}\times k_\textrm{w}`, the output shape will be
:math:`(n_\textrm{h}-k_\textrm{h}+1) \times (n_\textrm{w}-k_\textrm{w}+1)`:
we can only shift the convolution kernel so far until it runs out of
pixels to apply the convolution to.
In the following we will explore a number of techniques, including
padding and strided convolutions, that offer more control over the size
of the output. As motivation, note that since kernels generally have
width and height greater than :math:`1`, after applying many successive
convolutions, we tend to wind up with outputs that are considerably
smaller than our input. If we start with a :math:`240 \times 240` pixel
image, ten layers of :math:`5 \times 5` convolutions reduce the image to
:math:`200 \times 200` pixels, slicing off :math:`30 \%` of the image
and with it obliterating any interesting information on the boundaries
of the original image. *Padding* is the most popular tool for handling
this issue. In other cases, we may want to reduce the dimensionality
drastically, e.g., if we find the original input resolution to be
unwieldy. *Strided convolutions* are a popular technique that can help
in these instances.
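
The arithmetic in the motivating example can be checked with a minimal sketch in plain Python. The helper name ``conv_output_size`` is introduced here for illustration only and is not part of any library.

.. code:: python

   def conv_output_size(n, k):
       """Output length along one axis for input length n and kernel
       length k, with no padding and a stride of 1."""
       return n - k + 1

   size = 240
   for _ in range(10):  # ten successive 5 x 5 convolutions
       size = conv_output_size(size, 5)

   # Fraction of the original pixel area that is lost
   lost_fraction = 1 - size ** 2 / 240 ** 2
   print(size, round(lost_fraction, 3))  # 200 0.306
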
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import torch
from torch import nn
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import jax
from flax import linen as nn
from jax import numpy as jnp
from d2l import jax as d2l
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import tensorflow as tf
Padding
-------
As described above, one tricky issue when applying convolutional layers
is that we tend to lose pixels on the perimeter of our image. Consider
:numref:`img_conv_reuse` that depicts the pixel utilization as a
function of the convolution kernel size and the position within the
image. The pixels in the corners are hardly used at all.
.. _img_conv_reuse:
.. figure:: ../img/conv-reuse.svg
Pixel utilization for convolutions of size :math:`1 \times 1`,
:math:`2 \times 2`, and :math:`3 \times 3` respectively.
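
To make the notion of pixel utilization concrete, the following sketch counts by brute force how many kernel windows cover each pixel. It assumes a square input, a square kernel, a stride of 1, and no padding. The name ``pixel_usage`` is hypothetical and used only for this illustration.

.. code:: python

   def pixel_usage(n, k):
       """Count how many k x k windows cover each pixel of an n x n input."""
       counts = [[0] * n for _ in range(n)]
       # (top, left) is the upper-left corner of each valid window position
       for top in range(n - k + 1):
           for left in range(n - k + 1):
               for i in range(top, top + k):
                   for j in range(left, left + k):
                       counts[i][j] += 1
       return counts

   for row in pixel_usage(5, 3):
       print(row)

For a :math:`5 \times 5` input and a :math:`3 \times 3` kernel, each corner pixel is covered by a single window while the center pixel is covered by nine.
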
Since we typically use small kernels, for any given convolution we might
only lose a few pixels but this can add up as we apply many successive
convolutional layers. One straightforward solution to this problem is to
add extra pixels of filler around the boundary of our input image, thus
increasing the effective size of the image. Typically, we set the values
of the extra pixels to zero. In :numref:`img_conv_pad`, we pad a
:math:`3 \times 3` input, increasing its size to :math:`5 \times 5`. The
corresponding output then increases to a :math:`4 \times 4` matrix. The
shaded portions are the first output element as well as the input and
kernel tensor elements used for the output computation:
:math:`0\times0+0\times1+0\times2+0\times3=0`.
.. _img_conv_pad:
.. figure:: ../img/conv-pad.svg
Two-dimensional cross-correlation with padding.
In general, if we add a total of :math:`p_\textrm{h}` rows of padding
(roughly half on top and half on bottom) and a total of
:math:`p_\textrm{w}` columns of padding (roughly half on the left and
half on the right), the output shape will be
.. math:: (n_\textrm{h}-k_\textrm{h}+p_\textrm{h}+1)\times(n_\textrm{w}-k_\textrm{w}+p_\textrm{w}+1).
This means that the height and width of the output will increase by
:math:`p_\textrm{h}` and :math:`p_\textrm{w}`, respectively.
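
The formula above can be written as a small helper. This is only a sketch: ``padded_output_shape`` is a name made up for this example, and ``p_h`` and ``p_w`` are the *total* numbers of padding rows and columns, as in the text.

.. code:: python

   def padded_output_shape(n_h, n_w, k_h, k_w, p_h=0, p_w=0):
       """Output shape for stride 1, given total padding p_h and p_w."""
       return (n_h - k_h + p_h + 1, n_w - k_w + p_w + 1)

   # The 3 x 3 input and 2 x 2 kernel from the figure, without and with
   # one row/column of padding on every side (2 in total per axis)
   print(padded_output_shape(3, 3, 2, 2))        # (2, 2)
   print(padded_output_shape(3, 3, 2, 2, 2, 2))  # (4, 4)
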
In many cases, we will want to set :math:`p_\textrm{h}=k_\textrm{h}-1`
and :math:`p_\textrm{w}=k_\textrm{w}-1` to give the input and output the
same height and width. This will make it easier to predict the output
shape of each layer when constructing the network. Assuming that
:math:`k_\textrm{h}` is odd here, we will pad :math:`p_\textrm{h}/2`
rows on both sides of the height. If :math:`k_\textrm{h}` is even, one
possibility is to pad :math:`\lceil p_\textrm{h}/2\rceil` rows on the
top of the input and :math:`\lfloor p_\textrm{h}/2\rfloor` rows on the
bottom. We will pad both sides of the width in the same way.
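
As an illustration of this rule, the sketch below splits the total padding :math:`k-1` along one axis into the two sides. The function name ``split_same_padding`` is an assumption of this example, and putting the larger share first (top or left) is just the one possibility mentioned above; libraries may split differently.

.. code:: python

   import math

   def split_same_padding(k):
       """Return (before, after) padding along one axis so that a stride-1
       convolution with kernel length k preserves the input length."""
       p = k - 1  # total padding needed along this axis
       return math.ceil(p / 2), math.floor(p / 2)

   for k in (3, 5, 2, 4):
       print(k, split_same_padding(k))

Odd kernel sizes give equal padding on both sides, while even sizes leave one extra row or column on one side.
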
CNNs commonly use convolution kernels with odd height and width values,
such as 1, 3, 5, or 7. Choosing odd kernel sizes has the benefit that we
can preserve the dimensionality while padding with the same number of
rows on top and bottom, and the same number of columns on left and
right.
Moreover, this practice of using odd kernels and padding to precisely
preserve dimensionality offers a clerical benefit. For any
two-dimensional tensor ``X``, when the kernel’s size is odd and the same
number of padding rows and columns is added on every side, so that the
output has the same height and width as the input, we know that the
output ``Y[i, j]`` is calculated by cross-correlation of the input and
convolution kernel with the window centered on ``X[i, j]``.
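
One way to see this centering property is with a kernel whose only nonzero entry is its center: with one pixel of zero padding on every side, the output should reproduce the input exactly. The following plain-Python sketch assumes a stride of 1; ``corr2d_padded`` is a hypothetical helper written for this illustration, not a library function.

.. code:: python

   def corr2d_padded(X, K, pad):
       """Cross-correlate X with K after adding pad zeros on every side."""
       n_h, n_w = len(X), len(X[0])
       k_h, k_w = len(K), len(K[0])
       # Build the zero-padded input
       P = [[0.0] * (n_w + 2 * pad) for _ in range(n_h + 2 * pad)]
       for i in range(n_h):
           for j in range(n_w):
               P[i + pad][j + pad] = X[i][j]
       out_h, out_w = len(P) - k_h + 1, len(P[0]) - k_w + 1
       return [[sum(P[i + a][j + b] * K[a][b]
                    for a in range(k_h) for b in range(k_w))
                for j in range(out_w)] for i in range(out_h)]

   X = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
   K = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # picks out the window center
   Y = corr2d_padded(X, K, pad=1)
   print(Y == X)  # True
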
In the following example, we create a two-dimensional convolutional
layer with a height and width of 3 and apply 1 pixel of padding on all
sides. Given an input with a height and width of 8, we find that the
height and width of the output is also 8.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# We define a helper function to calculate convolutions. It initializes the
# convolutional layer weights and performs corresponding dimensionality
# elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
    # (1, 1) indicates that batch size and the number of channels are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Strip the first two dimensions: examples and channels
    return Y.reshape(Y.shape[2:])

# 1 row and column is padded on either side, so a total of 2 rows or columns
# are added
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
torch.Size([8, 8])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# We define a helper function to calculate convolutions. It initializes
# the convolutional layer weights and performs corresponding dimensionality
# elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
    conv2d.initialize()
    # (1, 1) indicates that batch size and the number of channels are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Strip the first two dimensions: examples and channels
    return Y.reshape(Y.shape[2:])

# 1 row and column is padded on either side, so a total of 2 rows or columns
# are added
conv2d = nn.Conv2D(1, kernel_size=3, padding=1)
X = np.random.uniform(size=(8, 8))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
[22:06:32] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for CPU
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(8, 8)
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# We define a helper function to calculate convolutions. It initializes
# the convolutional layer weights and performs corresponding dimensionality
# elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
    key = jax.random.PRNGKey(d2l.get_seed())
    # Add a leading batch dimension and a trailing channel dimension, both 1
    X = X.reshape((1,) + X.shape + (1,))
    Y, _ = conv2d.init_with_output(key, X)
    # Strip the batch (first) and channel (last) dimensions
    return Y.reshape(Y.shape[1:3])

# 'SAME' pads 1 row and column on either side here, so a total of 2 rows or
# columns are added
conv2d = nn.Conv(1, kernel_size=(3, 3), padding='SAME')
X = jax.random.uniform(jax.random.PRNGKey(d2l.get_seed()), shape=(8, 8))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(8, 8)
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# We define a helper function to calculate convolutions. It initializes
# the convolutional layer weights and performs corresponding dimensionality
# elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
    # Add a leading batch dimension and a trailing channel dimension, both 1
    X = tf.reshape(X, (1, ) + X.shape + (1, ))
    Y = conv2d(X)
    # Strip the batch (first) and channel (last) dimensions
    return tf.reshape(Y, Y.shape[1:3])

# 'same' pads 1 row and column on either side here, so a total of 2 rows or
# columns are added
conv2d = tf.keras.layers.Conv2D(1, kernel_size=3, padding='same')
X = tf.random.uniform(shape=(8, 8))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
TensorShape([8, 8])
When the height and width of the convolution kernel are different, we
can still give the output the same height and width as the input by
using different amounts of padding for the height and the width.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# We use a convolution kernel with height 5 and width 3. The padding on either
# side of the height and width are 2 and 1, respectively
conv2d = nn.LazyConv2d(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
torch.Size([8, 8])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# We use a convolution kernel with height 5 and width 3. The padding on
# either side of the height and width are 2 and 1, respectively
conv2d = nn.Conv2D(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(8, 8)
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# We use a convolution kernel with height 5 and width 3. The padding on
# either side of the height and width are 2 and 1, respectively
conv2d = nn.Conv(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(8, 8)
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# We use a convolution kernel with height 5 and width 3. The padding on
# either side of the height and width are 2 and 1, respectively
conv2d = tf.keras.layers.Conv2D(1, kernel_size=(5, 3), padding='same')
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
TensorShape([8, 8])
Stride
------
When computing the cross-correlation, we start with the convolution
window at the upper-left corner of the input tensor, and then slide it
over all locations both down and to the right. In the previous examples,
we defaulted to sliding one element at a time. However, sometimes,
either for computational efficiency or because we wish to downsample, we
move our window more than one element at a time, skipping the
intermediate locations. This is particularly useful if the convolution
kernel is large since it captures a large area of the underlying image.
We refer to the number of rows and columns traversed per slide as
*stride*. So far, we have used strides of 1, both for height and width.
Sometimes, we may want to use a larger stride.
:numref:`img_conv_stride` shows a two-dimensional cross-correlation
operation with a stride of 3 vertically and 2 horizontally. The shaded
portions are the output elements as well as the input and kernel tensor
elements used for the output computation:
:math:`0\times0+0\times1+1\times2+2\times3=8`,
:math:`0\times0+6\times1+0\times2+0\times3=6`. We can see that when the
second element of the first column is generated, the convolution window
slides down three rows. The convolution window slides two columns to the
right when the second element of the first row is generated. When the
convolution window continues to slide two columns to the right on the
input, there is no output because the remaining input elements cannot
fill the window (unless we add another column of padding).
.. _img_conv_stride:
.. figure:: ../img/conv-stride.svg
Cross-correlation with strides of 3 and 2 for height and width,
respectively.
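
The two shaded computations can be reproduced with a short plain-Python sketch. It assumes the input and kernel are those of :numref:`fig_correlation` (entries 0 to 8 and 0 to 3, respectively) with one row and column of zero padding on every side, which is consistent with the sums quoted above. ``corr2d_strided`` is a hypothetical helper, not a library function.

.. code:: python

   def corr2d_strided(X, K, pad=(0, 0), stride=(1, 1)):
       """Cross-correlate X with K using zero padding on every side
       and the given (height, width) strides."""
       (p_h, p_w), (s_h, s_w) = pad, stride
       n_h, n_w = len(X), len(X[0])
       k_h, k_w = len(K), len(K[0])
       # Zero-pad: p_h rows on top and bottom, p_w columns left and right
       P = [[0] * (n_w + 2 * p_w) for _ in range(n_h + 2 * p_h)]
       for i in range(n_h):
           for j in range(n_w):
               P[i + p_h][j + p_w] = X[i][j]
       Y = []
       # Move the window s_h rows or s_w columns at a time and stop as
       # soon as it no longer fits inside the padded input
       for top in range(0, len(P) - k_h + 1, s_h):
           row = []
           for left in range(0, len(P[0]) - k_w + 1, s_w):
               row.append(sum(P[top + a][left + b] * K[a][b]
                              for a in range(k_h) for b in range(k_w)))
           Y.append(row)
       return Y

   X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
   K = [[0, 1], [2, 3]]
   print(corr2d_strided(X, K, pad=(1, 1), stride=(3, 2)))  # [[0, 8], [6, 8]]

The entries 8 (first row, second column) and 6 (second row, first column) match the two sums given in the text.
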
In general, when the stride for the height is :math:`s_\textrm{h}` and
the stride for the width is :math:`s_\textrm{w}`, the output shape is
.. math:: \lfloor(n_\textrm{h}-k_\textrm{h}+p_\textrm{h}+s_\textrm{h})/s_\textrm{h}\rfloor \times \lfloor(n_\textrm{w}-k_\textrm{w}+p_\textrm{w}+s_\textrm{w})/s_\textrm{w}\rfloor.
If we set :math:`p_\textrm{h}=k_\textrm{h}-1` and
:math:`p_\textrm{w}=k_\textrm{w}-1`, then the output shape can be
simplified to
:math:`\lfloor(n_\textrm{h}+s_\textrm{h}-1)/s_\textrm{h}\rfloor \times \lfloor(n_\textrm{w}+s_\textrm{w}-1)/s_\textrm{w}\rfloor`.
Going a step further, if the input height and width are divisible by the
strides on the height and width, then the output shape will be
:math:`(n_\textrm{h}/s_\textrm{h}) \times (n_\textrm{w}/s_\textrm{w})`.
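
The general formula can be sketched as a helper that works on (height, width) pairs. The name ``conv_output_shape`` is made up for this example. Note that ``p`` holds the *total* padding per axis, as in the formula, whereas the ``padding`` arguments used in the PyTorch and MXNet snippets of this section give the amount added to each side, so ``padding=1`` corresponds to a total of 2.

.. code:: python

   def conv_output_shape(n, k, p=(0, 0), s=(1, 1)):
       """Output (height, width) for input shape n, kernel shape k,
       total padding p, and strides s."""
       return tuple((n_i - k_i + p_i + s_i) // s_i
                    for n_i, k_i, p_i, s_i in zip(n, k, p, s))

   # 8 x 8 input, 3 x 3 kernel, 1 pixel of padding per side, stride 2
   print(conv_output_shape((8, 8), (3, 3), p=(2, 2), s=(2, 2)))  # (4, 4)
   # The figure above: 3 x 3 input, 2 x 2 kernel, strides of 3 and 2
   print(conv_output_shape((3, 3), (2, 2), p=(2, 2), s=(3, 2)))  # (2, 2)
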
Below, we set the strides on both the height and width to 2, thus
halving the input height and width.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1, stride=2)
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
torch.Size([4, 4])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = nn.Conv2D(1, kernel_size=3, padding=1, strides=2)
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(4, 4)
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = nn.Conv(1, kernel_size=(3, 3), padding=1, strides=2)
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(4, 4)
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = tf.keras.layers.Conv2D(1, kernel_size=3, padding='same', strides=2)
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
TensorShape([4, 4])
Let’s look at a slightly more complicated example, with a non-square
kernel and different strides for the height and the width.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = nn.LazyConv2d(1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
torch.Size([2, 2])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = nn.Conv2D(1, kernel_size=(3, 5), padding=(0, 1), strides=(3, 4))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(2, 2)
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = nn.Conv(1, kernel_size=(3, 5), padding=(0, 1), strides=(3, 4))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(2, 2)
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
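# Keras only accepts 'valid' or 'same' here, not per-axis padding amounts.
# With 'valid' (no padding) the output width is 1 rather than 2
conv2d = tf.keras.layers.Conv2D(1, kernel_size=(3, 5), padding='valid',
                                strides=(3, 4))
comp_conv2d(conv2d, X).shape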
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
TensorShape([2, 1])
Summary and Discussion
----------------------
Padding can increase the height and width of the output. This is often
used to give the output the same height and width as the input to avoid
undesirable shrinkage of the output. Moreover, it makes pixel usage more
even, since pixels near the border are covered by more convolution
windows than they would be without padding. Typically we pick symmetric
padding on both sides of the input height and width. In this case we
refer to :math:`(p_\textrm{h}, p_\textrm{w})` padding. Most commonly we
set :math:`p_\textrm{h} = p_\textrm{w}`, in which case we simply state
that we choose padding :math:`p`.
A similar convention applies to strides. When the vertical stride
:math:`s_\textrm{h}` and the horizontal stride :math:`s_\textrm{w}`
match, we simply talk about stride :math:`s`. The stride can reduce the
resolution of the output: for example, with suitable padding, a stride
of :math:`s > 1` reduces the height and width of the output to roughly
:math:`1/s` of the height and width of the input. By default, the
padding is 0 and the stride is 1.
So far all padding that we discussed simply extended images with zeros.
This has significant computational benefit since it is trivial to
accomplish. Moreover, operators can be engineered to take advantage of
this padding implicitly without the need to allocate additional memory.
At the same time, it allows CNNs to encode implicit position information
within an image, simply by learning where the “whitespace” is. There are
many alternatives to zero-padding.
:cite:t:`Alsallakh.Kokhlikyan.Miglani.ea.2020` provided an extensive
overview of those (albeit without a clear case for when to use nonzero
paddings unless artifacts occur).
Exercises
---------
1. Given the final code example in this section with kernel size
:math:`(3, 5)`, padding :math:`(0, 1)`, and stride :math:`(3, 4)`,
calculate the output shape to check if it is consistent with the
experimental result.
2. For audio signals, what does a stride of 2 correspond to?
3. Implement mirror padding, i.e., padding where the border values are
simply mirrored to extend tensors.
4. What are the computational benefits of a stride larger than 1?
5. What might be statistical benefits of a stride larger than 1?
6. How would you implement a stride of :math:`\frac{1}{2}`? What does it
correspond to? When would this be useful?
`Discussions `__