In this section, we’ll use what we learned in the last section (see here) to implement a ResNet (paper).
Note that we follow the original paper. Our implementation is a simpler version of the official torchvision implementation: we implement only the key structure and the random weight initialization, and we do not consider dilation or other extras.
Preliminaries: Calculate the feature map size
Basic formula
Given a convolution kernel of size K, padding P, stride S, and an input feature map of size I, the output size is O = floor((I - K + 2P) / S) + 1.
Corollary
Based on the formula above, when S=1:
K=3, P=1 keeps the output size equal to the input size.
K=1, P=0 keeps the output size equal to the input size.
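The formula and its corollary can be checked with a few lines of Python. This is a minimal sketch: the helper name `conv_output_size` is made up for illustration, and the floor division reflects how the division is rounded down when it is not exact.

```python
def conv_output_size(i: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output size of a convolution along one spatial dimension."""
    # Floor division rounds down when (i + 2p - k) is not divisible by s.
    return (i + 2 * p - k) // s + 1


# With stride 1, both settings from the corollary keep the size unchanged.
same_3x3 = conv_output_size(56, k=3, p=1, s=1)
same_1x1 = conv_output_size(56, k=1, p=0, s=1)

# With stride 2, the same 3x3 setting halves a 56-wide feature map.
halved = conv_output_size(56, k=3, p=1, s=2)

print(same_3x3, same_1x1, halved)  # 56 56 28
```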
Overall Structure
Table 1 in the original paper illustrates the overall structure of ResNet:
From conv2 onward, each layer consists of several blocks, and the blocks in the 18- and 34-layer networks differ from those in the 50-, 101-, and 152-layer networks.
We can make several deductions:
When the feature map enters the next layer, the first block needs to downsample it. This is done by setting the stride of one of the block’s convolutions to 2.
For all other convolutions, the feature map size stays the same, so their settings are the ones given in Preliminaries.
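To make these deductions concrete, the sketch below traces the feature map size through the stages for a 224x224 input, using the formula from Preliminaries. The kernel sizes and strides follow Table 1 of the paper. The padding values (3 for the 7x7 convolution, 1 for the max pool and the 3x3 convolutions) are assumptions borrowed from the torchvision implementation, since the paper does not list them.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size along one spatial dimension (floor division)."""
    return (i + 2 * p - k) // s + 1


# (name, kernel, padding, stride) of the size-changing operation in each stage.
stages = [
    ("conv1", 7, 3, 2),
    ("conv2 (max pool)", 3, 1, 2),
    ("conv3 (first block)", 3, 1, 2),
    ("conv4 (first block)", 3, 1, 2),
    ("conv5 (first block)", 3, 1, 2),
]

size = 224
sizes = []
for name, k, p, s in stages:
    size = conv_output_size(size, k, p, s)
    sizes.append(size)
    print(f"{name}: {size}x{size}")

# sizes is now [112, 56, 28, 14, 7]
```

Every other convolution inside a stage uses stride 1 with K=3, P=1 or K=1, P=0, so it leaves these sizes unchanged.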
Basic Block Implementation
The basic block’s structure looks like this:
Please see the code below. Apart from channels, which defines the number of channels in the block, we have three additional parameters, in_channels, stride, and downsample, so that the block can also serve as the FIRST block in each layer.
According to the ResNet structure, for example, the first block in layer3 (conv3_x in Table 1 of the paper) receives a 64*56*56 input. This block has two tasks:
Reduce the feature map size to 28*28, so its stride must be set to 2.
Increase the number of channels from 64 to 128, so in_channels should be 64.
In addition, since the input is 64*56*56 while the output is 128*28*28, we need a downsample convolution on the shortcut so that the residual matches the output size.
    def forward(self, x):
        residual = x
        x = self.conv1(x)
        x = self.batchnorm1(x)
        x = self.relu1(x)
        x = self.conv2(x)
        x = self.batchnorm2(x)
        if self.downsample:
            residual = self.downsample(residual)
        x += residual
        x = self.relu2(x)
        return x
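The forward() above relies on submodules created in the initializer, which is not shown here. The following is a minimal sketch of what a complete block could look like. The attribute names are taken from forward(), the class name ResidualBasicBlock and the expansion attribute from the ResNet Base section, and the parameter order, `bias=False`, and `expansion = 1` are assumptions rather than the author’s exact code.

```python
import torch
from torch import nn


class ResidualBasicBlock(nn.Module):
    expansion = 1  # assumed: output channels = channels * expansion

    def __init__(self, in_channels, channels, stride=1, downsample=None):
        super().__init__()
        # The first 3x3 convolution carries the stride (2 in the first block of a layer).
        self.conv1 = nn.Conv2d(in_channels, channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.batchnorm1 = nn.BatchNorm2d(channels)
        self.relu1 = nn.ReLU()
        # The second 3x3 convolution keeps the size (K=3, P=1, S=1).
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.batchnorm2 = nn.BatchNorm2d(channels)
        self.relu2 = nn.ReLU()
        self.downsample = downsample

    def forward(self, x):
        residual = x
        x = self.relu1(self.batchnorm1(self.conv1(x)))
        x = self.batchnorm2(self.conv2(x))
        if self.downsample is not None:
            residual = self.downsample(residual)
        x = x + residual
        return self.relu2(x)


# First block of layer3: 64*56*56 in, 128*28*28 out.
downsample = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)
block = ResidualBasicBlock(64, 128, stride=2, downsample=downsample)
out = block(torch.randn(1, 64, 56, 56))
print(out.shape)  # torch.Size([1, 128, 28, 28])
```

The 1x1 shortcut convolution with stride 2 maps 56 to 28 by the formula in Preliminaries, so the residual and the main path can be added.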
Bottleneck Block Implementation
The bottleneck block’s structure looks like this:
To reduce the computation cost, the bottleneck block uses a 1x1 convolution to map a high number of channels (e.g., 256) to a low one (e.g., 64), and performs the 3x3 convolution on the reduced channels. Then it maps the 64 channels back to 256 with another 1x1 convolution.
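A quick weight count shows how much the bottleneck saves. The sketch below compares the 256 -> 64 -> 64 -> 256 bottleneck with a hypothetical block of two 3x3 convolutions applied directly to 256 channels. Biases and batch normalization parameters are ignored, and the helper name `conv_weights` is made up for illustration.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution without bias."""
    return c_in * c_out * k * k


# Bottleneck: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256).
bottleneck = (conv_weights(256, 64, 1)
              + conv_weights(64, 64, 3)
              + conv_weights(64, 256, 1))

# Two 3x3 convolutions working directly on 256 channels.
two_3x3 = 2 * conv_weights(256, 256, 3)

print(bottleneck, two_3x3)  # 69632 1179648
```

Under these assumptions the bottleneck needs roughly 17 times fewer weights for the same input and output width.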
Please see the code below. As with the basic block, we have three additional parameters, in_channels, stride, and downsample, so that the block can also serve as the FIRST block in each layer. The reasons are the same as above.
    def forward(self, x):
        residual = x
        x = self.conv1(x)
        x = self.batchnorm1(x)
        x = self.relu1(x)

        x = self.conv2(x)
        x = self.batchnorm2(x)
        x = self.relu2(x)

        x = self.conv3(x)
        x = self.batchnorm3(x)

        if self.downsample:
            residual = self.downsample(residual)

        x += residual
        x = self.relu3(x)
        return x
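As with the basic block, the initializer is not shown, so here is a hedged sketch of a complete bottleneck block. The attribute names come from forward() and the class name ResidualBottleNeck from the ResNet Base section. `expansion = 4` and `bias=False` are assumptions. Placing the stride on the 3x3 convolution is also an assumption: torchvision does this, whereas the original paper’s variant places it on the first 1x1 convolution.

```python
import torch
from torch import nn


class ResidualBottleNeck(nn.Module):
    expansion = 4  # assumed: output channels = channels * expansion

    def __init__(self, in_channels, channels, stride=1, downsample=None):
        super().__init__()
        out_channels = channels * self.expansion
        # 1x1: reduce the number of channels.
        self.conv1 = nn.Conv2d(in_channels, channels, kernel_size=1, bias=False)
        self.batchnorm1 = nn.BatchNorm2d(channels)
        self.relu1 = nn.ReLU()
        # 3x3: spatial convolution on the reduced channels (carries the stride here).
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.batchnorm2 = nn.BatchNorm2d(channels)
        self.relu2 = nn.ReLU()
        # 1x1: expand the channels again.
        self.conv3 = nn.Conv2d(channels, out_channels, kernel_size=1, bias=False)
        self.batchnorm3 = nn.BatchNorm2d(out_channels)
        self.relu3 = nn.ReLU()
        self.downsample = downsample

    def forward(self, x):
        residual = x
        x = self.relu1(self.batchnorm1(self.conv1(x)))
        x = self.relu2(self.batchnorm2(self.conv2(x)))
        x = self.batchnorm3(self.conv3(x))
        if self.downsample is not None:
            residual = self.downsample(residual)
        x = x + residual
        return self.relu3(x)


# First block of a layer: 256*56*56 in, 512*28*28 out.
downsample = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(512),
)
first = ResidualBottleNeck(256, 128, stride=2, downsample=downsample)
out = first(torch.randn(1, 256, 56, 56))
print(out.shape)  # torch.Size([1, 512, 28, 28])
```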
ResNet Base Implementation
Then we can put things together to form the ResNet model! The whole structure is straightforward. We define the submodules one by one, and implement the forward() function.
There are only two tricky points:
To let ResNetBase support both block types, the block class is passed to its initializer. Because the two blocks set their channels slightly differently, ResidualBasicBlock and ResidualBottleNeck each have an attribute called expansion, which makes it easy to set the correct number of channels and outputs.
See the _make_layer function below. It needs to determine whether a downsample is required. The condition and its explanation are given in the comments.
    downsample = None
    if stride != 1 or self.in_channels != channel*self.block.expansion:
        # Use downsample to match the dimension in two cases:
        # 1. stride != 1, meaning we should downsample H, W in this layer.
        #    Then we need to match the residual's H, W and the output's H, W of this layer.
        # 2. self.in_channels != channel*block.expansion, meaning we should increase C in this layer.
        #    Then we need to match the residual's C and the output's C of this layer.
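The snippet above stops at the condition, so the sketch below shows one way the rest of the layer construction could look. It is written as a standalone function, `make_layer`, rather than the author’s `_make_layer` method. `TinyBasicBlock` is a hypothetical stand-in that keeps the example short, and the 1x1 convolution plus batch normalization shortcut is an assumption modeled on torchvision.

```python
import torch
from torch import nn


class TinyBasicBlock(nn.Module):
    """Hypothetical stand-in for the basic block, kept short for this sketch."""
    expansion = 1

    def __init__(self, in_channels, channels, stride=1, downsample=None):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, channels, 3, stride, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, 1, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.downsample = downsample
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = x if self.downsample is None else self.downsample(x)
        return self.relu(self.body(x) + residual)


def make_layer(block, in_channels, channel, num_blocks, stride=1):
    """Build one layer; return it together with its number of output channels."""
    out_channels = channel * block.expansion
    downsample = None
    if stride != 1 or in_channels != out_channels:
        # Match the residual's H, W (via stride) and C (via the 1x1 kernel).
        downsample = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1,
                      stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )
    # Only the first block changes H, W, or C; the rest keep the shape.
    blocks = [block(in_channels, channel, stride, downsample)]
    for _ in range(1, num_blocks):
        blocks.append(block(out_channels, channel))
    return nn.Sequential(*blocks), out_channels


layer, out_channels = make_layer(TinyBasicBlock, 64, 128, num_blocks=2, stride=2)
out = layer(torch.randn(1, 64, 56, 56))
print(out.shape, out_channels)  # torch.Size([1, 128, 28, 28]) 128
```

Returning the new channel count plays the role of updating self.in_channels in a method-based version, so the next layer knows its input width.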