Paper - AdderNet: Do We Really Need Multiplications in Deep Learning?
Review Note

1 Introduction
Origin:
- CVPR 2020 Oral
Author:
- Peking University
- Huawei
- The University of Sydney
Task:
- Energy efficient networks
Existing Methods:
- Network pruning
- Efficient block design (e.g. MobileNet, ShuffleNet)
- Knowledge Distillation
- Low-bit Quantization (e.g. Binary Neural Network)
Limitations:
- These methods still contain massive numbers of multiplications
Proposed Method:
The convolution operation in today's convolutional neural networks (CNNs) involves a large number of multiplications. Although many lightweight networks (such as MobileNet) have been proposed, the cost of these multiplications is still hard to ignore. To deploy deep learning on edge devices, computation cost and energy consumption need to be reduced further, so this paper proposes to replace multiplications with additions when building deep neural networks.
The paper points out that the traditional convolution operation is actually a cross-correlation used to measure the similarity between input features and convolution kernels, and that this cross-correlation introduces many multiplications, which increase computation cost and energy consumption. The paper therefore proposes another way to measure the similarity between input features and convolution kernels, namely the ℓ1 distance, which can be computed with additions (subtractions) alone.
2 Method
2.1 Similarity Measurement
Suppose that $F \in \mathbb{R}^{d \times d \times c_{in} \times c_{out}}$ is a filter with kernel size $d$, $c_{in}$ input channels and $c_{out}$ output channels, and that $X \in \mathbb{R}^{H \times W \times c_{in}}$ is the input feature map. The output feature $Y$ measures the similarity between the filter and the input:
$$Y(m,n,t) = \sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} S\big(X(m+i, n+j, k),\, F(i, j, k, t)\big),$$
where $S(\cdot,\cdot)$ is a pre-defined similarity (distance) measure.
2.2 Convolution Kernel
If $S(x, y) = x \times y$, the formula above becomes the standard convolution operation (strictly speaking, cross-correlation): each output value is obtained by multiplying the input patch with the filter element-wise and summing the products, which is exactly where the massive number of multiplications comes from.
2.3 Addition Kernel
As mentioned above, if the similarity is instead measured with the $\ell_1$ distance, i.e. $S(x, y) = -|x - y|$, the output becomes
$$Y(m,n,t) = -\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} \big|X(m+i, n+j, k) - F(i, j, k, t)\big|,$$
which can be computed with only additions, subtractions and absolute values.
Here the author notes that the results of this operation are always negative, whereas the output of a traditional convolution layer can be either positive or negative; therefore a Batch Normalization (BN) layer is placed after the output, which brings the output distribution back into a reasonable range.
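To make the operation concrete, here is a minimal NumPy sketch of the adder "convolution" forward pass. This is my own illustration, not the authors' code: the name `adder_conv2d` is made up, stride 1 and no padding are assumed, and the explicit loops are for clarity rather than speed.

```python
import numpy as np

def adder_conv2d(X, F):
    """Adder layer forward pass: negative L1 distance between each input
    patch and each filter, i.e. Y(m, n, t) = -sum |X - F| as above.

    X: input feature map, shape (H, W, c_in)
    F: filters, shape (d, d, c_in, c_out)
    Returns Y of shape (H - d + 1, W - d + 1, c_out); all values are <= 0.
    """
    H, W, c_in = X.shape
    d, _, _, c_out = F.shape
    Y = np.zeros((H - d + 1, W - d + 1, c_out))
    for m in range(H - d + 1):
        for n in range(W - d + 1):
            patch = X[m:m + d, n:n + d, :]                      # (d, d, c_in)
            # -sum |X - F| over the patch, for every output channel t
            Y[m, n, :] = -np.abs(patch[..., None] - F).sum(axis=(0, 1, 2))
    return Y

# Toy usage: a 6x6 single-channel input and four 3x3 adder filters.
X = np.random.randn(6, 6, 1)
F = np.random.randn(3, 3, 1, 4)
Y = adder_conv2d(X, F)
print(Y.shape, Y.max())  # (4, 4, 4); Y.max() <= 0
```

A BN layer would then normalize these all-negative outputs, as described above.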
2.4 Optimization
In traditional convolutional networks, the partial derivative of the output with respect to the filter is simply the input value,
$$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = X(m+i, n+j, k),$$
and the derivative with respect to the input is obtained analogously from the filter values, so both gradients are exact and cheap to compute.
In AdderNet, the exact partial derivative of the $\ell_1$-distance output with respect to the filter is a sign function:
$$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = \mathrm{sgn}\big(X(m+i, n+j, k) - F(i,j,k,t)\big).$$
It is pointed out in the paper that this sign gradient corresponds to signSGD, which almost never follows the direction of steepest descent and behaves worse as the dimension grows. Therefore, the paper instead uses the full-precision gradient
$$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = X(m+i, n+j, k) - F(i,j,k,t).$$
The derivative with respect to the input $X$ is handled differently. In AdderNet, the full-precision gradient with respect to the input would be $F(i,j,k,t) - X(m+i, n+j, k)$. Considering that the derivative of $X$ propagates to all earlier layers through the chain rule, its magnitude could explode if left unbounded, so the paper clips it with a HardTanh function:
$$\frac{\partial Y(m,n,t)}{\partial X(m+i, n+j, k)} = \mathrm{HT}\big(F(i,j,k,t) - X(m+i, n+j, k)\big),$$
where $\mathrm{HT}(x) = x$ if $-1 \le x \le 1$ and $\mathrm{sgn}(x)$ otherwise.
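Below is a small NumPy sketch of how these two modified gradients could be computed for a single output position. It only reflects my reading of the formulas above; the function name `adder_grads` is made up.

```python
import numpy as np

def adder_grads(patch, f):
    """Gradients of y = -sum(|patch - f|) at one output position.

    patch: input patch, shape (d, d, c_in)
    f:     one filter,  shape (d, d, c_in)
    Returns (dY_dF, dY_dX):
      dY_dF -- full-precision gradient X - F (used instead of sgn(X - F))
      dY_dX -- HardTanh-clipped gradient HT(F - X), bounded to [-1, 1]
               to avoid explosion through the chain rule
    """
    dY_dF = patch - f                      # full-precision gradient w.r.t. F
    dY_dX = np.clip(f - patch, -1.0, 1.0)  # clipped gradient w.r.t. X
    return dY_dF, dY_dX
```

In a full layer these per-position terms would be multiplied by $\partial\ell/\partial Y(m,n,t)$ and accumulated over all output positions, as in the training sketch after Section 2.6.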
2.5 Adaptive learning rate
In a traditional CNN we normally expect the output distributions of successive layers to be similar, so that training stays stable. Considering the variance of the output features, assume that the weights and the input features are independent and identically distributed, each following a normal distribution. The output variance of a convolution layer is then
$$\mathrm{Var}[Y_{CNN}] = \sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} \mathrm{Var}[X \times F] = d^2 c_{in}\,\mathrm{Var}[X]\,\mathrm{Var}[F].$$
Similarly, the output variance in AdderNet satisfies
$$\mathrm{Var}[Y_{AdderNet}] = \sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} \mathrm{Var}\big[|X - F|\big] \;\propto\; d^2 c_{in}\,\big(\mathrm{Var}[X] + \mathrm{Var}[F]\big).$$
The variance of the CNN output is small because $\mathrm{Var}[F]$ is typically tiny in practice (on the order of $10^{-3}$ or $10^{-4}$), so the product $\mathrm{Var}[X]\,\mathrm{Var}[F]$ is small; in AdderNet the sum $\mathrm{Var}[X] + \mathrm{Var}[F]$ appears instead, so the output variance is much larger.
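As a rough illustration of the gap (the numbers below are hypothetical, chosen by me only to show orders of magnitude, not taken from the paper), take $d = 3$, $c_{in} = 64$, $\mathrm{Var}[X] = 1$ and $\mathrm{Var}[F] = 10^{-4}$:
$$\mathrm{Var}[Y_{CNN}] = 3^2 \cdot 64 \cdot 1 \cdot 10^{-4} \approx 0.06, \qquad \mathrm{Var}[Y_{AdderNet}] \propto 3^2 \cdot 64 \cdot (1 + 10^{-4}) \approx 576.$$
The adder output thus varies on a scale several orders of magnitude larger, which is why the BN layer below is essential.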
In AdderNet, the output of each adder layer is followed by a BN layer. Although the BN layer introduces some multiplications, their number is negligible compared with the multiplications in a classical convolution layer. Given a mini-batch input $x = \{x_1, \ldots, x_m\}$, the BN layer computes
$$y_i = \gamma\,\frac{x_i - \mu_B}{\sigma_B} + \beta,$$
where $\mu_B$ and $\sigma_B^2$ are the mean and variance of $x$ over the mini-batch, and $\gamma$, $\beta$ are learnable parameters.
Then the partial derivative of the loss $\ell$ with respect to $x_i$ is scaled by $\gamma/\sigma_B$: the larger the variance of the adder output, the larger $\sigma_B$, and the smaller the gradient that flows back to the adder filters.
Since the weight gradient is scaled by the inverse of the output standard deviation, the weight gradients in an AdderNet with BN layers become very small. The paper compares the $\ell_2$-norm of the per-layer weight gradients of an AdderNet and a CNN (Table 1 in the original paper; not reproduced here).
Besides showing that the AdderNet gradients are small, the table also shows that the gradient magnitudes differ from layer to layer, so a single global learning rate is no longer appropriate. Therefore, the paper adopts an adaptive learning rate that is different in each layer. The filter update of layer $l$ is
$$\Delta F_l = \gamma \times \alpha_l \times \Delta L(F_l),$$
where $\gamma$ is the global learning rate of the whole network, $\Delta L(F_l)$ is the gradient of the filters in layer $l$, and $\alpha_l$ is the local learning rate of that layer:
$$\alpha_l = \frac{\eta\sqrt{k}}{\|\Delta L(F_l)\|_2},$$
where $k$ is the number of elements in $F_l$ and $\eta$ is a hyper-parameter controlling the learning rate of the adder filters.
With this adjustment, the effective step size automatically adapts to the gradient magnitude of each layer.
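A minimal sketch of how this per-layer scaling could be applied during the weight update (again my own illustration, not the authors' code; `global_lr` and `eta` stand for $\gamma$ and $\eta$ above, and their default values are arbitrary):

```python
import numpy as np

def adaptive_update(F_l, grad_l, global_lr=0.1, eta=0.1):
    """One adaptive-learning-rate step for the adder filters of layer l.

    F_l:    filters of layer l (any shape)
    grad_l: gradient of the loss w.r.t. F_l (same shape)
    """
    k = grad_l.size                              # number of elements in F_l
    grad_norm = np.linalg.norm(grad_l) + 1e-12   # ||Delta L(F_l)||_2 (epsilon avoids /0)
    alpha_l = eta * np.sqrt(k) / grad_norm       # local learning rate of layer l
    return F_l - global_lr * alpha_l * grad_l    # F_l <- F_l - gamma * alpha_l * grad
```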
2.6 Training Procedure
The above covers the new ideas and designs proposed in this paper. The paper also gives a step-by-step description of the forward and backward propagation of AdderNet (Algorithm 1 in the original paper): compute the adder outputs with the $\ell_1$-distance forward pass, normalize them with BN, back-propagate using the full-precision and clipped gradients, and update each layer's filters with its adaptive learning rate.
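As a stand-in for the missing algorithm figure, here is a hedged sketch of one training iteration that reuses `adder_conv2d`, `adder_grads` and `adaptive_update` from the earlier sketches. The toy loss and the omission of the BN forward/backward are my simplifications; the paper's actual procedure includes BN.

```python
import numpy as np

def adder_layer_backward(X, F, dL_dY):
    """Accumulate dL/dF and dL/dX over all output positions, using the
    full-precision gradient for F and the clipped gradient for X."""
    H, W, c_in = X.shape
    d, _, _, c_out = F.shape
    dL_dF = np.zeros_like(F)
    dL_dX = np.zeros_like(X)
    for m in range(H - d + 1):
        for n in range(W - d + 1):
            patch = X[m:m + d, n:n + d, :]
            for t in range(c_out):
                dY_dF, dY_dX = adder_grads(patch, F[..., t])
                dL_dF[..., t] += dL_dY[m, n, t] * dY_dF
                dL_dX[m:m + d, n:n + d, :] += dL_dY[m, n, t] * dY_dX
    return dL_dF, dL_dX

# One illustrative iteration: forward, toy loss, backward, adaptive update.
X = np.random.randn(6, 6, 1)
F = np.random.randn(3, 3, 1, 4)
Y = adder_conv2d(X, F)             # forward (a BN layer would normalize Y here)
dL_dY = np.ones_like(Y) / Y.size   # gradient of a toy loss L = mean(Y)
dL_dF, dL_dX = adder_layer_backward(X, F, dL_dY)
F = adaptive_update(F, dL_dF)      # layer-wise adaptive learning rate step
```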
3 Experiment
See the original paper for the comparison of experimental data and results.
4 Thinking
When I first read this paper, I thought it was another variant of the Binary Neural Network, but after reading it carefully I found that it is actually a new way of reducing the cost of the floating-point operations themselves. Its computation cost sits between quantized and full-precision computation, with accuracy comparable to traditional CNNs, although its storage footprint is larger than that of quantized CNNs. I think combining it with other methods (such as quantization and pruning) and other algorithms could lead to more new ideas; it is a direction worth studying.