The ResNet architecture consists of blocks of four basic operations: Convolution, Batch Normalization, ReLu, and the residual connection.

\subsection{Convolution}

The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}: it is shorthand for a linear map $C: N^*\otimes P^*\otimes H^*\otimes W^*\rightarrow N^*\otimes P^*\otimes H^*\otimes W^*$. Since the same operation is applied to every image in the batch, we can represent $C$ with a type (3, 3) tensor whose entries give the coefficient for each pixel in each input channel, for each pixel in each output channel. We now develop an algorithm for representing discrete convolutional filters in this data structure.

A naive algorithm could simply copy randomly initialized convolution weights into this larger structure following the formula for convolution, and then apply the JPEG compression and decompression tensors to the result. However, this is difficult to parallelize and incurs additional memory overhead to store the spatial-domain operation. A more efficient algorithm would produce the JPEG-domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Since $\widetilde{J}\in X \otimes Y \otimes K \otimes H^*\otimes W^*$, the last two indices of $\widetilde{J}$ form a single-channel image under our image model (the last two indices are in $H^*\otimes W^*$). If the convolution can be applied to this ``image'', then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor $\widehat{J}\in N \otimes H^*\otimes W^*$ by reshaping $\widetilde{J}$ and treating it as a batch of images\footnote{Consider as a concrete example $32\times32$ images. Then $\widetilde{J}$ is of shape $4\times4\times64\times32\times32$, and the described reshaping gives $\widehat{J}$ of shape $1024\times1\times32\times32$, which can be treated as a batch of 1024 single-channel $32\times32$ images for convolution.}. Then, given randomly initialized filter weights $K$, computing
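The footnote's reshaping can be written down directly; a minimal NumPy sketch with random stand-in values (the real $\widetilde{J}$ would be the JPEG decompression tensor, not noise):

```python
import numpy as np

# For 32x32 images the decompression tensor J_tilde has shape
# (x, y, k, h, w) = (4, 4, 64, 32, 32); random stand-in values here.
rng = np.random.default_rng(0)
x, y, k, h, w = 4, 4, 64, 32, 32
J_tilde = rng.standard_normal((x, y, k, h, w))

# Treat the last two indices as a single-channel image: reshaping gives
# a batch of x * y * k = 1024 images of size 32x32, ready for convolution.
J_hat = J_tilde.reshape(x * y * k, 1, h, w)
print(J_hat.shape)  # (1024, 1, 32, 32)
```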

Next, we express the DCT as a linear map such that $X = DY$ and rewrite the previous equation as

\begin{equation}

\var[X] = \e[(DY)^2]

\end{equation}

Squaring gives

\begin{equation}

\e[(DY)^2] = \e[(D^TD)Y^2]

\end{equation}
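The role of orthogonality in the manipulation above can be checked numerically. A minimal sketch with an explicit orthonormal DCT-II matrix (not the paper's code): since $D^TD = I$, the total squared magnitude of $DY$ equals that of $Y$.

```python
import numpy as np

# Orthonormal 8-point DCT-II matrix: D @ y computes the DCT of y.
N = 8
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * j + 1) * i / (2 * N))
D[0] /= np.sqrt(2.0)  # rescaling the DC row makes D orthonormal

# D is orthogonal: D^T D = I.
assert np.allclose(D.T @ D, np.eye(N))

# Hence the sum of squares of the DCT coefficients equals that of the
# signal: the D^T D factor drops out when summed over coefficients.
rng = np.random.default_rng(0)
Y = rng.standard_normal(N)
assert np.allclose(np.sum((D @ Y) ** 2), np.sum(Y**2))
```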

...

Since $D$ is orthogonal this simplifies to

We conclude by outlining in pseudocode the algorithms for the three layer operations described in the paper. Algorithm \ref{alg:dce} gives the code for convolution explosion, Algorithm \ref{alg:asmr} gives the code for the ASM ReLu approximation, and Algorithm \ref{alg:bn} gives the code for Batch Normalization.

\captionof{algorithm}{Convolution Explosion. $K$ is an initial filter, $p, p'$ are the input and output channels, $h, w$ are the image height and width, $s$ is the stride, and $\star_s$ denotes the discrete convolution with stride $s$. $J$ and $\widetilde{J}$ are constants of shape $(x, y, k, h, w)$ with $y = h/8$, $x = w/8$, and $k = 64$.}
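The inputs the caption lists can be sketched in NumPy. The sketch below assumes a single channel and stride 1, and the final recompression contraction is an illustrative assumption rather than the paper's kernel; it does show how the result ends up a type (3, 3) tensor.

```python
import numpy as np

def explode(K, J, J_tilde):
    """Sketch of Convolution Explosion, single channel, stride 1.

    K:       (kh, kw) randomly initialized spatial filter
    J:       (x, y, k, h, w) JPEG compression tensor
    J_tilde: (x, y, k, h, w) JPEG decompression tensor
    Shapes and the final contraction are illustrative assumptions.
    """
    x, y, k, h, w = J_tilde.shape
    # Reshape the decompression tensor into a batch of h x w images.
    J_hat = J_tilde.reshape(x * y * k, h, w)
    # 'Same'-padded convolution of every basis image with K.
    pad = K.shape[0] // 2
    padded = np.pad(J_hat, ((0, 0), (pad, pad), (pad, pad)))
    windows = np.lib.stride_tricks.sliding_window_view(padded, K.shape, axis=(1, 2))
    conv = np.einsum("nijuv,uv->nij", windows, K).reshape(x, y, k, h, w)
    # Recompress with J: contract the spatial indices to obtain the
    # JPEG-domain operator, a type (3, 3) tensor as described in the text.
    return np.einsum("xykhw,uvlhw->xykuvl", J, conv)

# Random stand-in tensors for 16x16 images (so x = y = 2, k = 64).
rng = np.random.default_rng(0)
J = rng.standard_normal((2, 2, 64, 16, 16))
J_tilde = rng.standard_normal((2, 2, 64, 16, 16))
K = rng.standard_normal((3, 3))
C = explode(K, J, J_tilde)
print(C.shape)  # (2, 2, 64, 2, 2, 64)
```

In a real kernel the per-image convolutions are independent and map naturally onto a GPU batch dimension, which is the efficiency argument made in the text.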