We introduce a general method of performing Residual Network inference and learning in the JPEG transform domain that allows the network to consume compressed images as input. Our formulation leverages the linearity of the JPEG transform to redefine convolution and batch normalization with a tune-able numerical approximation for ReLu. The result is mathematically equivalent to the spatial domain network up to the ReLu approximation accuracy. A formulation for image classification and a model conversion algorithm for spatial domain networks are given as examples of the method. We show that the sparsity of the JPEG format allows for faster processing of images with little to no penalty in the network accuracy.

We introduce a general method of performing Residual Network inference and learning in the JPEG transform domain that allows the network to consume compressed images as input. Our formulation leverages the linearity of the JPEG transform to redefine convolution and batch normalization with a tune-able numerical approximation for ReLu. The result is mathematically equivalent to the spatial domain network up to the ReLu approximation accuracy. A formulation for image classification and a model conversion algorithm for spatial domain networks are given as examples of the method. We show skipping the costly decompression step allows for faster processing of images with little to no penalty in the network accuracy.

We say that $I'$ is the representation of $I$ in the JPEG transform domain. The indices $h,w$ give pixel position, $x,y$ give block position, and $k$ gives the offset into the block.

The form of $J$ is constructed from the JPEG compression steps listed in the previous section. Let the linear map $B: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimesI^*\otimesJ^*$ be defined as

The form of $J$ is constructed from the JPEG compression steps listed in the previous section. Let the linear map $B: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimesM^*\otimesN^*$ be defined as

\begin{equation}

B^{hw}_{xyij} = \left\{\begin{array}{lr} 1 &\text{$h,w$ belongs in block $x,y$ at offset $i,j$}\\ 0 &\text{otherwise}\end{array}\right.

B^{hw}_{xymn} = \left\{\begin{array}{lr} 1 &\text{$h,w$ belongs in block $x,y$ at offset $m,n$}\\ 0 &\text{otherwise}\end{array}\right.

\end{equation}

then $B$ can be used to break the image represented by $I$ into blocks of a given size such that the first two indices $x,y$ index the block position and the last two indices $i,j$ index the offset into the block.

then $B$ can be used to break the image represented by $I$ into blocks of a given size such that the first two indices $x,y$ index the block position and the last two indices $m,n$ index the offset into the block.

Next, let the linear map $D: I^*\otimesJ^*\rightarrow A^*\otimes B^*$ be defined as

Next, let the linear map $D: M^*\otimesN^*\rightarrow A^*\otimes B^*$ be defined as

then $D$ represents the 2D discrete forward (and inverse) DCT. Let $Z: A^*\otimes B^*\rightarrow\Gamma^*$ be defined as

where $V(u)$ is a normalizing scale factor. Then $D$ represents the 2D discrete forward (and inverse) DCT. Let $Z: A^*\otimes B^*\rightarrow\Gamma^*$ be defined as

\begin{equation}

Z^{\alpha\beta}_\gamma = \left\{\begin{array}{lr} 1 &\text{$\alpha, \beta$ is at $\gamma$ under zigzag ordering}\\ 0 &\text{otherwise}\end{array}\right.

...

...

@@ -75,7 +75,7 @@ where $q_k$ is a quantization coefficient. This scales the vector entries by the

With linear maps for each step of the JPEG transform, we can then create the $J$ tensor described at the beginning of this section

The inverse mapping also exists as a tensor $\widetilde{J}$ which can be defined using the same linear maps with the exception of $S$. Let $\widetilde{S}$ be

...

...

@@ -86,9 +86,8 @@ The inverse mapping also exists as a tensor $\widetilde{J}$ which can be defined

noting that, for all tensors other than $\widetilde{S}$, we have freely raised and lowered indices without the use of a metric tensor since we consider only the standard orthonormal basis, as stated earlier.

Next consider a linear map $C: H^*\otimes W^*\rightarrow H^*\otimes W^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: X^*\otimes Y^*\otimes K^*\rightarrow X^*\otimes Y^*\otimes K^*$ as

In this work we showed how to formulate deep residual learning in the JPEG transform domain, and that it provides a notable performance benefit in terms of processing time per image. Our method expresses convolutions as linear maps \cite{smith1994fast} and introduces a novel approximation technique for ReLu. We showed that the approximation can achieve highly performant results with little impact on classification accuracy.

Future work should focus on two main points. The first is efficiency of representation. Our linear maps take up more space than spatial domain convolutions. This makes it hard to scale the networks to datasets with large image sizes. Secondly, library support in commodity deep learning libraries for some of the features required by this algorithm are lacking. As of this writing, true sparse tensor support is missing in all of PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with these tensors being represented as coordinate lists which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries when compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}.

\ No newline at end of file

Future work should focus on two main points. The first is efficiency of representation. Our linear maps take up more space than spatial domain convolutions. This makes it hard to scale the networks to datasets with large image sizes. Secondly, library support in commodity deep learning libraries for some of the features required by this algorithm are lacking. As of this writing, true sparse tensor support is missing in all of PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with these tensors being represented as coordinate lists which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries when compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}, though we make use of the \texttt{opt\_einsum}\cite{G2018opt} tool to partially mitigate this.

@@ -13,4 +13,4 @@ The contributions of this work are as follows

\item A model conversion algorithm to apply pretrained spatial domain networks to JPEG images

\item Approximated Spatial Masking: the first general technique for application of piecewise linear functions in the transform domain

\end{enumerate}

By skipping the decompression step and by operating on the sparser compressed format, we show a notable increase in speed for training and inference.

\ No newline at end of file

By skipping the decompression step and by operating on the sparser compressed format, we show a notable increase in speed for testing and a marginal speed for training.

The ResNet architecture, consists of blocks of four basic operations: Convolution (potentially strided), ReLu, Batch Normalization, and Component-wise addition, with the blocks terminating with a global average pooling operation \cite{he2016deep} before a fully connected layer performs the final classification. Our goal will be to develop JPEG domain equivalents to these five operations. Network activations are given as a single tensor holding a batch of multi-channel images, that is $I \in N^*\otimesC^*\otimes H^*\otimes W^*$.

The ResNet architecture, consists of blocks of four basic operations: Convolution (potentially strided), ReLu, Batch Normalization, and Component-wise addition, with the blocks terminating with a global average pooling operation \cite{he2016deep} before a fully connected layer performs the final classification. Our goal will be to develop JPEG domain equivalents to these five operations. Network activations are given as a single tensor holding a batch of multi-channel images, that is $I \in N^*\otimesP^*\otimes H^*\otimes W^*$.

\subsection{Convolution}

The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}. The convolution operation in the spatial domain is a shorthand notation for a linear map $C: N^*\otimesC^*\otimes H^*\otimes W^*\rightarrow N^*\otimesC^*\otimes H^*\otimes W^*$. Since the same operation is applied to each image in the batch, we can represent $C$ with a type (3, 3) tensor. The entries of this tensor give the coefficient for a given pixel in a given input channel for each pixel in each output channel. We now develop the algorithm for representing discrete convolutional filters using this data structure.

The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}. The convolution operation in the spatial domain is a shorthand notation for a linear map $C: N^*\otimesP^*\otimes H^*\otimes W^*\rightarrow N^*\otimesP^*\otimes H^*\otimes W^*$. Since the same operation is applied to each image in the batch, we can represent $C$ with a type (3, 3) tensor. The entries of this tensor give the coefficient for a given pixel in a given input channel for each pixel in each output channel. We now develop the algorithm for representing discrete convolutional filters using this data structure.

A naive algorithm can simply copy randomly initialized convolution weights into this larger structure following the formula for a 2D discrete cross-correlation and then apply the JPEG compression and decompression tensors to the result. However, this is difficult to parallelize and incurs additional memory overhead to store the spatial domain operation. A more efficient algorithm would produce the JPEG domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Note that since $\widetilde{J}\in X \otimes Y \otimes K \otimes H^*\otimes W^*$ the last two indices of $\widetilde{J}$ form single channel image under our image model (\eg the last two indices are in $H^*\otimes W^*$). If the convolution can be applied to this "image", then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor $\widehat{J}\in N \otimes H^*\otimes W^*$

by reshaping $\widetilde{J}$ and treating this as a batch of images. Then, given randomly initialized filter weights, $K$ computing

A naive algorithm can simply copy randomly initialized convolution weights into this larger structure following the formula for convolution and then apply the JPEG compression and decompression tensors to the result. However, this is difficult to parallelize and incurs additional memory overhead to store the spatial domain operation. A more efficient algorithm would produce the JPEG domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Note that since $\widetilde{J}\in X \otimes Y \otimes K \otimes H^*\otimes W^*$ the last two indices of $\widetilde{J}$ form single channel image under our image model (\eg the last two indices are in $H^*\otimes W^*$). If the convolution can be applied to this "image", then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor $\widehat{J}\in N \otimes H^*\otimes W^*$

by reshaping $\widetilde{J}$ and treating this as a batch of images\footnote{Consider as a concrete example using $32\times32$ images. Then $\widetilde{J}$ is of shape $4\times4\times64\times32\times32$ and the described reshaping gives $\widehat{J}$ of shape $1024\times1\times32\times32$ which can be treated as a batch of size 1024 of $32\times32$ images for convolution.}. Then, given randomly initialized filter weights, $K$ computing

\begin{equation}

\label{eq:sdtojdc}

\widehat{C}^b = K \star\widehat{J}^b

\end{equation}

gives us the desired map. After reshaping $\widehat{C}$ back to the original shape of $\widetilde{J}$ to give $\widetilde{C}$, the full compressed domain operation can be expressed as

where $\star$ indicates the convolution operation and $\widehat{J}^b$ indexes $\widehat{J}$ in the batch dimension, gives us the desired map. After reshaping $\widehat{C}$ back to the original shape of $\widetilde{J}$ to give $\widetilde{C}$, the full compressed domain operation can be expressed as

where $c$ and $c'$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batch convolution operation which is available in all GPU accelerated deep learning libraries. Further, the map can be precomputed to speed up inference by avoiding repeated applications of the convolution. At training time, the gradient of the compression and decompression operators is computed and used to find the gradient of the original convolution filter with respect to the previous layers error, then the map $\Xi$ is updated using the new filter. So, while inference efficiency of the convolution operation is greatly improved, training efficiency is limited by the more complex update. We show in Section \ref{sec:expeff} that the training throughput is still higher than the equivalent spatial domain model.

where $p$ and $p'$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batch convolution operation which is available in all GPU accelerated deep learning libraries. Further, the map can be precomputed to speed up inference by avoiding repeated applications of the convolution. At training time, the gradient of the compression and decompression operators is computed and used to find the gradient of the original convolution filter with respect to the previous layers error, then the map $\Xi$ is updated using the new filter. So, while inference efficiency of the convolution operation is greatly improved, training efficiency is limited by the more complex update. We show in Section \ref{sec:expeff} that the training throughput is still higher than the equivalent spatial domain model.

\subsection{ReLu}

...

...

@@ -49,7 +50,7 @@ where $c$ and $c'$ index the input and output channels of the image respectively

\label{fig:asm}

\end{figure}

Computing ReLu in the JPEG domain is not as straightforward since ReLu is a non-linear function. In general, only linear functions can be composed with the JPEG transform. Recall that the ReLu function is given by

Computing ReLu in the JPEG domain is not as straightforward since ReLu is a non-linear function. Recall that the ReLu function is given by

\begin{equation}

r(x) = \begin{cases}

...

...

@@ -59,35 +60,31 @@ Computing ReLu in the JPEG domain is not as straightforward since ReLu is a non-

\end{equation}

We begin by defining the ReLu in the DCT domain and show how it can be trivially extended to the JPEG transform domain. To do this, we develop a general approximation technique called Approximated Spatial Masking that can apply any piecewise linear function to JPEG compressed images.

To develop this technique we must balance two seemingly competing criteria. The first is that we want to use the JPEG transform domain, since its sparsity has a computational advantage over the spatial domain. The second is that we want to compute a non-linear function which is incompatible with the JPEG transform. Can we balance these two constraints by sacrificing a third criterion? Consider an approximation of the spatial domain image that uses only a subset of the DCT coefficients. Computing this is fast, since it does not use the full set of coefficients, and gives us a spatial domain representation which is compatible with the non-linearity. What we sacrifice is accuracy. The accuracy-speed tradeoff is tunable to the problem by changing the size of the set of coefficients.

To develop this technique we must balance two seemingly competing criteria. The first is that we want to use the JPEG transform domain, since it has a computational advantage over the spatial domain. The second is that we want to compute a non-linear function which is incompatible with the JPEG transform. Can we balance these two constraints by sacrificing a third criterion? Consider an approximation of the spatial domain image that uses only a subset of the DCT coefficients. Computing this is fast, since it does not use the full set of coefficients, and gives us a spatial domain representation which is compatible with the non-linearity. What we sacrifice is accuracy. The accuracy-speed tradeoff is tunable to the problem by changing the size of the set of coefficients.

By Theorem \ref{thm:dctls} we use the lowest $m$ frequencies for an optimal reconstruction. To extend this to the 2D case we adopt the standard definition of the 2D spatial frequency as follows: given a matrix of DCT coefficients, $D$, the spatial frequency $\phi$ of a given entry $(i,j)$ is

\begin{equation}

\phi = i + j

\end{equation}

For the $8\times8$ DCT used in the JPEG algorithm, this gives 15 spatial frequencies total (numbered 0 to 14). We can then fix a maximum number of spatial frequencies $k$ and use all coefficients such that $\phi\leq k$ as our approximation.

By Theorem \ref{thm:dctls} we use the lowest $m$ frequencies for an optimal reconstruction. For the $8\times8$ DCT used in the JPEG algorithm, this gives 15 spatial frequencies total (numbered 0 to 14). We can then fix a maximum number of spatial frequencies $k$ and use all coefficients $\phi$ such that $\phi\leq k$ as our approximation.

If we now compute the piecewise linear function on this approximation directly there are two major problems. The first is that, although the form of the approximation is motivated by a least squares minimization, it is by no means guaranteed to reproduce the original values of any of the pixels. The second is that this gives the value of the function in the spatial domain, and to continue using a JPEG domain network we would need to compress the result which adds computational overhead.

If we now compute the piecewise linear function on this approximation directly there are two major problems. The first is that, although the form of the approximation is motivated by a least squares minimization, it is by no means guaranteed to reproduce the original values of \textit{any} of the pixels. The second is that this gives the value of the function in the spatial domain, and to continue using a JPEG domain network we would need to compress the result which adds computational overhead.

To solve the first problem we examine the intervals that the linear pieces fall into. The larger these intervals are, the more likely we are to have produced a value in the correct interval in our approximation. Further, since the lowest $k$ frequencies minimize the least squared error, the higher the frequency, the less likely it is to push a pixel value out of the given range. With this motivation, we can produce a binary mask for each piece of the function. The linear pieces can then be applied directly to the DCT coefficients, and then multiplied by their masks and summed to give the final result. This preserves all pixel values. The only errors would be in the mask which would result in the wrong linear piece being applied. This is the fundamental idea behind the Approximated Spatial Masking (ASM) technique.

To solve the first problem we examine the intervals that the linear pieces fall into. The larger these intervals are, the more likely we are to have produced a value in the correct interval \footnote{For example if the original pixel value was 0.7 and the approximate value is 0.5, then the approximation is in the correct interval for ReLU ($\geq0$) but its value is incorrect.}in our approximation. Further, since the lowest $k$ frequencies minimize the least squared error, the higher the frequency, the less likely it is to push a pixel value out of the correct range. With this motivation, we can produce a binary mask for each piece of the function. The linear pieces can then be applied directly to the DCT coefficients, and then multiplied by their masks and summed to give the final result. This preserves all pixel values. The only errors would be in the mask which would result in the wrong linear piece being applied. This is the fundamental idea behind the Approximated Spatial Masking (ASM) technique.

The final problem is that we now have a mask in the spatial domain, but the original image is in the DCT domain. There is a well known algorithm for pixelwise multiplication of two DCT images \cite{smith1993algorithms}, but it would require the mask to also be in the DCT domain. Fortunately, there is a straightforward solution that is a result of the multilinear analysis given in Section \ref{sec:backjlm}. Consider the bilinear map

that takes a DCT block, $F$, and a mask $M$, and produces the masked DCT block by pixelwise multiplication. Our task will be to derive the form of $H$. We proceed by construction. The steps of such an algorithm naively would be

that takes a DCT block, $F$, and a spatial mask $G$, and produces the masked DCT block by pixelwise multiplication. Our task will be to derive the form of $H$. We proceed by construction. The steps of such an algorithm naively would be

\begin{enumerate}

\item Take the inverse DCT of F: $I_{ij}= D^{\alpha\beta}_{ij}F_{\alpha\beta}$

\item Pixelwise multiply $I'_{ij}= I_{ij}M_{ij}$

\item Take the DCT of $I'$,$F'_{\alpha'\beta'}= D^{ij}_{\alpha'\beta'}I'_{ij}$.

\item Take the inverse DCT of F: $I_{mn}= D^{\alpha\beta}_{mn}F_{\alpha\beta}$

\item Pixelwise multiply:$I'_{mn}= I_{mn}G_{mn}$

\item Take the DCT of $I'$:$F'_{\alpha'\beta'}= D^{mn}_{\alpha'\beta'}I'_{mn}$.

\end{enumerate}

Since these three steps are linear or bilinear maps, they can be combined

We call $H$ the Harmonic Mixing Tensor since it gives all the spatial frequency permutations that we need. $H$ can be precomputed to speed up computation.

...

...

@@ -110,7 +107,7 @@ This new function can be computed efficiently from fewer spatial frequencies wit

To extend this method from the DCT domain to the JPEG transform domain, the rest of the missing JPEG tensor can simply be applied as follows:

Since the operation is the same for each block, and there are no interactions between blocks, the blocking tensor $B$ can be skipped.

...

...

@@ -144,7 +141,7 @@ To apply $\gamma$ and the variance, we use scalar multiplication. This follows d

\end{equation}

For scalar addition to apply $\beta$, note that since the (0,0) coefficient is the mean, and adding $\beta$ to every pixel in the image is equivalent to raising the mean by $\beta$, we can simply add $\beta$ to the (0,0) coefficient.

To extend this to JPEG is simple. For an $8\times8$ block, the proportionality constant for the (0,0) coefficient becomes $\frac{1}{2\sqrt{2\times8}}=\frac{1}{8}$. For this reason, many quantization matrices use $8$ as the (0,0) quantization coefficient. This means that the 0th block entry for a block does not need any proportionality constant, it stores exactly the mean. So for adding $\beta$, we can simply set the 0th position to $\beta$ without performing additional operations. The other operations are unaffected.

To extend this to JPEG is simple. The proportionality constant for the (0,0) coefficient is $\frac{1}{2\sqrt{2\times8}}=\frac{1}{8}$. For this reason, many quantization matrices use $8$ as the (0,0) quantization coefficient. This means that the 0th block entry for a block does not need any proportionality constant, it stores exactly the mean. So for adding $\beta$, we can simply set the 0th position to $\beta$ without performing additional operations. The other operations are unaffected.

\subsection{Component-wise Addition}

...

...

@@ -163,10 +160,10 @@ meaning that we can simply perform a component-wise addition of the JPEG compres

\label{fig:gap}

\end{figure}

Global average pooling also has a simple formulation in JPEG domain. Recall from the discussion of Batch Normalization (Section \ref{sec:jdrbn}) that the (0,0) DCT coefficient is proportional to the mean of the image, and that the 0th element of the block after quantization is equal to the mean of the block. Then this element can be extracted channel-wise from each block and the global average pooling result is the channel-wise mean of these elements.

Global average pooling also has a simple formulation in JPEG domain. Recall from the discussion of Batch Normalization (Section \ref{sec:jdrbn}) that the 0th element of the block after quantization is equal to the mean of the block. Then this element can be extracted channel-wise from each block and the global average pooling result is the channel-wise mean of these elements.

Furthermore, our network architecture for classification will always reduce the input images to a single block, which can then have its mean extracted and reported as the global average pooling result directly. Note the efficiency of this process: rather than channel-wise averaging in a spatial domain network, we simply have an unconditional read operation, one per channel. This is illustrated in Figure \ref{fig:gap}.

\subsection{Model Conversion}

The previous sections described how to build the ResNet component operations in the JPEG domain. While this implies straightforward algorithms for both inference and learning on JPEGs, we can also convert pre-trained models for JPEG inference. This allows any model that was trained on spatial domain images to benefit from our algorithms at inference time. By following the steps in this section for pre-trained convolution weights and for pre-learned batch normalization parameters, these existing models will work as expected in the JPEG domain. The only caveat is that the ReLu approximation accuracy can effect the final performance of the network since the weights were not trained to cope with it. This is tested in Section \ref{sec:exprla}.

\ No newline at end of file

The previous sections described how to build the ResNet component operations in the JPEG domain. While this implies straightforward algorithms for both inference and learning on JPEGs, we can also convert pre-trained models for JPEG inference. This allows any model that was trained on spatial domain images to benefit from our algorithms at inference time. Consider Equation \ref{eq:sdtojdc}. In this equation, $K$ holds the randomly initialized convolution filter. By instead using pretrained spatial weights for $K$, the convoltion will work as expected on JPEGs. Similarly, pretrained $\alpha, \beta, \mu, \sigma$ for batch normalization can be provided. By doing this for each layer in a pretrained network, the network will operate on JPEGs. The only caveat is that the ReLu approximation accuracy can effect the final performance of the network since the weights were not trained to cope with it. This is tested in Section \ref{sec:exprla}.