We briefly review the JPEG compression/decompression algorithm \cite{wallace1992jpeg} and introduce the tensor method that we use to formulate our networks \cite{smith1994fast}.

We briefly review the JPEG compression/decompression algorithm \cite{wallace1992jpeg} and introduce the multilinear method that we use to formulate our networks \cite{smith1994fast}.

\subsection{JPEG Compression}

...

...

@@ -17,7 +17,7 @@ The JPEG compression algorithm is defined as the following steps.

This process is repeated independently for each image plane. In most cases, the original image is transformed from the RGB color space to YUV and chroma subsampling is applied since the human visual system is less sensitive to small color changes than to small brightness changes \cite{winkler2001vision}. The decompression algorithm is the inverse process. Note that the rounding step (step 5) must be skipped during decompression; this is the step in JPEG compression where information is lost, and it is the cause of artifacting in decompressed JPEG images.

The magnitude of the information loss can be tuned using the quantization coefficients. If a larger coefficient is applied in step 4, then the result will be closer to 0 which increases its likelihood of being dropped altogether. In this way, the JPEG transform forces sparsity on the representation, which why it compresses the image data so well. This is coupled with the tendency of the DCT to push the magnitude of the coefficients into the upper left corner (the DC coefficient and the lowest spatial frequency) to result in high spatial frequencies being dropped. Not only do these high spatial frequencies contribute less response to the human visual system, but they are also the optimal set to drop for a least squares reconstruction of the original image:

The magnitude of the information loss can be tuned using the quantization coefficients. If a larger coefficient is applied in step 4, then the result will be closer to 0, which increases its likelihood of being dropped altogether during rounding. In this way, the JPEG transform forces sparsity on the representation, which is why it compresses the image data so well. This is coupled with the tendency of the DCT to push the magnitude of the coefficients into the upper left corner (the DC coefficient and the lowest spatial frequency) to result in high spatial frequencies being dropped. Not only do these high spatial frequencies contribute less response to the human visual system, but they are also the optimal set to drop for a least squares reconstruction of the original image:

\begin{theorem}[DCT Least Squares Approximation Theorem]

Given a set of $N$ samples of a signal $X = \{x_0, \ldots, x_N\}$, let $Y = \{y_0, \ldots, y_N\}$ be the DCT coefficients of $X$. Then, for any $1\leq m \leq N$, the approximation

...

...

@@ -40,7 +40,7 @@ Theorem \ref{thm:dctls} states that a reconstruction using the $m$ lowest spatia

A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast}, is that steps 1--4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, cannot be undone, and Step 6, the entropy coding, is nonlinear and therefore must be undone. We define the JPEG Transform Domain as the output of Step 4 in the JPEG encoding algorithm. Inputs to the algorithms described here will be JPEGs after reversing the entropy coding.
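The linearity claim can be checked numerically on a single $8 \times 8$ block. The following is a minimal sketch rather than the implementation used in this work: the quantization table \texttt{Q} is a made-up example, the helper names are hypothetical, and level shifting, color conversion, and entropy coding are left out. It checks that the DCT followed by division by the quantization coefficients is linear, that the two steps compose into one tensor applied with \texttt{einsum}, and that rounding breaks additivity.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix D, so that D @ x is the 1D DCT of x."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

D = dct_matrix()
# Hypothetical quantization table (an assumption, not a standard JPEG table).
Q = 1.0 + np.add.outer(np.arange(8), np.arange(8))

def linear_steps(block):
    """2D DCT followed by element-wise division by Q, with no rounding."""
    return (D @ block @ D.T) / Q

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
y = rng.normal(size=(8, 8))
a, b = 2.0, -3.0

# Linearity: f(a x + b y) == a f(x) + b f(y).
linear_ok = np.allclose(linear_steps(a * x + b * y),
                        a * linear_steps(x) + b * linear_steps(y))

# The same steps composed into a single tensor J[alpha, beta, i, j].
J = np.einsum('ai,bj->abij', D, D) / Q[:, :, None, None]
composed_ok = np.allclose(np.einsum('abij,ij->ab', J, x), linear_steps(x))

# Rounding is not additive: build a block whose scaled coefficients are all 0.6,
# so round(0.6) + round(0.6) = 2 while round(1.2) = 1.
z = D.T @ (np.full((8, 8), 0.6) * Q) @ D
rounding_additive = np.array_equal(np.round(linear_steps(z + z)),
                                   2 * np.round(linear_steps(z)))

print(linear_ok, composed_ok, rounding_additive)
```

Because only the rounding step fails the additivity check, it is the one step that cannot be folded into the composed map.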

Formally, we model a single plane image as the type (0, 2) tensor $I \in H^*\otimes W^*$ where $H$ and $W$ are vector spaces and $*$ denotes the dual space. We always the standard orthonormal basis for these vector spaces, this is important as it allows the free raising and lowering of indices without the use of a metric tensor.

Formally, we model a single plane image as the type (0, 2) tensor $I \in H^*\otimes W^*$ where $H$ and $W$ are vector spaces and $*$ denotes the dual space. We always use the standard orthonormal basis for these vector spaces; this is important as it allows the free raising and lowering of indices without the use of a metric tensor.

We define the JPEG transform $J \in H \otimes W \otimes X^*\otimes Y^*\otimes K^*$. $J$ represents a linear map $J: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes K^*$ which is computed as (in Einstein notation)

In this work we showed how to formulate deep residual learning in the JPEG transform domain, and we showed that it provides a notable performance benefit in terms of processing time for each image. Our method uses a model of convolutions as a linear map \cite{smith1994fast} and introduces a novel approximation technique for ReLu which, to our knowledge, is the first rigorous attempt at defining a non-linear function in the JPEG transform domain. We showed that the approximation can achieve highly performant results with little impact on the overall classification accuracy.

In this work we showed how to formulate deep residual learning in the JPEG transform domain, and we showed that it provides a notable performance benefit in terms of processing time for each image. Our method uses a model of convolutions as a linear map \cite{smith1994fast} and introduces a novel approximation technique for ReLu which, to our knowledge, is the first attempt at defining a non-linear function in the JPEG transform domain. We showed that the approximation can achieve highly performant results with little impact on the overall classification accuracy.

Future work should focus on two main points. The first is efficiency of representation. Our linear maps take up more space, especially when stored in dense tensor data structures, than spatial domain convolutions. This makes it hard to scale the networks to datasets with large image sizes. Secondly, library support in commodity deep learning libraries for some of the features required by this algorithm is lacking. As of this writing, true sparse tensor support is missing in all of PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with these tensors being represented as coordinate lists which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries when compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}.

@@ -15,7 +15,7 @@ Since we are concerned with reproducing the inference results of spatial domain

\subsection{Model Conversion}

For this first experiment, we provide empirical evidence that the JPEG formulation presented in this paper is mathematically equivalent to spatial domain network. To show this, we train 100 spatial domain models on each of three datasets and give their mean testing accuracies. When then use model conversion to transform the pretrained models to the JPEG domain and give the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks and the exact (15 spatial frequency) ReLu formulation is used. The result of this test is given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.

For this first experiment, we provide empirical evidence that the JPEG formulation presented in this paper is mathematically equivalent to the spatial domain network. To show this, we train 100 spatial domain models on each of the three datasets and give their mean testing accuracies. We then use model conversion to transform the pretrained models to the JPEG domain and give the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks and the exact (15 spatial frequency) ReLu formulation is used. The result of this test is given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.

@@ -4,7 +4,7 @@ The popularization of deep learning since the 2012 AlexNet \cite{krizhevsky2012i

This problem has been addressed many times in the literature. Batch normalization \cite{ioffe2015batch} is ubiquitous in modern networks to accelerate their convergence. Residual learning \cite{he2016deep} allows for much deeper networks to learn effective mappings without overfitting. Techniques such as pruning and weight compression \cite{han2015deep} are becoming more commonplace. As problems become even larger and more complex, these techniques are increasingly being relied upon for efficient training and inference.

We tackle this problem at the level of the image representation. The JPEG image compression algorithm is the most widespread image file format. Traditionally, the first step in using JPEGs for machine learning is to decompress them. We propose to skip this step and instead reformulate the ResNet architecture to perform its operations directly on compressed images. The goal is to produce a new network that is mathematically equivalent to the spatial domain network, but which operates on compressed images by composing the compression transform into the network weights, which can be done because they are both linear maps. Because of the ReLu function is non-linear, we develop an approximation technique for it. This is a general method and, to our knowledge, is the first attempt at formulating a piecewise linear function in the transform domain presented in the literature.

We tackle this problem at the level of the image representation. The JPEG image compression algorithm is the most widespread image file format. Traditionally, the first step in using JPEGs for machine learning is to decompress them. We propose to skip this step and instead reformulate the ResNet architecture to perform its operations directly on compressed images. The goal is to produce a new network that is mathematically equivalent to the spatial domain network, but which operates on compressed images by composing the compression transform into the network weights, which can be done because they are both linear maps. Because the ReLu function is non-linear, we develop an approximation technique for it. This is a general method and, to our knowledge, is the first attempt at formulating a piecewise linear function in the transform domain presented in the literature.

@@ -6,16 +6,16 @@ The ResNet architecture, generally, consists of blocks of four basic operations:

The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}. The convolution operation in the spatial domain is a shorthand notation for a linear map $C: N^*\otimes C^*\otimes H^*\otimes W^*\rightarrow N^*\otimes C^*\otimes H^*\otimes W^*$. Since the same operation is applied to each image in the batch, we can represent $C$ with a type (3, 3) tensor. The entries of this tensor give the coefficient for a given pixel in a given input channel for each pixel in each output channel. This notation can describe any multichannel linear pixel manipulation. We now develop the algorithm for representing discrete convolutional filters using this data structure.

A naive algorithm can simply copy randomly initialized convolution weights into this larger structure following the formula for a 2D discrete cross-correlation and then apply the JPEG compression and decompression tensors to the result, but this is difficult to parallelize and incurs additional memory overhead to store the spatial domain operation. A more efficient algorithm would produce the JPEG domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Note that since $\widetilde{J}\in X \otimes Y \otimes K \otimes H^*\otimes W^*$ the last two indices of $\widetilde{J}$ form single channel image under our image model (\eg the last two indices are in $H^*\otimes W^*$). If the convolution can be applied to this "image", then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor $\widehat{J}\in N \otimes H^*\otimes W^*$

by reshaping $\widetilde{J}$ and treating this as a batch of images, then, given the initialized convolution filter, $K$ computing

A naive algorithm can simply copy randomly initialized convolution weights into this larger structure following the formula for a 2D discrete cross-correlation and then apply the JPEG compression and decompression tensors to the result. However, this is difficult to parallelize and incurs additional memory overhead to store the spatial domain operation. A more efficient algorithm would produce the JPEG domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Note that since $\widetilde{J}\in X \otimes Y \otimes K \otimes H^*\otimes W^*$ the last two indices of $\widetilde{J}$ form a single channel image under our image model (\eg the last two indices are in $H^*\otimes W^*$). If the convolution can be applied to this ``image'', then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor $\widehat{J}\in N \otimes H^*\otimes W^*$

by reshaping $\widetilde{J}$ and treating this as a batch of images. Then, given randomly initialized filter weights $K$, computing

\begin{equation}

\widehat{C}^b = \widehat{J}^b \star K

\widehat{C}^b = K \star\widehat{J}^b

\end{equation}

gives us the desired map. After reshaping $\widehat{C}$ back to the original shape of $\widetilde{J}$ to give $\widetilde{C}$, the full compressed domain operation can be expressed as

where $m$ and $n$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batch convolution operation which is available in all GPU accelerated deep learning libraries. The algorithm is shown in the supplementary material.

where $m$ and $n$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batch convolution operation which is available in all GPU accelerated deep learning libraries. Further, the map can be precomputed to speed up inference by avoiding repeated applications of the convolution.
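The idea of convolving the decompression tensor can be illustrated on a single $8 \times 8$ block. The sketch below is a simplified illustration under stated assumptions, not the algorithm from the supplementary material: it uses one block, one input and one output channel, no quantization, and zero padding at the block border, and the names \texttt{cross\_correlate}, \texttt{J\_tilde}, \texttt{C\_tilde}, and \texttt{Xi} are hypothetical. It treats the basis images of the decompression tensor as a batch, cross-correlates each with the filter, and checks that the resulting map agrees with decompressing first and convolving afterwards.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix D, so that D @ x is the 1D DCT of x."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

def cross_correlate(img, kernel):
    """Zero-padded 'same' 2D cross-correlation with an odd-sized kernel."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img)
    for u in range(kh):
        for v in range(kw):
            out += kernel[u, v] * padded[u:u + img.shape[0], v:v + img.shape[1]]
    return out

D = dct_matrix()
rng = np.random.default_rng(0)
K = rng.normal(size=(3, 3))   # randomly initialized filter weights
F = rng.normal(size=(8, 8))   # DCT coefficients of one block

# Decompression tensor for one block: J_tilde[a, b] is the basis image of coefficient (a, b).
J_tilde = np.einsum('ai,bj->abij', D, D)
# Compression tensor (forward DCT), indexed [c, d, i, j].
J = np.einsum('ci,dj->cdij', D, D)

# Reshape to a batch of 64 images and filter every basis image once.
J_hat = J_tilde.reshape(64, 8, 8)
C_hat = np.stack([cross_correlate(img, K) for img in J_hat])
C_tilde = C_hat.reshape(8, 8, 8, 8)

# Decompress-then-convolve versus the fused map.
direct = cross_correlate(np.einsum('abij,ab->ij', J_tilde, F), K)
fused = np.einsum('abij,ab->ij', C_tilde, F)

# Compose with compression to get a coefficients-to-coefficients map.
Xi = np.einsum('abij,cdij->abcd', C_tilde, J)
F_out = np.einsum('abcd,ab->cd', Xi, F)

print(np.allclose(direct, fused), np.allclose(F_out, D @ direct @ D.T))
```

Since \texttt{Xi} depends only on the filter weights, it can be computed once and reused for every input block, which is the precomputation benefit noted above.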

\subsection{ReLu}

...

...

@@ -29,7 +29,7 @@ Computing ReLu in the JPEG domain is not as straightforward since ReLu is a non-

\end{equation}

Although this is one of the simplest piecewise linear functions to study, it still exhibits highly non-linear behavior. We begin by defining the ReLu in the DCT domain and show how it can be trivially extended to the JPEG transform domain. To do this, we develop a general approximation technique called Approximated Spatial Masking that can apply any piecewise linear function to JPEG compressed images.

To develop this technique we must balance two seemingly competing criteria. The first is that we want to use the JPEG transform domain, since its sparsity has a computational advantage over the spatial domain. The second is that we want to compute a non-linear function which is incompatible with the JPEG transform. Can we balance these two constrains by sacrificing a third criterion? Consider an approximation of the spatial domain image that uses only a subset of the DCT coefficients. Computing this is fast, since it does not use the full set of coefficients, and gives us a spatial domain representation which is compatible with the non-linearity. What we sacrifice is accuracy. The accuracy-speed tradeoff is tunable to the problem by changing the size of the set of coefficients.

To develop this technique we must balance two seemingly competing criteria. The first is that we want to use the JPEG transform domain, since its sparsity has a computational advantage over the spatial domain. The second is that we want to compute a non-linear function which is incompatible with the JPEG transform. Can we balance these two constraints by sacrificing a third criterion? Consider an approximation of the spatial domain image that uses only a subset of the DCT coefficients. Computing this is fast, since it does not use the full set of coefficients, and gives us a spatial domain representation which is compatible with the non-linearity. What we sacrifice is accuracy. The accuracy-speed tradeoff is tunable to the problem by changing the size of the set of coefficients.
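The accuracy-speed tradeoff can be made concrete with a small numerical sketch. It is an illustration only: it assumes the spatial frequency of entry $(i, j)$ is $i + j$, which gives 15 distinct frequencies for an $8 \times 8$ block, and the helper \texttt{approx\_spatial} is a hypothetical name. The sketch reconstructs a block from only the lowest frequencies and reports both the reconstruction error and how often the sign pattern needed by ReLu agrees with the exact one.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix D, so that D @ x is the 1D DCT of x."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

D = dct_matrix()
# Assumed spatial frequency of each coefficient: i + j, ranging over 0..14.
freq = np.add.outer(np.arange(8), np.arange(8))

def approx_spatial(F, m):
    """Reconstruct pixels using only coefficients whose assumed frequency is below m."""
    return D.T @ np.where(freq < m, F, 0.0) @ D

rng = np.random.default_rng(0)
pixels = rng.normal(size=(8, 8))
F = D @ pixels @ D.T          # DCT coefficients of the block
exact = D.T @ F @ D           # full reconstruction

errors = []
agreement = []
for m in range(1, 16):
    approx = approx_spatial(F, m)
    # Least squares reconstruction error for this number of frequencies.
    errors.append(np.linalg.norm(approx - exact))
    # Fraction of pixels where the ReLu mask (pixel > 0) matches the exact mask.
    agreement.append(np.mean((approx > 0) == (exact > 0)))

for m in (1, 6, 15):
    print(m, errors[m - 1], agreement[m - 1])
```

With all 15 frequencies the reconstruction, and therefore the mask, is exact; using fewer frequencies trades mask accuracy for less computation.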

By Theorem \ref{thm:dctls} we use the lowest $m$ frequencies for an optimal reconstruction. To extend this to the 2D case we adopt the standard definition of the 2D spatial frequency as follows: given a matrix of DCT coefficients, $D$, the spatial frequency $\phi$ of a given entry $(i,j)$ is

\begin{equation}

...

...

@@ -51,7 +51,7 @@ that takes a DCT block, $F$, and a mask $M$, and produces the masked DCT block b

\item Pixelwise multiply $I'_{ij}= I_{ij}M_{ij}$

\item Take the DCT of $I'$, $F'_{\alpha'\beta'}= D^{ij}_{\alpha'\beta'}I'_{ij}$.

\end{enumerate}

Since these three steps all have linear maps, they can be combined

Since these three steps are linear or bilinear maps, they can be combined

@@ -126,4 +126,4 @@ Furthermore, our network architecture for classification will always reduce the

\subsection{Model Conversion}

The previous sections described how to build the ResNet component operations in the JPEG domain. While this implies straightforward algorithms for both inference and learning on JPEGs, we can also convert pre-trained models for JPEG inference. This allows any model that was trained on spatial domain images to benefit from our algorithms at inference time. By following the steps in these section for pre-trained convolution weights and for pre-learned batch normalization parameters, these existing models will work as expected in the JPEG domain. The only caveat is that the ReLu approximation accuracy can effect the final performance of the network since the weights were not trained to cope with it. This is tested in Section \ref{sec:exprla} and varies on a case-by-case basis, sometimes even improving the accuracy over the exact ReLu (though this is not guarnanteed or even predictable).

\ No newline at end of file

The previous sections described how to build the ResNet component operations in the JPEG domain. While this implies straightforward algorithms for both inference and learning on JPEGs, we can also convert pre-trained models for JPEG inference. This allows any model that was trained on spatial domain images to benefit from our algorithms at inference time. By following the steps in this section for pre-trained convolution weights and for pre-learned batch normalization parameters, these existing models will work as expected in the JPEG domain. The only caveat is that the ReLu approximation accuracy can affect the final performance of the network since the weights were not trained to cope with it. This is tested in Section \ref{sec:exprla}.