We introduce a general method of performing Residual Network inference and learning in the JPEG transform domain that allows the network to consume compressed images as input. Our formulation leverages the linearity of the JPEG transform to redefine convolution and batch normalization with a tune-able numerical approximation for ReLu. The result is mathematically equivalent to the spatial domain network up to the ReLu approximation accuracy. A formulation for image classification and a model conversion algorithm for spatial domain networks are given as examples of the method. We show that the sparsity of the JPEG format allows for faster processing of the images with little to no penalty in the network accuracy.

We introduce a general method of performing Residual Network inference and learning in the JPEG transform domain that allows the network to consume compressed images as input. Our formulation leverages the linearity of the JPEG transform to redefine convolution and batch normalization with a tunable numerical approximation for ReLu. The result is mathematically equivalent to the spatial domain network up to the ReLu approximation accuracy. A formulation for image classification and a model conversion algorithm for spatial domain networks are given as examples of the method. We show that the sparsity of the JPEG format allows for faster processing of images with little to no penalty in the network accuracy.

@@ -15,9 +15,9 @@ The JPEG compression algorithm is defined as the following steps.

\item Run-length code and entropy code the vectors

\end{enumerate}

This process is repeated independently for each image plane. In most cases, the original image is transformed from the RGB color space to YUV and chroma subsampling is applied since the human visual system is less sensitive to small color changes than to small brightness changes \cite{winkler2001vision}. The decompression algorithm is the inverse process. Note that the rounding step (step 5) must be skipped during decompression, this is the step in JPEG compression where information is lost and is the cause of artifacting in decompressed JPEG images.

This process is repeated independently for each image plane. In most cases, the original image is transformed from the RGB color space to YUV and chroma subsampling is applied since the human visual system is less sensitive to small color changes than to small brightness changes \cite{winkler2001vision}. The decompression algorithm is the inverse process. Note that the rounding step (step 5) must be skipped during decompression. This is the step in JPEG compression where information is lost and is the cause of artifacts in decompressed JPEG images.

The magnitude of the information loss can be tuned using the quantization coefficients. If a larger coefficient is applied in step 4, then the result will be closer to 0 which increases its likelihood of being dropped altogether during rounding. In this way, the JPEG transform forces sparsity on the representation, which why it compresses the image data so well. This is coupled with the tendency of the DCT to push the magnitude of the coefficients into the upper left corner (the DC coefficient and the lowest spatial frequency) to result in high spatial frequencies being dropped. Not only do these high spatial frequencies contribute less response to the human visual system, but they are also the optimal set to drop for a least squares reconstruction of the original image:

The magnitude of the information loss can be tuned using the quantization coefficients. If a larger coefficient is applied in step 4, then the result will be closer to 0 which increases its likelihood of being dropped altogether during rounding. In this way, the JPEG transform forces sparsity on the representation, which is why it compresses image data so well. This is coupled with the tendency of the DCT to push the magnitude of the coefficients into the upper left corner (the DC coefficient and the lowest spatial frequency) resulting in high spatial frequencies being dropped. Not only do these high spatial frequencies contribute less response to the human visual system, but they are also the optimal set to drop for a least squares reconstruction of the original image:

\begin{theorem}[DCT Least Squares Approximation Theorem]

Given a set of $N$ samples of a signal $X =\{x_0, ... x_N\}$, let $Y =\{y_0, ... y_N\}$ be the DCT coefficients of $X$. Then, for any $1\leq m \leq N$, the approximation

...

...

@@ -38,15 +38,15 @@ Theorem \ref{thm:dctls} states that a reconstruction using the $m$ lowest spatia

\label{sec:backjlm}

A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast} is that steps 1-4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, cannot be undone and Step 6, the entropy coding, is nonlinear and therefore must be undone. We define the JPEG Transform Domain as the output of Step 4 in the JPEG encoding algorithm. Inputs the the algorithms described here will be JPEGs after reversing the entropy coding.

A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast}, is that steps 1-4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, is irreversible and ignored by decompression. Step 6, the entropy coding, is nonlinear and its form is computed from the data directly, so it is difficult to work with this representation. We define the JPEG Transform Domain as the output of Step 4 in the JPEG encoding algorithm. Inputs to the algorithms described here will be JPEGs after reversing the entropy coding.

Formally, we model a single plane image as the type (0, 2) tensor $I \in H^*\otimes W^*$ where $H$ and $W$ are vector spaces and $*$ denotes the dual space. We always use the standard orthonormal basis for these vector spaces, this is important as it allows the free raising and lowering of indices without the use of a metric tensor.

Formally, we model a single plane image as the type (0, 2) tensor $I \in H^*\otimes W^*$ where $H$ and $W$ are vector spaces and $*$ denotes the dual space. We always use the standard orthonormal basis for these vector spaces which allows the free raising and lowering of indices without the use of a metric tensor.

We define the JPEG transform $J \in H \otimes W \otimes X^*\otimes Y^*\otimes K^*$. $J$ represents a linear map $J: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes K^*$ which is computed as (in Einstein notation)

\begin{equation}

I'_{xyk} = J^{hw}_{xyk}I_{hw}

\end{equation}

and we say that $I'$ is the representation of $I$ in the JPEG transform domain. In the above equation, the indices $h,w$ give the pixel position, the indices $x,y$ give the block position, and the index $k$ gives the offset into the block.

and we say that $I'$ is the representation of $I$ in the JPEG transform domain. The indices $h,w$ give pixel position, $x,y$ give block position, and $k$ gives the offset into the block.

The form of $J$ is constructed from the JPEG compression steps listed in the previous section. Let the linear map $B: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes I^*\otimes J^*$ be defined as

then $B$ can be used to break the image represented by $I$ into blocks of a given size such that the first two indices $x,y$ index the block position and the last two indices $i,j$ index the offset into the block.

Next. let the linear map $D: I^*\otimes J^*\rightarrow A^*\otimes B^*$ be defined as

Next, let the linear map $D: I^*\otimes J^*\rightarrow A^*\otimes B^*$ be defined as

@@ -70,9 +70,9 @@ then $Z$ creates the zigzag ordered vectors. Finally, let $S: \Gamma^* \rightarr

\begin{equation}

S^\gamma_k = \frac{1}{q_k}

\end{equation}

where $q_k$ is a quantization coefficient, $S$ can be used to scale the vector entries by their quantization coefficients.

where $q_k$ is a quantization coefficient. This scales the vector entries by the quantization coefficients.

With linear maps for each step of the JPEG transform, we can then apply them to each other to create the $J$ tensor that was described at the beginning of this section

With linear maps for each step of the JPEG transform, we can then create the $J$ tensor described at the beginning of this section

noting that, for all tensors other than $\widetilde{S}$, we have freely raised and lowered indices without the use of a metric tensor since we consider only the standard orthonormal basis, as stated earlier.

Next consider a linear map $C: H^*\otimes W^*\rightarrow H^*\otimes W^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we would first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: X^*\otimes Y^*\otimes K^*\rightarrow X^*\otimes Y^*\otimes K^*$ as

Next consider a linear map $C: H^*\otimes W^*\rightarrow H^*\otimes W^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: X^*\otimes Y^*\otimes K^*\rightarrow X^*\otimes Y^*\otimes K^*$ as

which applies $C$ in the JPEG transform domain. There are two important points to note about $\Xi$. The first is that, although it encapsulates decompression, applying $C$ and compressing, it uses far fewer operations than doing these processes separately since the coefficients are multiplied out. The second is that it is mathematically equivalent to performing $C$ on the decompressed image and compressing the result, it is not an approximation.

\ No newline at end of file

which applies $C$ in the JPEG transform domain. There are two important points to note about $\Xi$. The first is that, although it encapsulates decompression, applying $C$ and compressing, it uses far fewer operations than doing these processes separately since the coefficients are multiplied out. The second is that it is mathematically equivalent to performing $C$ on the decompressed image and compressing the result. It is not an approximation.
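The equivalence claimed for $\Xi$ can be checked numerically. The following is a minimal NumPy sketch, not the implementation used in this work: it builds $J$ and $\widetilde{J}$ for a small single plane image, composes them with a random linear map $C$, and compares the result against the decompress, apply, compress route. The function names (\texttt{jpeg\_tensors}, \texttt{transform\_domain\_map}), the $16\times16$ image size, and the quantization table \texttt{q} are all assumptions made for illustration, and rounding is omitted as in the definition of the JPEG transform domain.

```python
import numpy as np


def dct_matrix(n=8):
    """Orthonormal DCT-II matrix: row a is frequency a, column i is pixel offset i."""
    i = np.arange(n)
    d = np.sqrt(2.0 / n) * np.cos((2 * i[None, :] + 1) * i[:, None] * np.pi / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)
    return d


def zigzag_order(n=8):
    """Frequency pairs (a, b) listed along anti-diagonals in alternating direction."""
    pairs = [(a, b) for a in range(n) for b in range(n)]
    return sorted(pairs, key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]))


def jpeg_tensors(h, w, q, n=8):
    """Return J with shape (X, Y, K, H, W) and J_tilde with shape (H, W, X, Y, K)."""
    q = np.asarray(q, dtype=float)
    d = dct_matrix(n)
    J = np.zeros((h // n, w // n, n * n, h, w))
    for bx in range(h // n):
        for by in range(w // n):
            for k, (a, b) in enumerate(zigzag_order(n)):
                # Blocking, 2D DCT, zigzag ordering and scaling by 1/q_k in one entry.
                J[bx, by, k, bx * n:(bx + 1) * n, by * n:(by + 1) * n] = np.outer(d[a], d[b]) / q[k]
    # The DCT is orthonormal, so decompression only has to undo the scaling.
    J_tilde = np.einsum('xykhw,k->hwxyk', J, q ** 2)
    return J, J_tilde


def transform_domain_map(J, J_tilde, C):
    """Compose compression, C (indexed h_out, w_out, h_in, w_in) and decompression."""
    return np.einsum('abchw,hwij,ijxyk->abcxyk', J, C, J_tilde, optimize=True)


rng = np.random.default_rng(0)
h = w = 16
q = 1.0 + np.arange(64)  # hypothetical quantization table, not a standard JPEG table
J, J_tilde = jpeg_tensors(h, w, q)
C = rng.standard_normal((h, w, h, w))  # arbitrary linear pixel manipulation
image = rng.standard_normal((h, w))

coeffs = np.einsum('xykhw,hw->xyk', J, image)  # I' = J I
xi = transform_domain_map(J, J_tilde, C)
direct = np.einsum('abcxyk,xyk->abc', xi, coeffs)

# Reference route: decompress, apply C in the spatial domain, compress again.
decompressed = np.einsum('hwxyk,xyk->hw', J_tilde, coeffs)
roundtrip = np.einsum('xykhw,hwij,ij->xyk', J, C, decompressed, optimize=True)

print(np.allclose(direct, roundtrip))
```

Because \texttt{xi} is computed once, applying it to further inputs costs a single contraction, which is the saving described above. The dense tensors used here are only practical for very small images.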

In this work we showed how to formulate deep residual learning in the JPEG transform domain, and we showed that it provides a notable performance benefit in terms of processing time for each image. Our method uses a model of convolutions as a linear map \cite{smith1994fast} and introduces a novel approximation technique for ReLu which, to our knowledge, is the first attempt at defining a non-linear function in the JPEG transform domain. We showed that the approximation can achieve highly performant results with little impact on the overall classification accuracy.

In this work we showed how to formulate deep residual learning in the JPEG transform domain, and that it provides a notable performance benefit in terms of processing time per image. Our method expresses convolutions as linear maps \cite{smith1994fast} and introduces a novel approximation technique for ReLu. We showed that the approximation can achieve highly performant results with little impact on classification accuracy.

Future work should focus on two main points. The first is efficiency of representation. Our linear maps take up more space, especially when stored in dense tensor data structures, than spatial domain convolutions. This makes it hard to scale the networks to datasets with large image sizes. Secondly, library support in commodity deep learning libraries for some of the features required by this algorithm are lacking. As of this writing, true sparse tensor support is missing in all of PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with these tensors being represented as coordinate lists which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries when compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}.

\ No newline at end of file

Future work should focus on two main points. The first is efficiency of representation. Our linear maps take up more space than spatial domain convolutions. This makes it hard to scale the networks to datasets with large image sizes. Second, support in commodity deep learning libraries for some of the features required by this algorithm is lacking. As of this writing, true sparse tensor support is missing in all of PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with these tensors being represented as coordinate lists, which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries when compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}.

We give experimental evidence for the efficacy of our method, starting with a brief discussion of the architectures we use and the datasets for experimentation. We use model conversion as a sanity check, ensuring that the JPEG model with exact ReLu matches exactly the testing accuracy of a spatial domain model. Next we show how the ReLu approximation accuracy effects the overall network performance. We conclude by showing the training and testing time advantage of our method.

We give experimental evidence for the efficacy of our method, starting with a discussion of the architectures we use and the datasets. We use model conversion as a sanity check, ensuring that the JPEG model with exact ReLu matches exactly the testing accuracy of a spatial domain model. Next we show how the ReLu approximation accuracy affects overall network performance. We conclude by showing the training and testing time advantage of our method.

\subsection{Network Architectures and Datasets}

Since we are concerned with reproducing the inference results of spatial domain networks, we choose the MNIST \cite{lecun1998mnist} and CIFAR-10/100 \cite{krizhevsky2009learning} datasets since they are easy to work with. The MNIST images are padded to $32\times32$before being used to ensure an even number of JPEG blocks. Our network architecture is similarly simple is shown in Figure \ref{fig:na}. The classification network consists of three residual blocks with the final two performing downsampling so that the final feature map consists of a single JPEG block.

Since we are concerned with reproducing the inference results of spatial domain networks, we choose the MNIST \cite{lecun1998mnist} and CIFAR-10/100 \cite{krizhevsky2009learning} datasets since they are easy to work with. The MNIST images are padded to $32\times32$ to ensure an even number of JPEG blocks. Our network architecture is shown in Figure \ref{fig:na}. The classification network consists of three residual blocks with the final two performing downsampling so that the final feature map consists of a single JPEG block.

\caption{Simple network architecture. $T$ indicates the batch size.}

\label{fig:na}

\end{figure}

\subsection{Model Conversion}

For this first experiment, we provide empirical evidence that the JPEG formulation presented in this paper is mathematically equivalent to spatial domain network. To show this, we train 100 spatial domain models on each of the three datasets and give their mean testing accuracies. When then use model conversion to transform the pretrained models to the JPEG domain and give the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks and the exact (15 spatial frequency) ReLu formulation is used. The result of this test is given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.

For this first experiment, we show empirically that the JPEG formulation is mathematically equivalent to the spatial domain network. To show this, we train 100 spatial domain models on each of the three datasets and give their mean testing accuracies. We then use model conversion to transform the pretrained models to the JPEG domain and give the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks and the exact (15 spatial frequency) ReLu formulation is used. The result of this test is given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.

\begin{table}[h]

\centering

...

...

@@ -34,10 +34,11 @@ For this first experiment, we provide empirical evidence that the JPEG formulati

\subsection{ReLu Approximation Accuracy}

\label{sec:exprla}

Next, we examine the impact of the ReLu approximation. We start by examining the raw error on individual $8\times8$ blocks. For this test, we take random $4\times4$ pixel blocks in the range $[-1, 1]$ and scale them to $8\times8$ using a box filter. Fully random $8\times8$ blocks do not accurately represent the statistics of real images and are known to be a worst case for the DCT transform. The $4\times4$ blocks allow for a large random sample size while still approximating real image statistics. We take 10 million such blocks and compute the average RMSE of our Approximated Spatial Masking (ASM) technique and compare it to computing ReLu directly on the approximation (APX). This test is repeated for all one to fifteen spatial frequencies. The result, shown in Figure \ref{fig:rba} shows that our ASM method gives a better approximation (lower RMSE) through the range of spatial frequencies.

Next, we examine the impact of the ReLu approximation. We start by examining the raw error on individual $8\times8$ blocks. For this test, we take random $4\times4$ pixel blocks in the range $[-1, 1]$ and scale them to $8\times8$ using a box filter. Fully random $8\times8$ blocks do not accurately represent the statistics of real images and are known to be a worst case for the DCT. The $4\times4$ blocks allow for a large random sample size while still approximating real image statistics. We take 10 million blocks and compute the average RMSE of our ASM technique and compare it to computing ReLu directly on the approximation (APX). This test is repeated for each of one to fifteen spatial frequencies. The result, shown in Figure \ref{fig:rba}, shows that our ASM method gives a better approximation (lower RMSE) throughout the range of spatial frequencies.
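The test blocks described above could be generated roughly as follows. This is a sketch under stated assumptions rather than the code used for the experiment: the helper names \texttt{random\_test\_blocks} and \texttt{mean\_block\_rmse} are hypothetical, box filter upscaling is taken to mean replicating each pixel into a $2\times2$ cell, and "average RMSE" is taken to mean the RMSE of each block averaged over all blocks.

```python
import numpy as np


def random_test_blocks(count, rng):
    """Random 4x4 blocks in [-1, 1], upscaled to 8x8 by pixel replication."""
    small = rng.uniform(-1.0, 1.0, size=(count, 4, 4))
    # Assumption: box filter upscaling copies each pixel into a 2x2 cell.
    return small.repeat(2, axis=1).repeat(2, axis=2)


def mean_block_rmse(estimate, reference):
    """RMSE computed per 8x8 block, then averaged over the batch of blocks."""
    per_block = np.sqrt(np.mean((estimate - reference) ** 2, axis=(1, 2)))
    return float(per_block.mean())


rng = np.random.default_rng(0)
blocks = random_test_blocks(1000, rng)  # the experiment in the text uses 10 million
exact = np.maximum(blocks, 0.0)         # spatial domain ReLu used as the reference
```

An approximate ReLu would then be evaluated by decompressing its output and passing it to \texttt{mean\_block\_rmse} together with \texttt{exact}.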

\begin{figure*}

\centering

\caption{ReLu accuracy results.}

\begin{subfigure}{0.33\textwidth}

\captionsetup{width=.8\linewidth}

\centering

...

...

@@ -61,11 +62,11 @@ Next, we examine the impact of the ReLu approximation. We start by examining the

\end{subfigure}

\end{figure*}

This test provides a strong motivation for the ASM method, so we move on to testing it in the model conversion setting. For this test, we again train 100 spatial domain models and then perform model conversion with the ReLu layers ranging from 1-15 spatial frequencies. We again compare our ASM method with the APX method. The result is given in Figure \ref{fig:ra}, again the ASM method outperforms the APX method.

This test provides a strong motivation for the ASM method, so we move on to testing it in the model conversion setting. For this test, we again train 100 spatial domain models and then perform model conversion with the ReLu layers ranging from 1-15 spatial frequencies. We again compare our ASM method with the APX method. The result is given in Figure \ref{fig:ra}. Again, the ASM method outperforms the APX method.

As a final test, we show that if the models are trained in the JPEG domain, the CNN weights will actually learn to cope with the approximation and fewer spatial frequencies are required to get good accuracy. We again compare ASM to APX in this setting. The result shown in Figure \ref{fig:rt} shows that the ASM method again outperforms the APX method and that the network weights have learned to cope with the approximation.

As a final test, we show that if the models are trained in the JPEG domain, the CNN weights will actually learn to cope with the approximation and fewer spatial frequencies are required for good accuracy. The result in Figure \ref{fig:rt} shows that the ASM method again outperforms the APX method and that the network weights have learned to cope with the approximation.

\subsection{Efficiency of Training and Testing}

...

...

@@ -76,4 +77,4 @@ As a final test, we show that if the models are trained in the JPEG domain, the

\label{fig:rt}

\end{figure}

Finally, we show the throughput for training and testing. For this we test on all three datasets by training and testing a spatial model and training and testing a JPEG model and measuring the time taken. This is then converted to an average throughput measurement. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results, shown in Figure \ref{fig:rt}, show that the JPEG model is able to outperform the spatial model in all cases, but that the performance on training is still limited. This is likely because of the more complex gradient created by the convolution and ReLu operations. At inference time, however, performance is greatly improved over the spatial model.

\ No newline at end of file

Finally, we show the throughput for training and testing. For this we test on all three datasets by training and testing a spatial model and training and testing a JPEG model and measuring the time taken. This is then converted to an average throughput measurement. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results, shown in Figure \ref{fig:rt}, show that the JPEG model is able to outperform the spatial model in all cases, but that the performance on training is still limited. This is likely caused by the more complex gradient created by the convolution and ReLu operations. At inference time, however, performance is greatly improved over the spatial model.

@@ -4,7 +4,7 @@ The popularization of deep learning since the 2012 AlexNet \cite{krizhevsky2012i

This problem has been addressed many times in the literature. Batch normalization \cite{ioffe2015batch} is ubiquitous in modern networks to accelerate their convergence. Residual learning \cite{he2016deep} allows for much deeper networks to learn effective mappings without overfitting. Techniques such as pruning and weight compression \cite{han2015deep} are becoming more commonplace. As problems become even larger and more complex, these techniques are increasingly being relied upon for efficient training and inference.

We tackle this problem at the level of the image representation. The JPEG image compression algorithm is the most widespread image file format. Traditionally, the first step in using JPEGs for machine learning is to decompress them. We propose to skip this step and instead reformulate the ResNet architecture to perform its operations directly on compressed images. The goal is to produce a new network that is mathematically equivalent to the spatial domain network, but which operates on compressed images by composing the compression transform into the network weights, which can be done because they are both linear maps. Because the ReLu function is non-linear, we develop an approximation technique for it. This is a general method and, to our knowledge, is the first attempt at formulating a piecewise linear function in the transform domain presented in the literature.

We tackle this problem at the level of the image representation. JPEG is the most widespread image file format. Traditionally, the first step in using JPEGs for machine learning is to decompress them. We propose to skip this step and instead reformulate the ResNet architecture to perform its operations directly on compressed images. The goal is to produce a new network that is mathematically equivalent to the spatial domain network, but which operates on compressed images by incorporating the compression transform into the network weights, which can be done because they are both linear maps. Because the ReLu function is non-linear, we develop an approximation technique for it. This is a general method and, to our knowledge, is the first attempt at formulating a piecewise linear function in the transform domain.

The ResNet architecture, generally, consists of blocks of four basic operations: Convolution (potentially strided), ReLu, Batch Normalization, and Component-wise addition, with the blocks terminating with a global average pooling operation \cite{he2016deep} before a fully connected layer performs the final classification. Our goal will be to develop JPEG domain equivalents to these five operations. Network activations are given as a single tensor holding a batch of multi-channel images, that is $I \in N^*\otimes C^*\otimes H^*\otimes W^*$.

The ResNet architecture consists of blocks of four basic operations: Convolution (potentially strided), ReLu, Batch Normalization, and Component-wise addition, with the blocks terminating with a global average pooling operation \cite{he2016deep} before a fully connected layer performs the final classification. Our goal will be to develop JPEG domain equivalents to these five operations. Network activations are given as a single tensor holding a batch of multi-channel images, that is $I \in N^*\otimes C^*\otimes H^*\otimes W^*$.

\subsection{Convolution}

The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}. The convolution operation in the spatial domain is a shorthand notation for a linear map $C: N^*\otimes C^*\otimes H^*\otimes W^*\rightarrow N^*\otimes C^*\otimes H^*\otimes W^*$. Since the same operation is applied to each image in the batch, we can represent $C$ with a type (3, 3) tensor. The entries of this tensor give the coefficient for a given pixel in a given input channel for each pixel in each output channel. This notation can describe any multichannel linear pixel manipulation. We now develop the algorithm for representing discrete convolutional filters using this data structure.

The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}. The convolution operation in the spatial domain is a shorthand notation for a linear map $C: N^*\otimes C^*\otimes H^*\otimes W^*\rightarrow N^*\otimes C^*\otimes H^*\otimes W^*$. Since the same operation is applied to each image in the batch, we can represent $C$ with a type (3, 3) tensor. The entries of this tensor give the coefficient for a given pixel in a given input channel for each pixel in each output channel. We now develop the algorithm for representing discrete convolutional filters using this data structure.

A naive algorithm can simply copy randomly initialized convolution weights into this larger structure following the formula for a 2D discrete cross-correlation and then apply the JPEG compression and decompression tensors to the result. However, this is difficult to parallelize and incurs additional memory overhead to store the spatial domain operation. A more efficient algorithm would produce the JPEG domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Note that since $\widetilde{J}\in X \otimes Y \otimes K \otimes H^*\otimes W^*$, the last two indices of $\widetilde{J}$ form a single channel image under our image model (\ie the last two indices are in $H^*\otimes W^*$). If the convolution can be applied to this ``image'', then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor $\widehat{J}\in N \otimes H^*\otimes W^*$

by reshaping $\widetilde{J}$ and treating this as a batch of images, then, given randomly initialized filter weights, $K$ computing

by reshaping $\widetilde{J}$ and treating this as a batch of images. Then, given randomly initialized filter weights $K$, computing

\begin{equation}

\widehat{C}^b = K \star\widehat{J}^b

\end{equation}

...

...

@@ -15,10 +15,17 @@ gives us the desired map. After reshaping $\widehat{C}$ back to the original sha

where $m$ and $n$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batch convolution operation which is available in all GPU accelerated deep learning libraries. Further, the map can be precomputed to speed up inference by avoiding repeated applications of the convolution.

where $c$ and $c'$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batch convolution operation which is available in all GPU accelerated deep learning libraries. Further, the map can be precomputed to speed up inference by avoiding repeated applications of the convolution.

\caption{Example of ASM ReLu on an $8\times8$ block. Green pixels are negative, red pixels are positive, and blue pixels are zero. Left: original image. Middle: ReLu. Right: ReLu approximation using ASM.}

\label{fig:asm}

\end{figure}

Computing ReLu in the JPEG domain is not as straightforward since ReLu is a non-linear function. In general, only linear functions can be composed with the JPEG transform. Recall that the ReLu function is given by

\begin{equation}

...

...

@@ -27,7 +34,7 @@ Computing ReLu in the JPEG domain is not as straightforward since ReLu is a non-

0 & x \leq 0

\end{cases}

\end{equation}

Although this is one of the simplest piecewise linear functions to study, it still exhibits highly non-linear behavior. We begin by defining the ReLu in the DCT domain and show how it can be trivially extended to the JPEG transform domain. To do this, we develop a general approximation technique called Approximated Spatial Masking that can apply any piecewise linear function to JPEG compressed images.

We begin by defining the ReLu in the DCT domain and show how it can be trivially extended to the JPEG transform domain. To do this, we develop a general approximation technique called Approximated Spatial Masking that can apply any piecewise linear function to JPEG compressed images.

To develop this technique we must balance two seemingly competing criteria. The first is that we want to use the JPEG transform domain, since its sparsity has a computational advantage over the spatial domain. The second is that we want to compute a non-linear function which is incompatible with the JPEG transform. Can we balance these two constraints by sacrificing a third criterion? Consider an approximation of the spatial domain image that uses only a subset of the DCT coefficients. Computing this is fast, since it does not use the full set of coefficients, and gives us a spatial domain representation which is compatible with the non-linearity. What we sacrifice is accuracy. The accuracy-speed tradeoff is tunable to the problem by changing the size of the set of coefficients.

...

...

If we now compute the piecewise linear function on this approximation directly, there are two major problems. The first is that, although the form of the approximation is motivated by a least squares minimization, it is by no means guaranteed to reproduce the original values of any of the pixels. The second is that this gives the value of the function in the spatial domain, and to continue using a JPEG domain network we would need to compress the result, which adds computational overhead.

To solve the first problem, we examine the ranges that the linear pieces fall into. The larger these ranges are, the more likely the approximation is to produce a value in the correct range. Further, since the lowest $k$ frequencies minimize the least squares error, the higher the frequency, the less likely it is to push a pixel value out of the given range. With this motivation, we can produce a binary mask for each piece of the function. The linear pieces can then be applied directly to the DCT coefficients, multiplied by their masks, and summed to give the final result. This preserves all pixel values. The only errors are in the mask, and these result in the wrong linear piece being applied. This is the fundamental idea behind the Approximated Spatial Masking (ASM) technique.
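
The masking idea can be sketched in symbols. The notation here is assumed for illustration and is not taken from the original formulation: let $x$ denote a pixel value, $\tilde{x}$ its approximation from the lowest $k$ spatial frequencies, and let the piecewise linear function $f$ have linear pieces $f_i$ on ranges $R_i$. One way to write the exact function and its ASM counterpart is
\begin{equation}
f(x) = \sum_i \mathbf{1}[x \in R_i]\, f_i(x), \qquad f_{\mathrm{ASM}}(x) = \sum_i \mathbf{1}[\tilde{x} \in R_i]\, f_i(x)
\end{equation}
Under these assumptions, the two expressions agree whenever $x$ and $\tilde{x}$ fall in the same range, so the approximation error enters only through the mask $\mathbf{1}[\tilde{x} \in R_i]$ and never through the values $f_i(x)$.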

The final problem is that we now have a mask in the spatial domain, but the original image is in the DCT domain. There is a well-known algorithm for pixelwise multiplication of two DCT images \cite{smith1993algorithms}, but it would require the mask to also be in the DCT domain. Fortunately, there is a straightforward solution that is a result of the multilinear analysis given in Section \ref{sec:backjlm}. Consider the bilinear map

We call $H$ the Harmonic Mixing Tensor since it gives all the spatial frequency permutations that we need. $H$ can be precomputed to speed up computation.

To use this technique to compute the ReLu function, consider this alternative formulation

\newcommand{\nnm}{\mathrm{nnm}}

\begin{equation}
\nnm(x) = \begin{cases}
1 & x > 0 \\
0 & x \leq 0
\end{cases}
\end{equation}

We call the function $\nnm(x)$ the nonnegative mask of $x$. This is our binary mask for ASM. We express the ReLu function as

\begin{equation}

r(x) = \nnm(x)x

\end{equation}

This new function can be computed efficiently from fewer spatial frequencies with much higher accuracy since only the sign of the original function needs to be correct. Figure \ref{fig:asm} gives an example of this algorithm on a random block, and pseudocode is given in the supplementary material. To extend this method from the DCT domain to the JPEG transform domain, the rest of the missing JPEG tensor can simply be applied to $H$.
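
To illustrate why only the sign needs to be correct, consider a small worked example with hypothetical values. The numbers, and the symbol $\tilde{x}$ for the low-frequency approximation of $x$, are assumptions made for illustration. Suppose a pixel has true value $x = 3.0$ and its approximation is $\tilde{x} = 0.5$. Then
\begin{equation}
r(\tilde{x}) = 0.5, \qquad \mathrm{nnm}(\tilde{x})\,x = 1 \cdot 3.0 = 3.0 = r(x)
\end{equation}
so applying ReLu directly to the approximation gives an error of $2.5$, while the masked form is exact. The masked form is wrong only when the sign is misestimated. For $x = -0.2$ and $\tilde{x} = 0.1$, it returns $\mathrm{nnm}(\tilde{x})\,x = -0.2$ instead of $r(x) = 0$, which in this example happens for a pixel close to zero.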

\subsection{Batch Normalization}

\label{sec:jdrbn}

Batch normalization \cite{ioffe2015batch} has a simple and efficient formulation in the JPEG domain. Recall that batch normalization defines two learnable parameters: $\gamma$ and $\beta$. A given feature map $I$ is first centered and then normalized over the batch, then scaled by $\gamma$ and translated by $\beta$. The full formula is given by

In other words, the (0,0) DCT coefficient is proportional to the mean of the block. Further, since the DCT basis is orthonormal, we can be sure that the remaining DCT coefficients do not depend on the mean. This means that to center the image we need only set the (0,0) DCT coefficient to 0. This, of course, is unaffected by the other steps of the JPEG transform. For tracking the running mean, we simply read this value. Note that this is a much more efficient operation than the mean computation in the spatial domain.

Next, to get the variance, we use the following theorem:

\begin{theorem}[The DCT Mean-Variance Theorem]

Given a set of samples of a signal $X$ such that $\e[X]=0$, let $Y$ be the DCT coefficients of $X$. Then

\begin{equation}

...

...

\end{equation}

For scalar addition to apply $\beta$, note that since the (0,0) coefficient is the mean, and adding $\beta$ to every pixel in the image is equivalent to raising the mean by $\beta$, we can simply add $\beta$ to the (0,0) coefficient.

Extending this to JPEG is simple. For an $8\times8$ block, the proportionality constant for the (0,0) coefficient becomes $\frac{1}{2\sqrt{2\times8}}=\frac{1}{8}$. For this reason, many quantization matrices use $8$ as the (0,0) quantization coefficient. This means that the 0th entry of a block does not need any proportionality constant: it stores exactly the mean. Since the 0th entry of a centered block is 0, adding $\beta$ amounts to setting the 0th position to $\beta$ without performing additional operations. The other operations are unaffected.
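
The constant can be checked against the standard $8\times8$ JPEG forward DCT. As a short derivation, let $x_{ij}$ denote the pixels of a block, $\mu$ their mean, and $Y_{00}$ the (0,0) DCT coefficient (these symbols are introduced here for illustration). Both cosine factors equal $1$ at zero frequency, so
\begin{equation}
Y_{00} = \frac{1}{4}\cdot\frac{1}{\sqrt{2}}\cdot\frac{1}{\sqrt{2}}\sum_{i=0}^{7}\sum_{j=0}^{7} x_{ij} = \frac{1}{8}\sum_{i=0}^{7}\sum_{j=0}^{7} x_{ij} = \frac{64\mu}{8} = 8\mu
\end{equation}
Dividing by a (0,0) quantization coefficient of $8$ then gives $Y_{00}/8 = \mu$, ignoring rounding, which is why the 0th entry can be read as the block mean.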

\subsection{Component-wise Addition}

...

...

\caption{Global average pooling. The 0th coefficient of each block can be used directly with no computation.}

\label{fig:gap}

\end{figure}

Global average pooling also has a simple formulation in the JPEG domain. Recall from the discussion of Batch Normalization (Section \ref{sec:jdrbn}) that the (0,0) DCT coefficient is proportional to the mean of the image, and that the 0th element of the block after quantization is equal to the mean of the block. This element can then be extracted channel-wise from each block, and the global average pooling result is the channel-wise mean of these elements.

Furthermore, our network architecture for classification will always reduce the input images to a single block, which can then have its mean extracted and reported as the global average pooling result directly. Note the efficiency of this process: rather than channel-wise averaging in a spatial domain network, we simply have an unconditional read operation, one per channel. This is illustrated in Figure \ref{fig:gap}.
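
As a sketch of this computation, let $\hat{Y}^{(b,c)}_{0}$ denote the 0th entry of block $b$ in channel $c$, and let $B$ be the number of blocks per channel (notation assumed here for illustration). Every block covers the same number of pixels, so the mean of the block means equals the channel mean, which gives
\begin{equation}
\mathrm{GAP}_c = \frac{1}{B}\sum_{b=1}^{B} \hat{Y}^{(b,c)}_{0}, \qquad \mathrm{GAP}_c = \hat{Y}^{(1,c)}_{0} \quad \mathrm{when}\ B = 1
\end{equation}
The second form is the single-block case described above, where the result is read directly with no arithmetic.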

We review prior work separated into three categories: compressed domain operations, machine learning in the compressed domain, and deep learning in the compressed domain.

\subsection{Compressed Domain Operations}

The expression of common operations in the compressed domain was an extremely active area of study in the late 80s and early 90s, motivated by the lack of computing power to quickly decompress, process, and recompress images and video. For JPEG, Smith and Rowe \cite{smith1993algorithms} formulate fast JPEG compatible algorithms for performing scalar and pixelwise addition and multiplication. This was extended by Shen and Sethi \cite{shen1995inner} to general blockwise operations and by Smith \cite{smith1994fast} to arbitrary linear maps. Natarajan and Vasudev \cite{natarajan1995fast} additionally formulate an extremely fast approximate algorithm for scaling JPEG images. For MPEG, Chang \etal\cite{chang1992video} introduce the basic algorithms for manipulating compressed video. Chang and Messerschmitt \cite{chang1993new} give a fast algorithm for decoding motion compensation before DCT which allows arbitrary video compositing operations to be performed.

\subsection{Machine Learning in the Compressed Domain}

...

...

\subsection{Deep Learning in the Compressed Domain}

Because deep networks are non-linear maps, deep learning has received limited study in the compressed domain. Ghosh and Chellappa \cite{ghosh2016deep} use a DCT as part of their network's first layer and show that it speeds up convergence for training. Wu \etal\cite{wu2018compressed} formulate a deep network for video action recognition that uses a separate network for i-frames and p-frames. Since the p-frame network functions on raw motion vectors and error residuals, it is considered compressed domain processing, although it works in the spatial domain and not the quantized frequency domain as in this work. Wu \etal show a significant efficiency advantage compared to traditional 3D convolution architectures, which they attribute to the p-frame data being a minimal representation of the video motion. Gueguen \etal\cite{gueguen_2018_ICLR} formulate a traditional ResNet that operates on DCT coefficients directly instead of pixels, \ie the DCT coefficients are fed to the network. They show that learning converges faster on this input, further motivating the JPEG representation.