@@ -38,15 +38,15 @@ Theorem \ref{thm:dctls} states that a reconstruction using the $m$ lowest spatia

\label{sec:backjlm}

A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast} is that steps 1-4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, is irreversible and ignored by decompression. Step 6, the entropy coding, is a nonlinear and its form is computed from the data directly, so it is difficult to work with this representation. We define the JPEG Transform Domain as the output of Step 4 in the JPEG encoding algorithm. Inputs to the algorithms described here will be JPEGs after reversing the entropy coding.

A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast}, is that Steps 1--4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, is irreversible and ignored by decompression. Step 6, the entropy coding, is a nonlinear map whose form is computed from the data directly, so it is difficult to work with this representation. We define the JPEG Transform Domain as the output of Step 4 in the JPEG encoding algorithm. This is a standard convention of compressed domain processing. Inputs to the algorithms described here will be JPEGs after reversing the entropy coding.

Formally, we model a single plane image as the type (0, 2) tensor $I \in H^*\otimes W^*$ where $H$ and $W$ are vector spaces and $*$ denotes the dual space. We always use the standard orthonormal basis for these vector spaces which allows the free raising and lowering of indices without the use of a metric tensor.

We define the JPEG transform $J \in H \otimes W \otimes X^*\otimes Y^*\otimes K^*$. $J$ represents a linear map $J: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes K^*$which is computed as (in Einstein notation)

We define the JPEG transform as the type (2, 3) tensor $J \in H \otimes W \otimes X^*\otimes Y^*\otimes K^*$. $J$ represents a linear map $J: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes K^*$ and is computed as (in Einstein notation)

\begin{equation}

I'_{xyk} = J^{hw}_{xyk}I_{hw}

\end{equation}

and we say that $I'$ is the representation of $I$ in the JPEG transform domain. The indices $h,w$ give pixel position, $x,y$ give block position, and $k$ gives the offset into the block.

We say that $I'$ is the representation of $I$ in the JPEG transform domain. The indices $h,w$ give pixel position, $x,y$ give block position, and $k$ gives the offset into the block.
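
As a concrete illustration of the shapes involved (a sketch, assuming $8\times8$ blocks so that $\dim X = \dim H/8$, $\dim Y = \dim W/8$, and $\dim K = 64$), consider a $32\times32$ single plane image. Then
\begin{equation}
\dim H = \dim W = 32, \qquad \dim X = \dim Y = 4, \qquad \dim K = 64
\end{equation}
so $I'$ has $4\cdot4\cdot64 = 1024$ components, the same number as $I$, while a dense $J$ has $32\cdot32\cdot4\cdot4\cdot64 = 2^{20}$ components.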

The form of $J$ is constructed from the JPEG compression steps listed in the previous section. Let the linear map $B: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes I^*\otimes J^*$ be defined as

In this work we showed how to formulate deep residual learning in the JPEG transform domain, and that it provides a notable performance benefit in terms of processing time per image. Our method uses expresses convolutions as linear maps \cite{smith1994fast} and introduces a novel approximation technique for ReLu. We showed that the approximation can achieve highly performant results with little impact on classification accuracy.

In this work we showed how to formulate deep residual learning in the JPEG transform domain, and that it provides a notable performance benefit in terms of processing time per image. Our method expresses convolutions as linear maps \cite{smith1994fast} and introduces a novel approximation technique for ReLu. We showed that the approximation can achieve highly performant results with little impact on classification accuracy.

Future work should focus on two main points. The first is efficiency of representation. Our linear maps take up more space than spatial domain convolutions, which makes it hard to scale the networks to datasets with large image sizes. The second is library support: commodity deep learning libraries lack some of the features required by this algorithm. As of this writing, true sparse tensor support is missing in all of PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with these tensors being represented as coordinate lists which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries when compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}.

@@ -4,11 +4,11 @@ We give experimental evidence for the efficacy of our method, starting with a di

\subsection{Network Architectures and Datasets}

Since we are concerned with reproducing the inference results of spatial domain networks, we choose the MNIST \cite{lecun1998mnist} and CIFAR-10/100 \cite{krizhevsky2009learning} datasets since they are easy to work with. The MNIST images are padded to $32\times32$ to ensure an even number of JPEG blocks. Our network architecture is shown in Figure \ref{fig:na}. The classification network consists of three residual blocks with the final two performing downsampling so that the final feature map consists of a single JPEG block.

Because we are concerned with reproducing the inference results of spatial domain networks, we choose the MNIST \cite{lecun1998mnist} and CIFAR-10/100 \cite{krizhevsky2009learning} datasets, which are easy to work with. The MNIST images are padded to $32\times32$ to ensure an even number of JPEG blocks. Our network architecture is shown in Figure \ref{fig:na}. The classification network consists of three residual blocks with the final two performing downsampling so that the final feature map consists of a single JPEG block. The goal of this architecture is not to achieve high accuracy, but rather to serve as a point of comparison for the spatial and JPEG algorithms.

\caption{Throughput. The JPEG model has a more complex gradient which limits speed improvement during training. Inference, however, sees considerably higher throughput.}

\label{fig:rt}

\label{fig:tp}

\end{figure}

Finally, we show the throughput for training and testing. For this we test on all three datasets by training and testing a spatial model and training and testing a JPEG model and measuring the time taken. This is then converted to an average throughput measurement. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results, shown in Figure \ref{fig:rt}, show that the JPEG model is able to outperform the spatial model in all cases, but that the performance on training is still limited. This is likely caused by the more complex gradient created by the convolution and ReLu operations. At inference time, however, performance is greatly improved over the spatial model.

\ No newline at end of file

Finally, we show the throughput for training and testing. For each of the three datasets, we train and test a spatial model and a JPEG model and measure the time taken. This is then converted to an average throughput measurement. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results, shown in Figure \ref{fig:tp}, show that the JPEG model is able to outperform the spatial model in all cases, but that the performance on training is still limited. This is caused by the more complex gradient created by the convolution and ReLu operations. At inference time, however, performance is greatly improved over the spatial model.

where $c$ and $c'$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batch convolution operation which is available in all GPU accelerated deep learning libraries. Further, the map can be precomputed to speed up inference by avoiding repeated applications of the convolution.

where $c$ and $c'$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batch convolution operation which is available in all GPU accelerated deep learning libraries. Further, the map can be precomputed to speed up inference by avoiding repeated applications of the convolution. At training time, the gradient of the compression and decompression operators is computed and used to find the gradient of the original convolution filter with respect to the previous layer's error; the map $\Xi$ is then updated using the new filter. So, while inference efficiency of the convolution operation is greatly improved, training efficiency is limited by the more complex update. We show in Section \ref{sec:expeff} that the training throughput is still higher than the equivalent spatial domain model.

\caption{Example of ASM ReLu on an $8\times8$ block. Green pixels are negative, red pixels are positive, and blue pixels are zero. Left: original image. Middle: ReLu. Right: ReLu approximation using ASM.}

\caption{Example of ASM ReLu on an $8\times8$ block. Green pixels are negative, red pixels are positive, and blue pixels are zero. Left: original image. Middle: ReLu. Right: ReLu approximation using ASM with 6 spatial frequencies.}

\label{fig:asm}

\end{figure}

...

...

@@ -46,7 +46,7 @@ For the $8 \times 8$ DCT used in the JPEG algorithm, this gives 15 spatial frequ

If we now compute the piecewise linear function on this approximation directly there are two major problems. The first is that, although the form of the approximation is motivated by a least squares minimization, it is by no means guaranteed to reproduce the original values of any of the pixels. The second is that this gives the value of the function in the spatial domain, and to continue using a JPEG domain network we would need to compress the result which adds computational overhead.

To solve the first problem we examine ranges that the linear pieces fall into. The larger these ranges are, the more likely we are to have produced a value in the correct range in our approximation. Further, since the lowest $k$ frequencies minimize the least squared error, the higher the frequency, the less likely it is to push a pixel value out of the given range. With this motivation, we can produce a binary mask for each piece of the function. The linear pieces can then be applied directly to the DCT coefficients, and then multiplied by the mask and summed to give the final result. This preserves all pixel values. The only errors would be in the mask which would result in the wrong linear piece being applied. This is the fundamental idea behind the Approximated Spatial Masking (ASM) technique.

To solve the first problem we examine the intervals that the linear pieces fall into. The larger these intervals are, the more likely we are to have produced a value in the correct interval in our approximation. Further, since the lowest $k$ frequencies minimize the least squared error, the higher the frequency, the less likely it is to push a pixel value out of the given interval. With this motivation, we can produce a binary mask for each piece of the function. The linear pieces can then be applied directly to the DCT coefficients, and then multiplied by their masks and summed to give the final result. This preserves all pixel values. The only errors would be in the mask, which would result in the wrong linear piece being applied. This is the fundamental idea behind the Approximated Spatial Masking (ASM) technique.

The final problem is that we now have a mask in the spatial domain, but the original image is in the DCT domain. There is a well known algorithm for pixelwise multiplication of two DCT images \cite{smith1993algorithms}, but it would require the mask to also be in the DCT domain. Fortunately, there is a straightforward solution that is a result of the multilinear analysis given in Section \ref{sec:backjlm}. Consider the bilinear map

\begin{equation}

...

...

@@ -56,7 +56,7 @@ that takes a DCT block, $F$, and a mask $M$, and produces the masked DCT block b

\begin{enumerate}

\item Take the inverse DCT of F: $I_{ij}= D^{\alpha\beta}_{ij}F_{\alpha\beta}$

\item Pixelwise multiply $I'_{ij}= I_{ij}M_{ij}$

\item Take the DCT of $I$, $F'_{\alpha'\beta'}= D^{ij}_{\alpha'\beta'}I'_{ij}$.

\item Take the DCT of $I'$, $F'_{\alpha'\beta'}= D^{ij}_{\alpha'\beta'}I'_{ij}$.

\end{enumerate}

Since these three steps are linear or bilinear maps, they can be combined

\begin{equation}

...

...

@@ -83,12 +83,18 @@ We call the function $\nnm(x)$ the nonnegative mask of $x$. This is our binary m

\begin{equation}

r(x) = \nnm(x)x

\end{equation}

This new function can be computed efficiently from fewer spatial frequencies with much higher accuracy since only the sign of the original function needs to be correct. Figure \ref{fig:asm} gives an example of this algorithm on a random block, and pseudocode is given in the supplementary material. To extend this method from the DCT domain to the JPEG transform domain, the rest of the missing JPEG tensor can simply be applied to $H$.

This new function can be computed efficiently from fewer spatial frequencies with much higher accuracy since only the sign of the original function needs to be correct. Figure \ref{fig:asm} gives an example of this algorithm on a random block, and pseudocode is given in the supplementary material. To extend this method from the DCT domain to the JPEG transform domain, the rest of the missing JPEG tensor can simply be applied as follows:

Since the operation is the same for each block, and there are no interactions between blocks, the blocking tensor $B$ can be skipped.
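
As a small worked illustration of the masking idea (the values are hypothetical, and we assume $\nnm$ returns $1$ for nonnegative entries and $0$ otherwise), take three pixel values $x = (-2, 1, 3)$ and a low-frequency approximation $\tilde{x} = (-1.2, 0.3, 3.5)$. The mask is computed from the approximation but applied to the exact values:
\begin{equation}
\nnm(\tilde{x}) = (0, 1, 1), \qquad \nnm(\tilde{x})\,x = (0, 1, 3)
\end{equation}
Although $\tilde{x}$ differs from $x$ in every entry, the result equals the exact ReLu of $x$ because the signs agree. An error would arise only where the approximation has the wrong sign.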

\subsection{Batch Normalization}

\label{sec:jdrbn}

Batch normalization \cite{ioffe2015batch} has a simple and efficient formulation in the JPEG domain. Recall that batch normalization defines two learnable parameters: $\gamma$ and $\beta$. A given feature map $I$ is first centered and then normalized over the batch, then scaled by $\gamma$ and translated by $\beta$. The full is

Batch normalization \cite{ioffe2015batch} has a simple and efficient formulation in the JPEG domain. Recall that batch normalization defines two learnable parameters: $\gamma$ and $\beta$. A given feature map $I$ is first centered and then normalized over the batch, then scaled by $\gamma$ and translated by $\beta$. The full formula is

@@ -115,7 +121,7 @@ To apply $\gamma$ and the variance, we use scalar multiplication. This follows d

\end{equation}

For scalar addition to apply $\beta$, note that since the (0,0) coefficient is the mean, and adding $\beta$ to every pixel in the image is equivalent to raising the mean by $\beta$, we can simply add $\beta$ to the (0,0) coefficient.

To extend this to JPEG is simple. For an $8\times8$ block, the proportionality constant for the (0,0) coefficient becomes $\frac{1}{2\sqrt{2\times8}}=\frac{1}{8}$. For this reason, many quantization matrices use $8$ as the (0,0) quantization coefficient. This means that the 0th block entry for a block does not need any proportionality constant, it stores exactly the mean. This means for adding $\beta$, we can simply set the 0th position to $\beta$ without performing additional operations. The other operations are unaffected.

Extending this to JPEG is simple. For an $8\times8$ block, the proportionality constant for the (0,0) coefficient becomes $\frac{1}{2\sqrt{2\times8}}=\frac{1}{8}$. For this reason, many quantization matrices use $8$ as the (0,0) quantization coefficient. This means that the 0th entry of a block does not need any proportionality constant: it stores exactly the mean. So for adding $\beta$, we can simply set the 0th position to $\beta$ without performing additional operations. The other operations are unaffected.
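
To make the constant explicit, the following short check assumes the orthonormal $8\times8$ DCT normalization implied by the $\frac{1}{8}$ above. It writes $\mu$ for the block mean and $q_{00}$ for the (0,0) quantization coefficient; both symbols are notation introduced here for illustration.
\begin{equation}
F_{00} = \frac{1}{8}\sum_{i=0}^{7}\sum_{j=0}^{7} I_{ij} = \frac{64\mu}{8} = 8\mu, \qquad \frac{F_{00}}{q_{00}} = \frac{8\mu}{8} = \mu \quad \text{when } q_{00} = 8
\end{equation}
For example, a block whose pixels are all $3$ has $F_{00} = 24$ and a stored 0th entry of $3$.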

@@ -12,4 +12,4 @@ Compressed domain machine learning grew out of the work in the mid 90s. Arman \e

\subsection{Deep Learning in the Compressed Domain}

Because deep networks are non-linear maps, deep learning has received limited study in the compressed domain. Ghosh and Chellappa \cite{ghosh2016deep} use a DCT as part of their network's first layer and show that it speeds up convergence for training. Wu \etal\cite{wu2018compressed} formulate a deep network for video action recognition that uses a separate network for i-frames and p-frames. Since the p-frame network functions on raw motion vectors and error residuals it is considered compressed domain processing, although it works in the spatial domain and not the quantized frequency domain as in this work. Wu \etal show a significant efficiency advantage compared to traditional 3D convolution architectures, which they attribute to the p-frame data being a minimal representation of the video motion. Gueguen \etal\cite{gueguen_2018_ICLR} formulate a traditional ResNet that operates on DCT coefficients directly instead of pixels, \eg the DCT coefficients are fed to the network. They show that learning converges faster on this input, further motivating the JPEG representation.

\ No newline at end of file

Because deep networks are non-linear maps, deep learning has received limited study in the compressed domain. Ghosh and Chellappa \cite{ghosh2016deep} use a DCT as part of their network's first layer and show that it speeds up convergence for training. This is extended by Ulicny \etal\cite{ulicny2018harmonic} to create separate filters for each DCT basis function. Wu \etal\cite{wu2018compressed} formulate a deep network for video action recognition that uses a separate network for i-frames and p-frames. Since the p-frame network functions on raw motion vectors and error residuals, it is considered compressed domain processing, although it works in the spatial domain and not the quantized frequency domain as in this work. Wu \etal show a significant efficiency advantage compared to traditional 3D convolution architectures, which they attribute to the p-frame data being a minimal representation of the video motion. Gueguen \etal\cite{gueguen_2018_ICLR} formulate a traditional ResNet that operates on DCT coefficients directly instead of pixels, \ie the DCT coefficients are fed to the network. They show that learning converges faster on this input, further motivating the JPEG representation.