@@ -38,15 +38,15 @@ Theorem \ref{thm:dctls} states that a reconstruction using the $m$ lowest spatia
\label{sec:backjlm}
A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast} is that steps 1-4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, is irreversible and ignored by decompression. Step 6, the entropy coding, is a nonlinear and its form is computed from the data directly, so it is difficult to work with this representation. We define the JPEG Transform Domain as the output of Step 4 in the JPEG encoding algorithm. Inputs to the algorithms described here will be JPEGs after reversing the entropy coding.
A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast}, is that Steps 1--4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, is irreversible and ignored by decompression. Step 6, the entropy coding, is a nonlinear map and its form is computed from the data directly, so it is difficult to work with this representation. We define the JPEG Transform Domain as the output of Step 4 in the JPEG encoding algorithm. This is a standard convention of compressed domain processing. Inputs to the algorithms described here will be JPEGs after reversing the entropy coding.
Formally, we model a single plane image as the type (0, 2) tensor $I \in H^*\otimes W^*$ where $H$ and $W$ are vector spaces and $*$ denotes the dual space. We always use the standard orthonormal basis for these vector spaces which allows the free raising and lowering of indices without the use of a metric tensor.
We define the JPEG transform $J \in H \otimes W \otimes X^*\otimes Y^*\otimes K^*$. $J$ represents a linear map $J: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes K^*$ which is computed as (in Einstein notation)
We define the JPEG transform as the type (2, 3) tensor $J \in H \otimes W \otimes X^*\otimes Y^*\otimes K^*$. $J$ represents a linear map $J: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes K^*$, which is applied to an image as (in Einstein notation)
\begin{equation}
I'_{xyk} = J^{hw}_{xyk}I_{hw}
\end{equation}
and we say that $I'$ is the representation of $I$ in the JPEG transform domain. The indices $h,w$ give pixel position, $x,y$ give block position, and $k$ gives the offset into the block.
We say that $I'$ is the representation of $I$ in the JPEG transform domain. The indices $h,w$ give pixel position, $x,y$ give block position, and $k$ gives the offset into the block.
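For concreteness, the contraction can be written with explicit sums and checked against a small case. The following is an illustrative sketch only: it assumes the standard $8\times8$ JPEG block size and zero-based indices, neither of which is fixed by the definitions above, and it takes a $32\times32$ single-plane image as the example input.
\begin{equation}
I'_{xyk} = \sum_{h}\sum_{w} J^{hw}_{xyk} I_{hw}
\end{equation}
Under these assumptions $h, w \in \{0,\ldots,31\}$, $x, y \in \{0,\ldots,3\}$, and $k \in \{0,\ldots,63\}$, so
\begin{equation}
\dim(H^*\otimes W^*) = 32\cdot 32 = 1024 = 4\cdot 4\cdot 64 = \dim(X^*\otimes Y^*\otimes K^*),
\end{equation}
and a dense $J$ would hold $1024^2 = 1{,}048{,}576$ entries.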
The form of $J$ is constructed from the JPEG compression steps listed in the previous section. Let the linear map $B: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes I^*\otimes J^*$ be defined as
In this work we showed how to formulate deep residual learning in the JPEG transform domain, and that it provides a notable performance benefit in terms of processing time per image. Our method uses expresses convolutions as linear maps \cite{smith1994fast} and introduces a novel approximation technique for ReLu. We showed that the approximation can achieve highly performant results with little impact on classification accuracy.
In this work we showed how to formulate deep residual learning in the JPEG transform domain, and that it provides a notable performance benefit in terms of processing time per image. Our method expresses convolutions as linear maps \cite{smith1994fast} and introduces a novel approximation technique for ReLu. We showed that the approximation can achieve highly performant results with little impact on classification accuracy.
Future work should focus on two main points. The first is efficiency of representation. Our linear maps take up more space than spatial domain convolutions, which makes it hard to scale the networks to datasets with large image sizes. The second is library support: commodity deep learning libraries lack some of the features required by this algorithm. As of this writing, true sparse tensor support is missing in all of PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with these tensors being represented as coordinate lists which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries when compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}.
@@ -4,7 +4,7 @@ We give experimental evidence for the efficacy of our method, starting with a di
\subsection{Network Architectures and Datasets}
Since we are concerned with reproducing the inference results of spatial domain networks, we choose the MNIST \cite{lecun1998mnist} and CIFAR-10/100 \cite{krizhevsky2009learning} datasets since they are easy to work with. The MNIST images are padded to $32\times32$ to ensure an even number of JPEG blocks. Our network architecture is shown in Figure \ref{fig:na}. The classification network consists of three residual blocks with the final two performing downsampling so that the final feature map consists of a single JPEG block.
Since we are concerned with reproducing the inference results of spatial domain networks, we choose the MNIST \cite{lecun1998mnist} and CIFAR-10/100 \cite{krizhevsky2009learning} datasets because they are easy to work with. The MNIST images are padded to $32\times32$ to ensure an even number of JPEG blocks. Our network architecture is shown in Figure \ref{fig:na}. The classification network consists of three residual blocks with the final two performing downsampling so that the final feature map consists of a single JPEG block. The goal of this architecture is not to get high accuracy, but rather to serve as a point of comparison for the spatial and JPEG algorithms.
\begin{figure}
\centering
...
...
@@ -70,11 +70,12 @@ As a final test, we show that if the models are trained in the JPEG domain, the
\caption{Throughput. The JPEG model has a more complex gradient which limits speed improvement during training. Inference, however, sees considerably higher throughput.}
\label{fig:rt}
\label{fig:tp}
\end{figure}
Finally, we show the throughput for training and testing. For this we test on all three datasets by training and testing a spatial model and training and testing a JPEG model and measuring the time taken. This is then converted to an average throughput measurement. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results, shown in Figure \ref{fig:rt}, show that the JPEG model is able to outperform the spatial model in all cases, but that the performance on training is still limited. This is likely caused by the more complex gradient created by the convolution and ReLu operations. At inference time, however, performance is greatly improved over the spatial model.
\ No newline at end of file
Finally, we show the throughput for training and testing. For this, we train and test both a spatial model and a JPEG model on all three datasets and measure the time taken. This is then converted to an average throughput measurement. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results in Figure \ref{fig:tp} show that the JPEG model is able to outperform the spatial model in all cases, but that the performance on training is still limited. This is caused by the more complex gradient created by the convolution and ReLu operations. At inference time, however, performance is greatly improved over the spatial model.
@@ -12,4 +12,4 @@ Compressed domain machine learning grew out of the work in the mid 90s. Arman \e
\subsection{Deep Learning in the Compressed Domain}
Because deep networks are non-linear maps, deep learning has received limited study in the compressed domain. Ghosh and Chellappa \cite{ghosh2016deep} use a DCT as part of their network's first layer and show that it speeds up convergence for training. Wu \etal\cite{wu2018compressed} formulate a deep network for video action recognition that uses a separate network for i-frames and p-frames. Since the p-frame network functions on raw motion vectors and error residuals it is considered compressed domain processing, although it works in the spatial domain and not the quantized frequency domain as in this work. Wu \etal show a significant efficiency advantage compared to traditional 3D convolution architectures, which they attribute to the p-frame data being a minimal representation of the video motion. Gueguen \etal\cite{gueguen_2018_ICLR} formulate a traditional ResNet that operates on DCT coefficients directly instead of pixels, \eg the DCT coefficients are fed to the network. They show that learning converges faster on this input, further motivating the JPEG representation.
\ No newline at end of file
Because deep networks are non-linear maps, deep learning has received limited study in the compressed domain. Ghosh and Chellappa \cite{ghosh2016deep} use a DCT as part of their network's first layer and show that it speeds up convergence for training. This is extended by Ulicny \etal\cite{ulicny2018harmonic} to create separate filters for each DCT basis function. Wu \etal\cite{wu2018compressed} formulate a deep network for video action recognition that uses a separate network for i-frames and p-frames. Since the p-frame network functions on raw motion vectors and error residuals it is considered compressed domain processing, although it works in the spatial domain and not the quantized frequency domain as in this work. Wu \etal show a significant efficiency advantage compared to traditional 3D convolution architectures, which they attribute to the p-frame data being a minimal representation of the video motion. Gueguen \etal\cite{gueguen_2018_ICLR} formulate a traditional ResNet that operates on DCT coefficients directly instead of pixels, \ie the DCT coefficients are fed to the network. They show that learning converges faster on this input, further motivating the JPEG representation.
% Pages are numbered in submission mode, and unnumbered in camera-ready
\ificcvfinal\pagestyle{empty}\fi
%\ificcvfinal\pagestyle{empty}\fi
\addbibresource{bibliography.bib}
...
...
@@ -115,6 +115,8 @@ We conclude by outlining in pseudocode the algorithms for the three layer operat
\EndFunction
\end{algorithmic}
\newpage
\captionof{algorithm}{Approximated Spatial Masking for ReLu. $F$ is a DCT domain block, $\phi$ is the desired maximum spatial frequencies, $N$ is the block size.}