We introduce a general method of performing Residual Network inference and learning in the JPEG transform domain that allows the network to consume compressed images as input. Our formulation leverages the linearity of the JPEG transform to redefine convolution and batch normalization with a tune-able numerical approximation for ReLu. The result is mathematically equivalent to the spatial domain network up to the ReLu approximation accuracy. A formulation for image classification and a model conversion algorithm for spatial domain networks are given as examples of the method. We show that the sparsity of the JPEG format allows for faster processing of the images with little to no penalty in the network accuracy.
We introduce a general method of performing Residual Network inference and learning in the JPEG transform domain that allows the network to consume compressed images as input. Our formulation leverages the linearity of the JPEG transform to redefine convolution and batch normalization with a tune-able numerical approximation for ReLu. The result is mathematically equivalent to the spatial domain network up to the ReLu approximation accuracy. A formulation for image classification and a model conversion algorithm for spatial domain networks are given as examples of the method. We show that the sparsity of the JPEG format allows for faster processing of images with little to no penalty in the network accuracy.
@@ -15,9 +15,9 @@ The JPEG compression algorithm is defined as the following steps.
\item Run-length code and entropy code the vectors
\end{enumerate}
This process is repeated independently for each image plane. In most cases, the original image is transformed from the RGB color space to YUV and chroma subsampling is applied since the human visual system is less sensitive to small color changes than to small brightness changes \cite{winkler2001vision}. The decompression algorithm is the inverse process. Note that the rounding step (step 5) must be skipped during decompression, this is the step in JPEG compression where information is lost and is the cause of artifacting in decompressed JPEG images.
This process is repeated independently for each image plane. In most cases, the original image is transformed from the RGB color space to YUV and chroma subsampling is applied since the human visual system is less sensitive to small color changes than to small brightness changes \cite{winkler2001vision}. The decompression algorithm is the inverse process. Note that the rounding step (step 5) must be skipped during decompression. This is the step in JPEG compression where information is lost and is the cause of artifacts in decompressed JPEG images.
The magnitude of the information loss can be tuned using the quantization coefficients. If a larger coefficient is applied in step 4, then the result will be closer to 0 which increases its likelihood of being dropped altogether during rounding. In this way, the JPEG transform forces sparsity on the representation, which why it compresses the image data so well. This is coupled with the tendency of the DCT to push the magnitude of the coefficients into the upper left corner (the DC coefficient and the lowest spatial frequency) to result in high spatial frequencies being dropped. Not only do these high spatial frequencies contribute less response to the human visual system, but they are also the optimal set to drop for a least squares reconstruction of the original image:
The magnitude of the information loss can be tuned using the quantization coefficients. If a larger coefficient is applied in step 4, then the result will be closer to 0 which increases its likelihood of being dropped altogether during rounding. In this way, the JPEG transform forces sparsity on the representation, which is why it compresses image data so well. This is coupled with the tendency of the DCT to push the magnitude of the coefficients into the upper left corner (the DC coefficient and the lowest spatial frequency) resulting in high spatial frequencies being dropped. Not only do these high spatial frequencies contribute less response to the human visual system, but they are also the optimal set to drop for a least squares reconstruction of the original image:
\begin{theorem}[DCT Least Squares Approximation Theorem]
Given a set of $N$ samples of a signal $X =\{x_0, ... x_N\}$, let $Y =\{y_0, ... y_N\}$ be the DCT coefficients of $X$. Then, for any $1\leq m \leq N$, the approximation
...
...
@@ -38,15 +38,15 @@ Theorem \ref{thm:dctls} states that a reconstruction using the $m$ lowest spatia
\label{sec:backjlm}
A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast} is that steps 1-4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, cannot be undone and Step 6, the entropy coding, is nonlinear and therefore must be undone. We define the JPEG Transform Domain as the output of Step 4 in the JPEG encoding algorithm. Inputs the the algorithms described here will be JPEGs after reversing the entropy coding.
A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast}, is that steps 1-4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, is irreversible and ignored by decompression. Step 6, the entropy coding, is nonlinear and its form is computed from the data directly, so it is difficult to work with this representation. We define the JPEG Transform Domain as the output of Step 4 in the JPEG encoding algorithm. Inputs to the algorithms described here will be JPEGs after reversing the entropy coding.
Formally, we model a single plane image as the type (0, 2) tensor $I \in H^*\otimes W^*$ where $H$ and $W$ are vector spaces and $*$ denotes the dual space. We always use the standard orthonormal basis for these vector spaces, this is important as it allows the free raising and lowering of indices without the use of a metric tensor.
Formally, we model a single plane image as the type (0, 2) tensor $I \in H^*\otimes W^*$ where $H$ and $W$ are vector spaces and $*$ denotes the dual space. We always use the standard orthonormal basis for these vector spaces, which allows the free raising and lowering of indices without the use of a metric tensor.
We define the JPEG transform $J \in H \otimes W \otimes X^*\otimes Y^*\otimes K^*$. $J$ represents a linear map $J: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes K^*$ which is computed as (in Einstein notation)
\begin{equation}
I'_{xyk} = J^{hw}_{xyk}I_{hw}
\end{equation}
and we say that $I'$ is the representation of $I$ in the JPEG transform domain. In the above equation, the indices $h,w$ give the pixel position, the indices $x,y$ give the block position, and the index$k$ gives the offset into the block.
and we say that $I'$ is the representation of $I$ in the JPEG transform domain. The indices $h,w$ give pixel position, $x,y$ give block position, and $k$ gives the offset into the block.
The form of $J$ is constructed from the JPEG compression steps listed in the previous section. Let the linear map $B: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes I^*\otimes J^*$ be defined as
then $B$ can be used to break the image represented by $I$ into blocks of a given size such that the first two indices $x,y$ index the block position and the last two indices $i,j$ index the offset into the block.
Next. let the linear map $D: I^*\otimes J^*\rightarrow A^*\otimes B^*$ be defined as
Next, let the linear map $D: I^*\otimes J^*\rightarrow A^*\otimes B^*$ be defined as
@@ -70,9 +70,9 @@ then $Z$ creates the zigzag ordered vectors. Finally, let $S: \Gamma^* \rightarr
\begin{equation}
S^\gamma_k = \frac{1}{q_k}
\end{equation}
where $q_k$ is a quantization coefficient, $S$ can be used to scale the vector entries by their quantization coefficients.
where $q_k$ is a quantization coefficient. This scales the vector entries by the quantization coefficients.
With linear maps for each step of the JPEG transform, we can then apply them to each other to create the $J$ tensor that was described at the beginning of this section
With linear maps for each step of the JPEG transform, we can then create the $J$ tensor described at the beginning of this section
noting that, for all tensors other than $\widetilde{S}$, we have freely raised and lowered indices without the use of a metric tensor since we consider only the standard orthonormal basis, as stated earlier.
Next consider a linear map $C: H^*\otimes W^*\rightarrow H^*\otimes W^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we would first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: X^*\otimes Y^*\otimes K^*\rightarrow X^*\otimes Y^*\otimes K^*$ as
Next consider a linear map $C: H^*\otimes W^*\rightarrow H^*\otimes W^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: X^*\otimes Y^*\otimes K^*\rightarrow X^*\otimes Y^*\otimes K^*$ as
which applies $C$ in the JPEG transform domain. There are two important points to note about $\Xi$. The first is that, although it encapsulates decompression, applying $C$ and compressing, it uses far fewer operations than doing these processes separately since the coefficients are multiplied out. The second is that it is mathematically equivalent to performing $C$ on the decompressed image and compressing the result, it is not an approximation.
\ No newline at end of file
which applies $C$ in the JPEG transform domain. There are two important points to note about $\Xi$. The first is that, although it encapsulates decompression, applying $C$ and compressing, it uses far fewer operations than doing these processes separately since the coefficients are multiplied out. The second is that it is mathematically equivalent to performing $C$ on the decompressed image and compressing the result. It is not an approximation.
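The equivalence claimed for $\Xi$ can be checked numerically on a small example. The following is a minimal sketch, not the implementation used in this work. It assumes an orthonormal DCT, replaces the zigzag ordering with row-major flattening (any fixed permutation gives the same result), and uses random quantization coefficients. The names \texttt{dct\_matrix}, \texttt{jpeg\_tensors}, and \texttt{compose} are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def dct_matrix(n=8):
    """Orthonormal DCT-II matrix; row a holds the a-th cosine basis vector."""
    a = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * a / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)
    return d


def jpeg_tensors(height, width, q, n=8):
    """Build J with index order (h, w, x, y, k) and its inverse with order (x, y, k, h, w).

    Assumption: row-major flattening stands in for the zigzag ordering.
    """
    d = dct_matrix(n)
    bx, by = height // n, width // n
    # blocks[h, w, x, y, i, j] = 1 when pixel (h, w) sits at offset (i, j) of block (x, y)
    blocks = np.zeros((height, width, bx, by, n, n))
    for h in range(height):
        for w in range(width):
            blocks[h, w, h // n, w // n, h % n, w % n] = 1.0
    # Apply the 2D DCT to the block offsets, then flatten (a, b) into k
    t = np.einsum("hwxyij,ai,bj->hwxyab", blocks, d, d)
    t = t.reshape(height, width, bx, by, n * n)
    J = t / q  # divide coefficient k by its quantization coefficient q_k
    # t is orthogonal as a (hw) x (xyk) matrix, so the inverse is its transpose scaled by q_k
    J_inv = np.moveaxis(t * q, (0, 1), (3, 4))
    return J, J_inv


def compose(J_inv, C, J):
    """Contract decompression, the pixel map C, and compression into one tensor."""
    tmp = np.tensordot(J_inv, C, axes=([3, 4], [0, 1]))  # (x, y, k, h', w')
    return np.tensordot(tmp, J, axes=([3, 4], [0, 1]))   # (x, y, k, x', y', k')


rng = np.random.default_rng(0)
H = W = 16
q = rng.uniform(1.0, 16.0, size=64)   # hypothetical quantization coefficients
J, J_inv = jpeg_tensors(H, W, q)
C = rng.normal(size=(H, W, H, W))     # arbitrary linear pixel manipulation
Xi = compose(J_inv, C, J)

image = rng.normal(size=(H, W))
coeffs = np.einsum("hwxyk,hw->xyk", J, image)         # image in the transform domain
direct = np.einsum("xykabc,xyk->abc", Xi, coeffs)      # apply Xi directly
pixels = np.einsum("xykhw,xyk->hw", J_inv, coeffs)     # decompress
moved = np.einsum("hwuv,hw->uv", C, pixels)            # apply C in the spatial domain
round_trip = np.einsum("hwxyk,hw->xyk", J, moved)      # compress again
print(np.allclose(direct, round_trip))
```

The dense tensors here are only practical for very small images; the sketch is meant to illustrate the composition, not its efficient storage.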
In this work we showed how to formulate deep residual learning in the JPEG transform domain, and we showed that it provides a notable performance benefit in terms of processing time for each image. Our method uses a model of convolutions as a linear map \cite{smith1994fast} and introduces a novel approximation technique for ReLu which, to our knowledge, is the first attempt at defining a non-linear function in the JPEG transform domain. We showed that the approximation can achieve highly performant results with little impact on the overall classification accuracy.
In this work we showed how to formulate deep residual learning in the JPEG transform domain, and that it provides a notable performance benefit in terms of processing time per image. Our method expresses convolutions as linear maps \cite{smith1994fast} and introduces a novel approximation technique for ReLu. We showed that the approximation can achieve highly performant results with little impact on classification accuracy.
Future work should focus on two main points. The first is efficiency of representation. Our linear maps take up more space, especially when stored in dense tensor data structures, than spatial domain convolutions. This makes it hard to scale the networks to datasets with large image sizes. Secondly, library support in commodity deep learning libraries for some of the features required by this algorithm are lacking. As of this writing, true sparse tensor support is missing in all of PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with these tensors being represented as coordinate lists which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries when compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}.
\ No newline at end of file
Future work should focus on two main points. The first is efficiency of representation. Our linear maps take up more space than spatial domain convolutions. This makes it hard to scale the networks to datasets with large image sizes. The second is library support: commodity deep learning libraries lack some of the features required by this algorithm. As of this writing, true sparse tensor support is missing in all of PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with these tensors being represented as coordinate lists which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries when compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}.
We give experimental evidence for the efficacy of our method, starting with a brief discussion of the architectures we use and the datasets for experimentation. We use model conversion as a sanity check, ensuring that the JPEG model with exact ReLu matches exactly the testing accuracy of a spatial domain model. Next we show how the ReLu approximation accuracy effects the overall network performance. We conclude by showing the training and testing time advantage of our method.
We give experimental evidence for the efficacy of our method, starting with a discussion of the architectures we use and the datasets. We use model conversion as a sanity check, ensuring that the JPEG model with exact ReLu matches exactly the testing accuracy of a spatial domain model. Next we show how the ReLu approximation accuracy affects overall network performance. We conclude by showing the training and testing time advantage of our method.
\subsection{Network Architectures and Datasets}
Since we are concerned with reproducing the inference results of spatial domain networks, we choose the MNIST \cite{lecun1998mnist} and CIFAR-10/100 \cite{krizhevsky2009learning} datasets since they are easy to work with. The MNIST images are padded to $32\times32$before being used to ensure an even number of JPEG blocks. Our network architecture is similarly simple is shown in Figure \ref{fig:na}. The classification network consists of three residual blocks with the final two performing downsampling so that the final feature map consists of a single JPEG block.
Because we are concerned with reproducing the inference results of spatial domain networks, we choose the MNIST \cite{lecun1998mnist} and CIFAR-10/100 \cite{krizhevsky2009learning} datasets, which are easy to work with. The MNIST images are padded to $32\times32$ to ensure an even number of JPEG blocks. Our network architecture is shown in Figure \ref{fig:na}. The classification network consists of three residual blocks with the final two performing downsampling so that the final feature map consists of a single JPEG block.
\caption{Simple network architecture. $T$ indicates the batch size.}
\label{fig:na}
\end{figure}
\subsection{Model Conversion}
For this first experiment, we provide empirical evidence that the JPEG formulation presented in this paper is mathematically equivalent to spatial domain network. To show this, we train 100 spatial domain models on each of the three datasets and give their mean testing accuracies. When then use model conversion to transform the pretrained models to the JPEG domain and give the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks and the exact (15 spatial frequency) ReLu formulation is used. The result of this test is given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.
For this first experiment, we show empirically that the JPEG formulation is mathematically equivalent to the spatial domain network. To show this, we train 100 spatial domain models on each of the three datasets and give their mean testing accuracies. We then use model conversion to transform the pretrained models to the JPEG domain and give the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks and the exact (15 spatial frequency) ReLu formulation is used. The result of this test is given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.
\begin{table}[h]
\centering
...
...
@@ -34,10 +34,11 @@ For this first experiment, we provide empirical evidence that the JPEG formulati
\subsection{ReLu Approximation Accuracy}
\label{sec:exprla}
Next, we examine the impact of the ReLu approximation. We start by examining the raw error on individual $8\times8$ blocks. For this test, we take random $4\times4$ pixel blocks in the range $[-1, 1]$ and scale them to $8\times8$ using a box filter. Fully random $8\times8$ blocks do not accurately represent the statistics of real images and are known to be a worst case for the DCT transform. The $4\times4$ blocks allow for a large random sample size while still approximating real image statistics. We take 10 million such blocks and compute the average RMSE of our Approximated Spatial Masking (ASM) technique and compare it to computing ReLu directly on the approximation (APX). This test is repeated for all one to fifteen spatial frequencies. The result, shown in Figure \ref{fig:rba} shows that our ASM method gives a better approximation (lower RMSE) through the range of spatial frequencies.
Next, we examine the impact of the ReLu approximation. We start by examining the raw error on individual $8\times8$ blocks. For this test, we take random $4\times4$ pixel blocks in the range $[-1, 1]$ and scale them to $8\times8$ using a box filter. Fully random $8\times8$ blocks do not accurately represent the statistics of real images and are known to be a worst case for the DCT. The $4\times4$ blocks allow for a large random sample size while still approximating real image statistics. We take 10 million blocks and compute the average RMSE of our ASM technique and compare it to computing ReLu directly on the approximation (APX). This test is repeated for each of one to fifteen spatial frequencies. The result, shown in Figure \ref{fig:rba}, shows that our ASM method gives a better approximation (lower RMSE) throughout the range of spatial frequencies.
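The block-level test described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used to produce the reported figures. It assumes that using $n$ spatial frequencies means keeping the DCT coefficients at positions $(a, b)$ with $a + b < n$, and that ASM takes the sign mask from the low-frequency approximation and applies it to the exact block, while APX applies ReLu to the approximation itself. For simplicity both are evaluated in the spatial domain. All function names are hypothetical.

```python
import numpy as np


def dct_matrix(n=8):
    """Orthonormal DCT-II matrix; row a holds the a-th cosine basis vector."""
    a = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * a / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)
    return d


def random_blocks(count, rng):
    """Random 4x4 blocks in [-1, 1], upscaled to 8x8 by pixel replication (box filter)."""
    small = rng.uniform(-1.0, 1.0, size=(count, 4, 4))
    return np.repeat(np.repeat(small, 2, axis=1), 2, axis=2)


def low_frequency_approx(blocks, num_freqs):
    """Reconstruct each block from the coefficients with a + b < num_freqs (assumption)."""
    d = dct_matrix(8)
    coeffs = d @ blocks @ d.T
    a, b = np.indices((8, 8))
    keep = (a + b) < num_freqs
    return d.T @ (coeffs * keep) @ d


def apx_relu(blocks, num_freqs):
    """APX: ReLu applied directly to the low-frequency approximation."""
    return np.maximum(low_frequency_approx(blocks, num_freqs), 0.0)


def asm_relu(blocks, num_freqs):
    """ASM (as assumed here): mask from the approximation, applied to the exact block."""
    mask = low_frequency_approx(blocks, num_freqs) > 0
    return blocks * mask


def rmse(x, y):
    return float(np.sqrt(np.mean((x - y) ** 2)))


rng = np.random.default_rng(0)
blocks = random_blocks(10000, rng)   # far fewer than 10 million, for a quick run
exact = np.maximum(blocks, 0.0)
for n in range(1, 16):
    print(n, rmse(apx_relu(blocks, n), exact), rmse(asm_relu(blocks, n), exact))
```

With all fifteen frequencies the approximation is exact, so both errors should vanish up to floating point precision, which matches the use of the 15 frequency setting as the exact ReLu.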
\begin{figure*}
\centering
\caption{ReLu accuracy results.}
\begin{subfigure}{0.33\textwidth}
\captionsetup{width=.8\linewidth}
\centering
...
...
@@ -61,11 +62,11 @@ Next, we examine the impact of the ReLu approximation. We start by examining the
\end{subfigure}
\end{figure*}
This test provides a strong motivation for the ASM method, so we move on to testing it in the model conversion setting. For this test, we again train 100 spatial domain models and then perform model conversion with the ReLu layers ranging from 1-15 spatial frequencies. We again compare our ASM method with the APX method. The result is given in Figure \ref{fig:ra}, again the ASM method outperforms the APX method.
This test provides a strong motivation for the ASM method, so we move on to testing it in the model conversion setting. For this test, we again train 100 spatial domain models and then perform model conversion with the ReLu layers ranging from 1-15 spatial frequencies. We again compare our ASM method with the APX method. The result is given in Figure \ref{fig:ra}. Again, the ASM method outperforms the APX method.
As a final test, we show that if the models are trained in the JPEG domain, the CNN weights will actually learn to cope with the approximation and fewer spatial frequencies are required to get good accuracy. We again compare ASM to APX in this setting. The result shown in Figure \ref{fig:rt} shows that the ASM method again outperforms the APX method and that the network weights have learned to cope with the approximation.
As a final test, we show that if the models are trained in the JPEG domain, the CNN weights will actually learn to cope with the approximation and fewer spatial frequencies are required for good accuracy. The result in Figure \ref{fig:rt} shows that the ASM method again outperforms the APX method and that the network weights have learned to cope with the approximation.
\subsection{Efficiency of Training and Testing}
...
...
@@ -76,4 +77,4 @@ As a final test, we show that if the models are trained in the JPEG domain, the
\label{fig:rt}
\end{figure}
Finally, we show the throughput for training and testing. For this we test on all three datasets by training and testing a spatial model and training and testing a JPEG model and measuring the time taken. This is then converted to an average throughput measurement. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results, shown in Figure \ref{fig:rt}, show that the JPEG model is able to outperform the spatial model in all cases, but that the performance on training is still limited. This is likely because of the more complex gradient created by the convolution and ReLu operations. At inference time, however, performance is greatly improved over the spatial model.
\ No newline at end of file
Finally, we show the throughput for training and testing. For this we test on all three datasets by training and testing a spatial model and training and testing a JPEG model and measuring the time taken. This is then converted to an average throughput measurement. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results, shown in Figure \ref{fig:rt}, show that the JPEG model is able to outperform the spatial model in all cases, but that the performance on training is still limited. This is likely caused by the more complex gradient created by the convolution and ReLu operations. At inference time, however, performance is greatly improved over the spatial model.
@@ -4,7 +4,7 @@ The popularization of deep learning since the 2012 AlexNet \cite{krizhevsky2012i
This problem has been addressed many times in the literature. Batch normalization \cite{ioffe2015batch} is ubiquitous in modern networks to accelerate their convergence. Residual learning \cite{he2016deep} allows for much deeper networks to learn effective mappings without overfitting. Techniques such as pruning and weight compression \cite{han2015deep} are becoming more commonplace. As problems become even larger and more complex, these techniques are increasingly being relied upon for efficient training and inference.
We tackle this problem at the level of the image representation. The JPEG image compression algorithm is the most widespread image file format. Traditionally, the first step in using JPEGs for machine learning is to decompress them. We propose to skip this step and instead reformulate the ResNet architecture to perform its operations directly on compressed images. The goal is to produce a new network that is mathematically equivalent to the spatial domain network, but which operates on compressed images by composing the compression transform into the network weights, which can be done because they are both linear maps. Because the ReLu function is non-linear, we develop an approximation technique for it. This is a general method and, to our knowledge, is the first attempt at formulating a piecewise linear function in the transform domain presented in the literature.
We tackle this problem at the level of the image representation. JPEG is the most widespread image file format. Traditionally, the first step in using JPEGs for machine learning is to decompress them. We propose to skip this step and instead reformulate the ResNet architecture to perform its operations directly on compressed images. The goal is to produce a new network that is mathematically equivalent to the spatial domain network, but which operates on compressed images by incorporating the compression transform into the network weights, which can be done because they are both linear maps. Because the ReLu function is non-linear, we develop an approximation technique for it. This is a general method and, to our knowledge, is the first attempt at formulating a piecewise linear function in the transform domain.
We briefly review prior work separated into three categories: compressed domain operations, machine learning in the compressed domain, and deep learning in the compressed domain.
We review prior work separated into three categories: compressed domain operations, machine learning in the compressed domain, and deep learning in the compressed domain.
\subsection{Compressed Domain Operations}
The expression of common operations in the compressed domain was an extremely active area of study in the late 80s and early 90s, motivated by the lack of computing power to quickly decompress, process, and recompress images and video. On the JPEG side, Smith and Rowe \cite{smith1993algorithms} formulate fast JPEG compatible algorithms for performing scalar and pixelwise addition and multiplication. This was extended by Shen and Sethi \cite{shen1995inner} to general blockwise operations and by Smith \cite{smith1994fast} to arbitrary linear maps. Natarajan and Vasudev \cite{natarajan1995fast} additionally formulate an extremely fast approximate algorithm for scaling JPEG images. On the MPEG side, Chang \etal\cite{chang1992video} introduce the basic algorithms for manipulating compressed video. Chang and Messerschmitt \cite{chang1993new} give a fast algorithm for decoding motion compensation before DCT which allows arbitrary video compositing operations to be performed.
The expression of common operations in the compressed domain was an extremely active area of study in the late 80s and early 90s, motivated by the lack of computing power to quickly decompress, process, and recompress images and video. For JPEG, Smith and Rowe \cite{smith1993algorithms} formulate fast JPEG compatible algorithms for performing scalar and pixelwise addition and multiplication. This was extended by Shen and Sethi \cite{shen1995inner} to general blockwise operations and by Smith \cite{smith1994fast} to arbitrary linear maps. Natarajan and Vasudev \cite{natarajan1995fast} additionally formulate an extremely fast approximate algorithm for scaling JPEG images. For MPEG, Chang \etal\cite{chang1992video} introduce the basic algorithms for manipulating compressed video. Chang and Messerschmitt \cite{chang1993new} give a fast algorithm for decoding motion compensation before DCT which allows arbitrary video compositing operations to be performed.
\subsection{Machine Learning in the Compressed Domain}
...
...
@@ -12,4 +12,4 @@ Compressed domain machine learning grew out of the work in the mid 90s. Arman \e
\subsection{Deep Learning in the Compressed Domain}
Because deep learning in particular is a non-linear map, it has received limited study in the compressed domain. Ghosh and Chellappa \cite{ghosh2016deep} use a DCT as part of their network's first layer and show that it speeds up convergence for training. Wu \etal\cite{wu2018compressed} formulate a deep network for video action recognition that uses a separate network for i-frames and p-frames. Since the p-frame network functions on raw motion vectors and error residuals it is considered compressed domain processing, although it works in the spatial domain and not the quantized frequency domain as in this work. Wu \etal show a significant efficiency advantage compared to traditional 3D convolution architectures, which they attribute to the p-frame data being a minimal representation of the video motion. Gueguen \etal\cite{gueguen_2018_ICLR} formulate a traditional ResNet that operates on DCT coefficients directly instead of pixels, \eg the DCT coefficients are fed to the network. They show that learning is able to converge faster on this input, further motivating the JPEG representation.
\ No newline at end of file
Because deep networks are non-linear maps, deep learning has received limited study in the compressed domain. Ghosh and Chellappa \cite{ghosh2016deep} use a DCT as part of their network's first layer and show that it speeds up convergence for training. Wu \etal\cite{wu2018compressed} formulate a deep network for video action recognition that uses a separate network for i-frames and p-frames. Since the p-frame network functions on raw motion vectors and error residuals, it is considered compressed domain processing, although it works in the spatial domain and not the quantized frequency domain as in this work. Wu \etal show a significant efficiency advantage compared to traditional 3D convolution architectures, which they attribute to the p-frame data being a minimal representation of the video motion. Gueguen \etal\cite{gueguen_2018_ICLR} formulate a traditional ResNet that operates on DCT coefficients directly instead of pixels, \ie the DCT coefficients are fed to the network. They show that learning converges faster on this input, further motivating the JPEG representation.