We introduce a general method of performing Residual Network inference and learning in the JPEG transform domain that allows the network to consume compressed images as input. Our formulation leverages the linearity of the JPEG transform to redefine convolution and batch normalization with a tunable numerical approximation for ReLU. The result is mathematically equivalent to the spatial domain network up to the ReLU approximation accuracy. A formulation for image classification and a model conversion algorithm for spatial domain networks are given as examples of the method. We show that skipping the costly decompression step allows for faster processing of images with little to no penalty in network accuracy.
We say that $I'$ is the representation of $I$ in the JPEG transform domain. The indices $h,w$ give pixel position, $x,y$ give block position, and $k$ gives the offset into the block.
The form of $J$ is constructed from the JPEG compression steps listed in the previous section. Let the linear map $B: H^*\otimes W^*\rightarrow X^*\otimes Y^*\otimes M^*\otimes N^*$ be defined as
\begin{equation}
B^{hw}_{xymn} = \left\{\begin{array}{lr} 1 &\text{$h,w$ belongs in block $x,y$ at offset $m,n$}\\ 0 &\text{otherwise}\end{array}\right.
\end{equation}
then $B$ can be used to break the image represented by $I$ into blocks of a given size such that the first two indices $x,y$ index the block position and the last two indices $m,n$ index the offset into the block.
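As a concrete sketch (in numpy, with a hypothetical $16\times 16$ image and a hypothetical \texttt{block\_tensor} helper), applying $B$ with \texttt{einsum} reproduces an ordinary reshape into $8\times 8$ blocks:

```python
import numpy as np

def block_tensor(H, W, N=8):
    """Indicator tensor B[h, w, x, y, m, n] = 1 iff pixel (h, w) lies in
    block (x, y) at offset (m, n)."""
    B = np.zeros((H, W, H // N, W // N, N, N))
    for h in range(H):
        for w in range(W):
            B[h, w, h // N, w // N, h % N, w % N] = 1.0
    return B

I = np.arange(16 * 16, dtype=np.float64).reshape(16, 16)
B = block_tensor(16, 16)
# Contract over the pixel indices h, w to get the blocked image I[x, y, m, n]
blocked = np.einsum('hwxymn,hw->xymn', B, I)
# Applying B is equivalent to a reshape/transpose into 8x8 blocks
ref = I.reshape(2, 8, 2, 8).transpose(0, 2, 1, 3)
assert np.allclose(blocked, ref)
```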
Next, let the linear map $D: M^*\otimes N^*\rightarrow A^*\otimes B^*$ be defined as
\begin{equation}
D^{mn}_{ab} = \frac{1}{4}V(a)V(b)\cos\left(\frac{(2m+1)a\pi}{16}\right)\cos\left(\frac{(2n+1)b\pi}{16}\right)
\end{equation}
where $V(u)$ is a normalizing scale factor. Then $D$ represents the 2D discrete forward (and inverse) DCT. Let $Z: A^*\otimes B^*\rightarrow\Gamma^*$ be defined as
\begin{equation}
Z^{\alpha\beta}_\gamma = \left\{\begin{array}{lr} 1 &\text{$\alpha, \beta$ is at $\gamma$ under zigzag ordering}\\ 0 &\text{otherwise}\end{array}\right.
\end{equation}
then $Z$ orders the DCT coefficients of a block into a single vector following the zigzag ordering. Let $S: \Gamma^*\rightarrow K^*$ be defined as
\begin{equation}
S^\gamma_k = \left\{\begin{array}{lr} \frac{1}{q_k} &\text{$\gamma = k$}\\ 0 &\text{otherwise}\end{array}\right.
\end{equation}
where $q_k$ is a quantization coefficient. This scales each vector entry by the reciprocal of its corresponding quantization coefficient.
With linear maps for each step of the JPEG transform, we can then create the $J$ tensor described at the beginning of this section
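As a sanity check, the composition of these maps can be sketched in numpy for a single $8\times 8$ block (dropping the blocking map $B$ for brevity). The quantization coefficients \texttt{q} below are hypothetical placeholders and rounding is omitted, so this illustrates the linear structure rather than a real JPEG codec:

```python
import numpy as np

N = 8
# D: orthonormal 2D DCT tensor over block offsets, V the normalizing factor
V = np.full(N, np.sqrt(2.0 / N)); V[0] = np.sqrt(1.0 / N)
m, a = np.arange(N)[:, None], np.arange(N)[None, :]
C1 = V * np.cos((2 * m + 1) * a * np.pi / (2 * N))
D = np.einsum('ma,nb->mnab', C1, C1)

# Z: indicator tensor for JPEG zigzag ordering of the 64 frequencies
order = []
for s in range(2 * N - 1):
    diag = [(i, s - i) for i in range(N) if 0 <= s - i < N]
    order.extend(diag if s % 2 else diag[::-1])
Z = np.zeros((N, N, N * N))
for g, (i, j) in enumerate(order):
    Z[i, j, g] = 1.0

# S: scale entry k by 1/q_k (hypothetical quantization coefficients)
q = np.linspace(1.0, 64.0, N * N)
S = np.diag(1.0 / q)

# Single-block J: pixel offsets (m, n) -> quantized zigzag coefficients k
J1 = np.einsum('mnab,abg,gk->mnk', D, Z, S)

block = np.random.default_rng(0).standard_normal((N, N))
coeffs = np.einsum('mnk,mn->k', J1, block)
# Inverse path: dequantize with q, undo the zigzag, apply the DCT again
recon = np.einsum('mnab,abg,g->mn', D, Z, q * coeffs)
assert np.allclose(recon, block)
```

Because rounding is omitted, the inverse path recovers the block exactly, matching the claim that $\widetilde{J}$ inverts $J$ up to quantization error.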
The inverse mapping also exists as a tensor $\widetilde{J}$ which can be defined using the same linear maps with the exception of $S$. Let $\widetilde{S}$ be
\begin{equation}
\widetilde{S}^k_\gamma = \left\{\begin{array}{lr} q_k &\text{$k = \gamma$}\\ 0 &\text{otherwise}\end{array}\right.
\end{equation}
noting that, for all tensors other than $\widetilde{S}$, we have freely raised and lowered indices without the use of a metric tensor since we consider only the standard orthonormal basis, as stated earlier.
Next consider a linear map $C: H^*\otimes W^*\rightarrow H^*\otimes W^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: X^*\otimes Y^*\otimes K^*\rightarrow X^*\otimes Y^*\otimes K^*$ as
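As an illustration of this composition, the following sketch treats a single flattened $8\times 8$ block and lets a plain orthonormal DCT stand in for the full JPEG map (so $\widetilde{J}$ is just $J^\top$); the pixel-domain operator \texttt{C} is an arbitrary difference filter chosen for the example:

```python
import numpy as np

N = 8
V = np.full(N, np.sqrt(2.0 / N)); V[0] = np.sqrt(1.0 / N)
m, a = np.arange(N)[:, None], np.arange(N)[None, :]
C1 = V * np.cos((2 * m + 1) * a * np.pi / (2 * N))
J = np.kron(C1, C1).T              # flattened pixels -> DCT coefficients
Jt = J.T                           # orthonormal, so the inverse is J^T

# Arbitrary pixel-domain linear manipulation (a difference filter)
C = np.eye(64) - np.roll(np.eye(64), 1, axis=1)

# Compose once; Xi then acts directly on compressed representations
Xi = J @ C @ Jt

x = np.random.default_rng(2).standard_normal(64)   # flattened pixel block
compressed = J @ x
assert np.allclose(Xi @ compressed, J @ (C @ x))
```

The point of forming $\Xi$ once is that the decompress/operate/recompress round trip never has to be materialized at inference time.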
In this work we showed how to formulate deep residual learning in the JPEG transform domain, and that it provides a notable performance benefit in terms of processing time per image. Our method expresses convolutions as linear maps \cite{smith1994fast} and introduces a novel approximation technique for ReLU. We showed that the approximation achieves high performance with little impact on classification accuracy.
Future work should focus on two main points. The first is efficiency of representation: our linear maps take up more space than spatial domain convolutions, which makes it hard to scale the networks to datasets with large image sizes. The second is library support, since commodity deep learning libraries lack some of the features this algorithm requires. As of this writing, true sparse tensor support is missing in PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with sparse tensors represented as coordinate lists, which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}, though we use the \texttt{opt\_einsum} \cite{G2018opt} tool to partially mitigate this.
The contributions of this work are as follows
\begin{enumerate}
\item A model conversion algorithm to apply pretrained spatial domain networks to JPEG images
\item Approximated Spatial Masking: the first general technique for application of piecewise linear functions in the transform domain
\end{enumerate}
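To illustrate the idea behind spatial masking in simplified form (a plain orthonormal DCT stands in for the JPEG map, and the mask is computed exactly by decompressing, whereas the approximation makes this step cheap): ReLU is a data-dependent linear map, $\mathrm{ReLU}(x) = \mathrm{diag}(\mathbb{1}[x > 0])\,x$, so once a spatial-domain 0/1 mask is known it can be conjugated into the transform domain.

```python
import numpy as np

N = 8
V = np.full(N, np.sqrt(2.0 / N)); V[0] = np.sqrt(1.0 / N)
m, a = np.arange(N)[:, None], np.arange(N)[None, :]
C1 = V * np.cos((2 * m + 1) * a * np.pi / (2 * N))
J = np.kron(C1, C1).T                  # flattened pixels -> coefficients

x = np.random.default_rng(0).standard_normal(64)
c = J @ x                              # transform-domain input

# ReLU(x) = diag(mask) @ x with mask = [x > 0]; given the mask, the whole
# nonlinearity becomes the linear map J @ diag(mask) @ J^T on coefficients
mask = (J.T @ c) > 0
relu_c = J @ (np.diag(mask.astype(float)) @ (J.T @ c))
assert np.allclose(J.T @ relu_c, np.maximum(x, 0))
```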
By skipping the decompression step and by operating on the sparser compressed format, we show a notable increase in speed for testing and a marginal speedup for training.