This process is repeated independently for each image plane. In most cases, the original image is transformed from the RGB color space to YUV and chroma subsampling is applied since the human visual system is less sensitive to small color changes than to small brightness changes \cite{winkler2001vision}. The decompression algorithm is the inverse process. Note that the rounding step (step 5) must be skipped during decompression, this is the step in JPEG compression where information is lost and is the cause of artifacting in decompressed JPEG images.

...

...

@@ -48,8 +40,8 @@ Theorem \ref{thm:dctls} states that a reconstruction using the $m$ lowest spatia

A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast} is that steps 1-4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, cannot be undone and Step 6, the entropy coding, is nonlinear and therefore must be undone. We define the JPEG Transform Domain as the output of Step 4 in the JPEG encoding algorithm. Inputs the the algorithms described here will be JPEGs after reversing the entropy coding.

Formally, we model a single plane image as the type (0, 2) tensor $I \in V^*\otimes V^*$ for some vector space $V$ and its dual $V^*$. Note that we take all tensors as a tensor product combination of $V$ and $V^*$ without loss of generality. In real images, the dimensions have physical meaning (\eg width and height of the image) and will be of different sizes. The analysis presented in this work applies to any vector space although in real images we are dealing with floating point numbers. The basis of $V$ is always the standard orthonormal basis, this is important as it allows the free raising and lowering of indices without the use of a metric tensor.

We define the JPEG transform $J \in V \otimes V \otimes V^*\otimes V^*\otimes V^*$, a type (2,3) tensor. Then$J$ represents a linear map $J: V^*\otimes V^*\rightarrow V^*\otimes V^*\otimes V^*$ which is computed as (in Einstein notation)

Formally, we model a single plane image as the type (0, 2) tensor $I \in V^*\otimes V^*$ for some vector space $V$ and its dual $V^*$. Note that we take all tensors as a tensor product combination of $V$ and $V^*$ without loss of generality. In real images, the dimensions have physical meaning (\eg width and height of the image) and will be of different sizes. The basis of $V$ is always the standard orthonormal basis, this is important as it allows the free raising and lowering of indices without the use of a metric tensor.

We define the JPEG transform $J \in V \otimes V \otimes V^*\otimes V^*\otimes V^*$$J$ represents a linear map $J: V^*\otimes V^*\rightarrow V^*\otimes V^*\otimes V^*$ which is computed as (in Einstein notation)

\begin{equation}

I'_{xyk} = J^{sr}_{xyk}I_{sr}

...

...

@@ -98,7 +90,7 @@ Then

\end{equation}

noting that, for all tensors other than $\widetilde{S}$, we have freely raised and lowered indices without the use of a metric tensor on $V$ since we consider only the standard orthonormal basis, as stated earlier.

Next consider a linear map $C: V^*\otimes V^*\rightarrow V^*\otimes V^*$ which performs an arbitrary pixel manipulation on an image $I$. To apply this mapping to a JPEG image $I'$, we would first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: V^*\otimes V^*\otimes V^*\rightarrow V^*\otimes V^*\otimes V^*$ as

Next consider a linear map $C: V^*\otimes V^*\rightarrow V^*\otimes V^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we would first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: V^*\otimes V^*\otimes V^*\rightarrow V^*\otimes V^*\otimes V^*$ as

@@ -15,14 +15,41 @@ Since we are concerned with reproducing the inference results of spatial domain

\subsection{Model Conversion}

\TODO Show that on both datasets over several (maybe a hundred?) of both models trained in spatial domain, testing in spatial domain matches testing in JPEG domain.

For this first experiment, we provide empirical evidence that the JPEG formulation presented in this paper is mathematically equivalent to spatial domain network. To show this, we train 100 spatial domain models on each of three datasets and give their mean testing accuracies. When then use model conversion to transform the pretrained models to the JPEG domain and give the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks and the exact (15 spatial frequency) ReLu formulation is used. The result of this test is given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.

\begin{table}[h]

\begin{tabular}{|r|l|l|l|}

\hline

Dataset & Spatial & JPEG & Deviation \\\hline

MNIST & 0.988 & 0.988 & 2.999e-06 \\\hline

CIFAR10 & 0.725 & 0.725 & 9e-06 \\\hline

CIFAR100 & 0.385 & 0.385 & 1e-06 \\\hline

\end{tabular}

\caption{Model conversion accuracies. Spatial and JPEG testing accuracies are the same to within floating point error.}

\label{tab:mc}

\end{table}

% CIFAR10:

\subsection{ReLu Approximation Accuracy}

\label{sec:exprla}

\TODO two things to show here. The first is a large scale test over $8\times8$ blocks that shows how the error of the ReLu approximation itself changes from 0-14 spatial frequencies (same thing from the notes). Second is a similar test but using (again like 100) models of both types for both datasets, show how the testing accuracy changes from 0-14 spatial frequencies. Finally, show that if you train and test like 100 models using 0-14 spatial frequencies, show that there's less error because the convolutional weights will learn to cope with the appx.

Next, we with to examine the impact of the ReLu approximation. We start by examining the raw error on individual $8\times8$ blocks. For this test, we take random $4\times4$ pixel blocks in the range $[-1, 1]$ and scale them to $8\times8$ using a box filter. Fully random $8\times8$ blocks do not accurately represent the statistics of real images and are known to be a worst case for the DCT transform. The $4\times4$ blocks allow for a large random sample size while still approximating real image statistics. We take 10 million such blocks and compute the average RMSE of our Approximated Spatial Masking (ASM) technique and compare it to computing ReLu directly on the approximation (APX). This test is repeated for all one to fifteen spatial frequencies. The result, shown in Figure \ref{fig:rba} shows that our ASM method gives a better approximation (lower RMSE) through the range of spatial frequencies.

\caption{ReLu blocks accuracy. Our ASM method consistently gives lower error than the naive approximation method.}

\label{fig:rba}

\end{figure}

This test provides a strong motivation for the ASM method, so we move on to testing it in the model conversion setting. For this test, we again train 100 spatial domain models and then perform model conversion with the ReLu layers ranging from 1-15 spatial frequencies. We again compare our ASM method with the APX method. The result is given in Figure \ref{fig:ra}, again the ASM method outperforms the APX method.

\caption{ReLu model conversion accuracy. ASM again outperforms the naive approximation. The spatial domain accuracy is given for each dataset with dashed lines.}

\label{fig:ra}

\end{figure}

As a final test, we show that if the models are trained in the JPEG domain, the CNN weights will actually learn to cope with the approximation and fewer spatial frequencies are required to get good accuracy. We again compare ASM to APX in this setting.

The ResNet architecture, generally, consists of blocks of four basic operations: Convolution (potentially strided), ReLu, Batch Normalization, and Component-wise addition, with the blocks terminating with a global average pooling operation \cite{he2016deep} before a fully connected layer performs the final classification. Our goal will be to develop JPEG domain equivalents to these five operations.

Before beginning this discussion, we must extend the image model presented in Section \ref{sec:backjlm}. Convolutional neural networks operate on batches of multichannel images. We redefine images as $I \in V^*\otimes V^*\otimes V^*\otimes V^*$ where the first index is the batch, the second is the channel, the third and fourth indices are the image height and width respectively.

The ResNet architecture, generally, consists of blocks of four basic operations: Convolution (potentially strided), ReLu, Batch Normalization, and Component-wise addition, with the blocks terminating with a global average pooling operation \cite{he2016deep} before a fully connected layer performs the final classification. Our goal will be to develop JPEG domain equivalents to these five operations. Network activations are given as a single tensor holding a batch of multi-channel images, that is $I \in V^*\otimes V^*\otimes V^*\otimes V^*$.

\subsection{Convolution}

The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}. With our updated image model, the convolution operation in the spatial domain is a shorthand notation for a linear map $C: V^*\otimes V^*\otimes V^*\otimes V^*\rightarrow V^*\otimes V^*\otimes V^*\otimes V^*$. Since the same operation is applied to each image in the batch, we can represent $C$ with a type (3, 3) tensor. The entries of this tensor give the coefficient for a given pixel in a given input channel for each pixel in each output channel. This notation can describe any multichannel linear pixel manipulation. We now develop the algorithm for representing discrete convolutional filters using this data structure.

The naive algorithm is a simple iterative method which copies the coefficients from the given filter by noting that a given output pixel under the discrete convolution is linear combination of it's surrounding pixels. Stride is incorporated by simply skipping over input pixels. After constructing the linear map, it can be combined as shown in Equation \ref{eq:stoj} to make the equivalent JPEG domain map. While this algorithm does correctly produce the spatial domain linear map, and it is easy to understand, it is slow to construct, difficult to parallelize on a modern GPU, and incurs high memory overhead. In a sense, this is the brute force approach.

The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}. The convolution operation in the spatial domain is a shorthand notation for a linear map $C: V^*\otimes V^*\otimes V^*\otimes V^*\rightarrow V^*\otimes V^*\otimes V^*\otimes V^*$. Since the same operation is applied to each image in the batch, we can represent $C$ with a type (3, 3) tensor. The entries of this tensor give the coefficient for a given pixel in a given input channel for each pixel in each output channel. This notation can describe any multichannel linear pixel manipulation. We now develop the algorithm for representing discrete convolutional filters using this data structure.

A more efficient algorithm would produce the JPEG domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Note that since

\begin{equation}

\widetilde{J}\in V \otimes V \otimes V \otimes V^* \otimes V^*

\end{equation}

the last two indices of $\widetilde{J}$ form single channel image under our image model (\eg the last two indices are in $V^*\otimes V^*$). If the convolution can be applied to this "image", then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor

\begin{equation}

\widehat{J}\in V \otimes V^* \otimes V^*

\end{equation}

A naive algorithm can simply copy randomly initialized convolution weights into this larger structure following the formula for a 2D discrete cross-correlation and then apply the JPEG compression and decompression tensors to the result, but this is difficult to parallelize and incurs additional memory overhead to store the spatial domain operation. A more efficient algorithm would produce the JPEG domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Note that since $\widetilde{J}\in V \otimes V \otimes V \otimes V^*\otimes V^*$ the last two indices of $\widetilde{J}$ form single channel image under our image model (\eg the last two indices are in $V^*\otimes V^*$). If the convolution can be applied to this "image", then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor $\widehat{J}\in V \otimes V^*\otimes V^*$

by reshaping $\widetilde{J}$ and treating this as a batch of images, then, given the initialized convolution filter, $K$ computing

\begin{equation}

\widehat{C}^b = \widehat{J}^b \star K

\end{equation}

gives us the desired map. After reshaping $\widehat{C}$ back to the original shape of $\widetilde{J}$, we have

\begin{equation}

\widetilde{C}\in V \otimes V \otimes V \otimes V \otimes V^* \otimes V^* \otimes V^*

\end{equation}

and the full compressed domain operation can be expressed as

gives us the desired map. After reshaping $\widehat{C}$ back to the original shape of $\widetilde{J}$ to give $\widetilde{C}$, the full compressed domain operation can be expressed as

where $m$ and $n$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batch convolution operation which is available in all GPU accelerated deep learning libraries. The algorithm is shown in Algorithm \ref{alg:dce}.

\begin{algorithm}

\caption{Direct Convolution Explosion. $K$ is an initial filter, $m, n$ are the input and output channels, $h, w$ are the image height and width, $s$ is the stride, $\star_s$ denotes the discrete convolution with stride $s$}

where $m$ and $n$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batch convolution operation which is available in all GPU accelerated deep learning libraries. The algorithm is shown in the supplementary material.

\subsection{ReLu}

...

...

@@ -62,7 +31,7 @@ Although this is one of the simplest piecewise linear functions to study, it sti

To develop this technique we must balance two seemingly competing criteria. The first is that we want to use the JPEG transform domain, since its sparsity has a computational advantage over the spatial domain. The second is that we want to compute a non-linear function which is incompatible with the JPEG transform. Can we balance these two constrains by sacrificing a third criterion? Consider an approximation of the spatial domain image that uses only a subset of the DCT coefficients. Computing this is fast, since it does not use the full set of coefficients, and gives us a spatial domain representation which is compatible with the non-linearity. What we sacrifice is accuracy. The accuracy-speed tradeoff is tunable to the problem by changing the size of the set of coefficients.

The question now is how do we choose the subset of coefficients to use. Theorem \ref{thm:dctls}stated that given$N$DCT coefficients, the best reconstruction with a subset of size $m$ uses the lowest $m$ frequencies. To extend this to the 2D case we adopt the standard definition of the 2D spatial frequency as follows: given a matrix of DCT coefficients, $D$, the spatial frequency $\phi$ of a given entry $(i,j)$ is

By Theorem \ref{thm:dctls}we use the lowest$m$frequencies for an optimal reconstruction. To extend this to the 2D case we adopt the standard definition of the 2D spatial frequency as follows: given a matrix of DCT coefficients, $D$, the spatial frequency $\phi$ of a given entry $(i,j)$ is

that takes a DCT block, $F$, and a mask $M$, and produces the masked DCT block by pixelwise multiplication. Our task will be to derive the form of $H$. We proceed by construction. The steps of such an algorithm naively would be

\begin{enumerate}

\item Take the inverse DCT of F,$I=\text{IDCT}(F)$

\item Pixelwise multiply $I'= I \circ F$

\item Take the DCT of $I$, $F'= DCT(I')$.

\item Take the inverse DCT of F:$I_{ij}= D^{\alpha\beta}_{ij}F_{\alpha\beta}$

\item Pixelwise multiply $I'_{ij}= I_{ij}M_{ij}$

\item Take the DCT of $I$, $F'_{\alpha'\beta'}= D^{ij}_{\alpha'\beta'}I'_{ij}$.

\end{enumerate}

We can express these steps as the following three linear map applications

Note that we have encapsulated the raising of the indices $\alpha, \beta$ on $F$ in this tensor, which we can do since, again, we are only considering the standard orthonormal basis.

We call $H$ the Harmonic Mixing Tensor since it gives all the spatial frequency permutations that we need. $H$ can be precomputed to speed up the resulting computation and the mask result is computed as

We call $H$ the Harmonic Mixing Tensor since it gives all the spatial frequency permutations that we need. $H$ can be precomputed to speed up the resulting computation.

To use this technique to compute the ReLu function, consider this alternative formulation

\newcommand{\nnm}{\mathrm{nnm}}

...

...

@@ -121,50 +76,7 @@ We call the function $\nnm(x)$ the nonnegative mask of $x$, this is our binary m

\begin{equation}

r(x) = \nnm(x)x

\end{equation}

This new function can be computed efficiently from few spatial frequencies with much higher accuracy since only the sign of the original function needs to be correct. In a sense this is one of the largest intervals that can be used to produce a mask, so it is a favorable situation for ASM. The DCT domain algorithm is summarized in Algorithm \ref{alg:asmr}.

We conclude this discussion by extending to the JPEG transform domain. This can be done by simply applying the rest of the JPEG transform operations to give a new definition of $H$ (see Section \ref{sec:backjlm} for the form of these tensors).

This new function can be computed efficiently from few spatial frequencies with much higher accuracy since only the sign of the original function needs to be correct. This algorithm is given in the supplementary material. To extend this method from the DCT domain to the JPEG transform domain, the rest of the missing JPEG tensor can simply be applied to $H$.

In other words, the (0,0) DCT coefficient is proportional to the mean of the block. Further, since the DCT basis is orthonormal, we can be sure that the remaining DCT coefficients do not depend on the mean. This means that to center the image we need only set the (0,0) DCT coefficient to 0. This, of course, is unaffected by the other steps of the JPEG transform. For tracking the running mean, we simply read this value. Note that this is a much more efficient operation than the mean computation in the spatial domain.

Next, to get the variance, we can use the following theorem.

Next, to get the variance, we can use the following theorem:

\begin{theorem}[The DCT Mean-Variance Theorem]

Given a set of samples of a signal $X$ such that $\e[X]=0$, let $Y$ be the DCT coefficients of $X$. Then

\begin{equation}

\var[X] = \e[Y^2]

\end{equation}

\end{theorem}

Intuitively this makes sense because the (0,0) coefficient represents the mean, the remaining DCT coefficients are essentially spatial oscillations around the mean, which should define the variance. Proof of this theorem is given in the supplementary material. Now we can compute the variance in the DCT domain, this has the same efficiency as the spatial domain computation.

Intuitively this makes sense because the (0,0) coefficient represents the mean, the remaining DCT coefficients are essentially spatial oscillations around the mean, which should define the variance. Proof of this theorem is given in the supplementary material.

To apply $\gamma$ and the variance, we use scalar multiplication. This follows directly from the definition of the DCT (and JPEG) as a linear map

\begin{equation}

J(\gamma I) = \gamma J(I)

\end{equation}

for scalar addition to apply $\beta$, note that since the (0,0) coefficient is the mean, and adding $\beta$ to every pixel in the image is equivalent to raising the mean by $\beta$, we can simply add $\beta$ to the (0,0) coefficient making sure to scale by the proportionality constant.

For scalar addition to apply $\beta$, note that since the (0,0) coefficient is the mean, and adding $\beta$ to every pixel in the image is equivalent to raising the mean by $\beta$, we can simply add $\beta$ to the (0,0) coefficient.

To extend this to JPEG is simple. For an $8\times8$ block, the proportionality constant for the (0,0) coefficient becomes $\frac{1}{2\sqrt{2\times8}}=\frac{1}{8}$. For this reason, many quantization matrices use $8$ as the (0,0) quantization coefficient. This means that the 0th block entry for a block does not need any proportionality constant, it stores exactly the mean. This means for adding $\beta$, we can simply set the 0th position to $\beta$ without performing and mathematical operations. The other operations are unaffected.

...

...

@@ -210,14 +122,7 @@ meaning that we can simply perform a component-wise addition of the JPEG compres

Global average pooling also has a simple formulation in JPEG domain. Recall from the discussion of Batch Normalization (Section \ref{sec:jdrbn}) that the (0,0) DCT coefficient is proportional to the mean of the image, and that the 0th element of the block after quantization is equal to the mean of the block. Then this element can be extracted channel-wise from each block and the global average pooling result is the channel-wise mean of these elements.

Furthermore, our network architecture for classification will always reduce the input images to a single block, which can then have its mean extracted and reported as the global average pooling result directly. This is illustrated graphically in Figure \ref{fig:gap}. Note the efficiency of this process, rather than the channel-wise averaging in a spatial domain network, we simply have an unconditional read operation, one per channel.

\caption{DCT Domain Global Average Pooling. The first DCT coefficient of each channel becomes the average pooling vector entries with no need for further computation.}

\label{fig:gap}

\end{figure}

Furthermore, our network architecture for classification will always reduce the input images to a single block, which can then have its mean extracted and reported as the global average pooling result directly. Note the efficiency of this process, rather than the channel-wise averaging in a spatial domain network, we simply have an unconditional read operation, one per channel.

% Pages are numbered in submission mode, and unnumbered in camera-ready

\ificcvfinal\pagestyle{empty}\fi

\setcounter{page}{4321}

\addbibresource{bibliography.bib}

\begin{document}

\title{Supplementary Material}

\maketitle

\section{Proof of the DCT Least Squares Theorem}

\section{Proof of the DCT Mean-Variance Theorem}

\section{Algorithms}

\begin{algorithm}

\caption{Direct Convolution Explosion. $K$ is an initial filter, $m, n$ are the input and output channels, $h, w$ are the image height and width, $s$ is the stride, $\star_s$ denotes the discrete convolution with stride $s$}