@@ -40,32 +40,32 @@ Theorem \ref{thm:dctls} states that a reconstruction using the $m$ lowest spatia

A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast}, is that Steps 1--4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, cannot be undone, and Step 6, the entropy coding, is nonlinear and therefore must be undone. We define the JPEG Transform Domain as the output of Step 4 in the JPEG encoding algorithm. Inputs to the algorithms described here will be JPEGs after reversing the entropy coding.

Formally, we model a single plane image as the type (0, 2) tensor $I \in V^* \otimes V^*$ for some vector space $V$ and its dual $V^*$. Note that we take all tensors as a tensor product combination of $V$ and $V^*$ without loss of generality. In real images, the dimensions have physical meaning (\eg width and height of the image) and will be of different sizes. The basis of $V$ is always the standard orthonormal basis; this is important as it allows the free raising and lowering of indices without the use of a metric tensor.

We define the JPEG transform $J \in V \otimes V \otimes V^* \otimes V^* \otimes V^*$. $J$ represents a linear map $J: V^* \otimes V^* \rightarrow V^* \otimes V^* \otimes V^*$ which is computed as (in Einstein notation)

Formally, we model a single plane image as the type (0, 2) tensor $I \in H^* \otimes W^*$ where $H$ and $W$ are vector spaces and $*$ denotes the dual space. We always use the standard orthonormal basis for these vector spaces; this is important as it allows the free raising and lowering of indices without the use of a metric tensor.

We define the JPEG transform $J \in H \otimes W \otimes X^* \otimes Y^* \otimes K^*$. $J$ represents a linear map $J: H^* \otimes W^* \rightarrow X^* \otimes Y^* \otimes K^*$ which is computed as (in Einstein notation)

\begin{equation}

I'_{xyk} = J^{sr}_{xyk}I_{sr}

I'_{xyk} = J^{hw}_{xyk}I_{hw}

\end{equation}

and we say that $I'$ is the representation of $I$ in the JPEG transform domain. In the above equation, the indices $s,r$ give the pixel position, the indices $x,y$ give the block position, and the index $k$ gives the offset into the block.

and we say that $I'$ is the representation of $I$ in the JPEG transform domain. In the above equation, the indices $h,w$ give the pixel position, the indices $x,y$ give the block position, and the index $k$ gives the offset into the block.

The form of $J$ is constructed from the JPEG compression steps listed in the previous section. Let the linear map $B: V^* \otimes V^* \rightarrow V^* \otimes V^* \otimes V^* \otimes V^*$ be defined as

The form of $J$ is constructed from the JPEG compression steps listed in the previous section. Let the linear map $B: H^* \otimes W^* \rightarrow X^* \otimes Y^* \otimes I^* \otimes J^*$ be defined as

\begin{equation}

B^{sr}_{xyij} = \left\{\begin{array}{lr} 1 &\text{$s,r$ belongs in block $x,y$ at offset $i,j$}\\ 0 &\text{otherwise}\end{array}\right.

B^{hw}_{xyij} = \left\{\begin{array}{lr} 1 &\text{$h,w$ belongs in block $x,y$ at offset $i,j$}\\ 0 &\text{otherwise}\end{array}\right.

\end{equation}

then $B$ can be used to break the image represented by $I$ into blocks of a given size such that the first two indices $x,y$ index the block position and the last two indices $i,j$ index the offset into the block.
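
As a concrete illustration of $B$, the following sketch assumes zero-based indices and square, non-overlapping blocks of side $b$ (JPEG uses $b = 8$). The symbol $b$ and the indexing convention are assumptions made here for illustration and are not fixed by the definition above. Under these assumptions the membership condition can be written in closed form as

\begin{equation}
B^{hw}_{xyij} = \left\{\begin{array}{lr} 1 & \text{$h = bx + i$ and $w = by + j$}\\ 0 & \text{otherwise}\end{array}\right.
\end{equation}

with $0 \leq i, j < b$. For example, with $b = 8$ the pixel at $h = 19$, $w = 5$ lies in block $x = 2$, $y = 0$ at offset $i = 3$, $j = 5$, since $19 = 8 \cdot 2 + 3$ and $5 = 8 \cdot 0 + 5$. The corresponding entry of $B$ is therefore $1$, and every other entry with $h = 19$, $w = 5$ is $0$.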

Next, let the linear map $D: V^* \otimes V^* \rightarrow V^* \otimes V^*$ be defined as

Next, let the linear map $D: I^* \otimes J^* \rightarrow A^* \otimes B^*$ be defined as

then $D$ represents the 2D discrete forward (and inverse) DCT. Let $Z: V^* \otimes V^* \rightarrow V^*$ be defined as

then $D$ represents the 2D discrete forward (and inverse) DCT. Let $Z: A^* \otimes B^* \rightarrow \Gamma^*$ be defined as

\begin{equation}

Z^{\alpha\beta}_\gamma = \left\{\begin{array}{lr} 1 &\text{$\alpha, \beta$ is at $\gamma$ under zigzag ordering}\\ 0 &\text{otherwise}\end{array}\right.

\end{equation}

then $Z$ creates the zigzag ordered vectors. Finally, let $S: V^* \rightarrow V^*$ be

then $Z$ creates the zigzag ordered vectors. Finally, let $S: \Gamma^* \rightarrow K^*$ be

\begin{equation}

S^\gamma_k = \frac{1}{q_k}

...

...

@@ -75,7 +75,7 @@ where $q_k$ is a quantization coefficient, $S$ can be used to scale the vector e

With linear maps for each step of the JPEG transform, we can then apply them to each other to create the $J$ tensor that was described at the beginning of this section
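
As a sketch of this composition, each map's output indices can be contracted with the next map's input indices. The index placement on $D$ below is an assumption inferred from its stated domain and codomain, $D: I^* \otimes J^* \rightarrow A^* \otimes B^*$, so this should be read as one consistent way to write the product and not as a definitive form:

\begin{equation}
J^{hw}_{xyk} = B^{hw}_{xyij} D^{ij}_{\alpha\beta} Z^{\alpha\beta}_{\gamma} S^{\gamma}_{k}
\end{equation}

Reading left to right, $B$ splits the pixels $h,w$ into blocks, $D$ transforms each block offset $i,j$ into frequencies $\alpha,\beta$, $Z$ reorders the frequencies into the zigzag position $\gamma$, and $S$ scales the result into the final coefficient index $k$. The free indices $h,w$ (upper) and $x,y,k$ (lower) match those of $J$ in the defining equation $I'_{xyk} = J^{hw}_{xyk}I_{hw}$.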

The inverse mapping also exists as a tensor $\widetilde{J}$ which can be defined using the same linear maps with the exception of $S$. Let $\widetilde{S}$ be

...

...

@@ -86,14 +86,14 @@ The inverse mapping also exists as a tensor $\widetilde{J}$ which can be defined

noting that, for all tensors other than $\widetilde{S}$, we have freely raised and lowered indices without the use of a metric tensor on $V$ since we consider only the standard orthonormal basis, as stated earlier.

noting that, for all tensors other than $\widetilde{S}$, we have freely raised and lowered indices without the use of a metric tensor since we consider only the standard orthonormal basis, as stated earlier.

Next consider a linear map $C: V^* \otimes V^* \rightarrow V^* \otimes V^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we would first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: V^* \otimes V^* \otimes V^* \rightarrow V^* \otimes V^* \otimes V^*$ as

Next consider a linear map $C: H^* \otimes W^* \rightarrow H^* \otimes W^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we would first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: X^* \otimes Y^* \otimes K^* \rightarrow X^* \otimes Y^* \otimes K^*$ as

which applies $C$ in the JPEG transform domain. There are two important points to note about $\Xi$. The first is that, although it encapsulates decompression, applying $C$, and compressing, it uses far fewer operations than doing these processes separately, since the coefficients of the three maps are multiplied out in advance. The second is that it is mathematically equivalent to performing $C$ on the decompressed image and compressing the result; it is not an approximation.
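
To make the equivalence concrete, one way to write the composition in index form is sketched below. The primed index names and the symbol $I''$ for the result are chosen here for illustration. Decompressing with $\widetilde{J}$, applying $C$, and compressing with $J$ contract into a single tensor

\begin{equation}
\Xi^{xyk}_{x'y'k'} = \widetilde{J}^{xyk}_{hw} C^{hw}_{h'w'} J^{h'w'}_{x'y'k'}
\end{equation}

so that applying it to a JPEG domain image, $I''_{x'y'k'} = \Xi^{xyk}_{x'y'k'} I'_{xyk}$, expands by associativity of the contractions to

\begin{equation}
I''_{x'y'k'} = J^{h'w'}_{x'y'k'} \left( C^{hw}_{h'w'} \left( \widetilde{J}^{xyk}_{hw} I'_{xyk} \right) \right)
\end{equation}

The innermost term is the decompressed image, the middle term applies $C$ to it, and the outer term compresses the result. Because $\Xi$ depends only on $J$, $\widetilde{J}$, and $C$, it can be computed once and reused for every image.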

The ResNet architecture, generally, consists of blocks of four basic operations: Convolution (potentially strided), ReLU, Batch Normalization, and Component-wise addition, with the blocks terminating with a global average pooling operation \cite{he2016deep} before a fully connected layer performs the final classification. Our goal will be to develop JPEG domain equivalents to these five operations. Network activations are given as a single tensor holding a batch of multi-channel images, that is $I \in V^* \otimes V^* \otimes V^* \otimes V^*$.

The ResNet architecture, generally, consists of blocks of four basic operations: Convolution (potentially strided), ReLU, Batch Normalization, and Component-wise addition, with the blocks terminating with a global average pooling operation \cite{he2016deep} before a fully connected layer performs the final classification. Our goal will be to develop JPEG domain equivalents to these five operations. Network activations are given as a single tensor holding a batch of multi-channel images, that is $I \in N^* \otimes C^* \otimes H^* \otimes W^*$.

\subsection{Convolution}

The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}. The convolution operation in the spatial domain is a shorthand notation for a linear map $C: V^* \otimes V^* \otimes V^* \otimes V^* \rightarrow V^* \otimes V^* \otimes V^* \otimes V^*$. Since the same operation is applied to each image in the batch, we can represent $C$ with a type (3, 3) tensor. The entries of this tensor give the coefficient for a given pixel in a given input channel for each pixel in each output channel. This notation can describe any multichannel linear pixel manipulation. We now develop the algorithm for representing discrete convolutional filters using this data structure.

The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}. The convolution operation in the spatial domain is a shorthand notation for a linear map $C: N^* \otimes C^* \otimes H^* \otimes W^* \rightarrow N^* \otimes C^* \otimes H^* \otimes W^*$. Since the same operation is applied to each image in the batch, we can represent $C$ with a type (3, 3) tensor. The entries of this tensor give the coefficient for a given pixel in a given input channel for each pixel in each output channel. This notation can describe any multichannel linear pixel manipulation. We now develop the algorithm for representing discrete convolutional filters using this data structure.

A naive algorithm can simply copy randomly initialized convolution weights into this larger structure following the formula for a 2D discrete cross-correlation and then apply the JPEG compression and decompression tensors to the result, but this is difficult to parallelize and incurs additional memory overhead to store the spatial domain operation. A more efficient algorithm would produce the JPEG domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Note that since $\widetilde{J} \in V \otimes V \otimes V \otimes V^* \otimes V^*$, the last two indices of $\widetilde{J}$ form a single-channel image under our image model (\eg the last two indices are in $V^* \otimes V^*$). If the convolution can be applied to this ``image'', then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor $\widehat{J} \in V \otimes V^* \otimes V^*$

A naive algorithm can simply copy randomly initialized convolution weights into this larger structure following the formula for a 2D discrete cross-correlation and then apply the JPEG compression and decompression tensors to the result, but this is difficult to parallelize and incurs additional memory overhead to store the spatial domain operation. A more efficient algorithm would produce the JPEG domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Note that since $\widetilde{J} \in X \otimes Y \otimes K \otimes H^* \otimes W^*$, the last two indices of $\widetilde{J}$ form a single-channel image under our image model (\eg the last two indices are in $H^* \otimes W^*$). If the convolution can be applied to this ``image'', then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor $\widehat{J} \in N \otimes H^* \otimes W^*$

by reshaping $\widetilde{J}$ and treating this as a batch of images. Then, given the initialized convolution filter $K$, computing

\begin{equation}

\widehat{C}^b = \widehat{J}^b \star K

\end{equation}

gives us the desired map. After reshaping $\widehat{C}$ back to the original shape of $\widetilde{J}$ to give $\widetilde{C}$, the full compressed domain operation can be expressed as

where $m$ and $n$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batch convolution operation which is available in all GPU accelerated deep learning libraries. The algorithm is shown in the supplementary material.
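
As a rough illustration of the sizes involved in the reshaping step, assume a $32 \times 32$ single-channel image and the standard $8 \times 8$ JPEG block size (both chosen here only as an example). Then

\begin{equation}
\dim X = \dim Y = \frac{32}{8} = 4, \qquad \dim K = 8 \cdot 8 = 64
\end{equation}

so reshaping $\widetilde{J}$ into $\widehat{J}$ yields a batch of $4 \cdot 4 \cdot 64 = 1024$ images, each of size $32 \times 32$. A single batched convolution of this stack with the filter $K$ then produces $\widehat{C}$, which has the same batch dimension and is reshaped back to the block layout of $\widetilde{J}$.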

...

...

@@ -43,7 +43,7 @@ What we can do to solve the first problem is to look at the ranges that the line

The final problem is that we now have a mask in the spatial domain, but the original image is in the DCT domain. There is a well known algorithm for pixelwise multiplication of two DCT images \cite{smith1993algorithms}, but it would require the mask to also be in the DCT domain. Fortunately, there is a straightforward solution that comes as a result of the multilinear analysis given in Section \ref{sec:backjlm}. Consider the bilinear map

that takes a DCT block, $F$, and a mask $M$, and produces the masked DCT block by pixelwise multiplication. Our task will be to derive the form of $H$. We proceed by construction. The steps of such an algorithm naively would be