This process is repeated independently for each image plane. In most cases, the original image is first transformed from the RGB color space to YUV, and chroma subsampling is applied, since the human visual system is less sensitive to small color changes than to small brightness changes \cite{winkler2001vision}. The decompression algorithm is the inverse process. Note that the rounding step (step 5) has no inverse and is simply skipped during decompression; it is the only step in JPEG compression where information is lost and is the cause of artifacts in decompressed JPEG images.
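The lossiness of the rounding step can be seen in a small numerical sketch (our own illustration, not part of the method; the flat 16-entry quantization table is arbitrary). Every step except the rounding round-trips exactly:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis: D @ x is the 1-D DCT of x, D.T @ c inverts it.
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)
    return d

D = dct_matrix()
block = np.random.default_rng(0).integers(0, 256, (8, 8)).astype(float) - 128.0
Q = np.full((8, 8), 16.0)                 # arbitrary flat quantization table

coeffs = D @ block @ D.T                  # 2-D DCT: linear, exactly invertible
quantized = np.round(coeffs / Q)          # step 5: the only lossy operation
restored = D.T @ (quantized * Q) @ D      # decompression skips the rounding

exact_roundtrip = D.T @ coeffs @ D        # without rounding: exact recovery
```

Decompression multiplies by $Q$ and inverts the DCT, but the fractional parts discarded by the rounding cannot be recovered; that residual is precisely the source of JPEG artifacts.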
...
...
A key observation about the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast}, is that steps 1-4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the desired operations directly on the compressed representation. Step 5, the rounding step, cannot be undone, and Step 6, the entropy coding, is nonlinear and therefore must be undone. We define the JPEG Transform Domain as the output of Step 4 of the JPEG encoding algorithm. Inputs to the algorithms described here are therefore JPEGs after reversing the entropy coding.
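As a concrete sketch (our own illustration: for brevity it fuses only block splitting, the 2-D DCT, and coefficient vectorization, omitting the equally linear color transform and quantizer division, and uses row-major rather than zigzag coefficient order), the composed map can be materialized as a single tensor and applied with one contraction:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix.
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)
    return d

B, H, W = 8, 16, 16                      # a 2x2 grid of 8x8 blocks
D = dct_matrix(B)

# J[s, r, x, y, k]: weight taking pixel (s, r) to coefficient k of block (x, y).
J = np.zeros((H, W, H // B, W // B, B * B))
for s in range(H):
    for r in range(W):
        x, i = divmod(s, B)
        y, j = divmod(r, B)
        for k in range(B * B):
            u, v = divmod(k, B)
            J[s, r, x, y, k] = D[u, i] * D[v, j]

I = np.random.default_rng(1).standard_normal((H, W))
I_prime = np.einsum("srxyk,sr->xyk", J, I)   # steps 1-4 as one linear map
```

Each `I_prime[x, y]` reshaped to $8\times8$ agrees with the blockwise DCT `D @ block @ D.T` computed directly on the corresponding block.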
Formally, we model a single plane image as the type (0, 2) tensor $I \in V^*\otimes V^*$ for some vector space $V$ and its dual $V^*$. Note that we take all tensors as a tensor product combination of $V$ and $V^*$ without loss of generality. In real images, the dimensions have physical meaning (\eg width and height of the image) and will generally be of different sizes. The analysis presented in this work applies to any vector space, although in real images we are dealing with floating point numbers. The basis of $V$ is always the standard orthonormal basis; this is important as it allows the free raising and lowering of indices without the use of a metric tensor.
We define the JPEG transform $J \in V \otimes V \otimes V^*\otimes V^*\otimes V^*$, a type (2, 3) tensor. Then $J$ represents a linear map $J: V^*\otimes V^*\rightarrow V^*\otimes V^*\otimes V^*$ which is computed as (in Einstein notation)
\begin{equation}
I'_{xyk} = J^{sr}_{xyk}I_{sr}
...
...
\end{equation}
noting that, for all tensors other than $\widetilde{S}$, we have freely raised and lowered indices without the use of a metric tensor on $V$ since we consider only the standard orthonormal basis, as stated earlier.
Next consider a linear map $C: V^*\otimes V^*\rightarrow V^*\otimes V^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we would first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: V^*\otimes V^*\otimes V^*\rightarrow V^*\otimes V^*\otimes V^*$ as
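A small numerical sketch (ours, using a single $8\times8$ block and a random matrix standing in for the arbitrary pixel manipulation $C$) shows the point of forming $\Xi$: the decompress-manipulate-recompress pipeline collapses to one precomputed operator acting on coefficients.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix.
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)
    return d

D = dct_matrix()
J = np.kron(D, D)     # flattened 2-D DCT: coeffs = J @ pixels (row-major vec)
J_inv = J.T           # orthonormal, so the decompression map is the transpose

rng = np.random.default_rng(2)
C = rng.standard_normal((64, 64))   # stand-in for an arbitrary pixel operation
Xi = J @ C @ J_inv                  # compose once: compress . C . decompress

pixels = rng.standard_normal(64)
coeffs = J @ pixels
direct = J @ (C @ (J_inv @ coeffs)) # three-step pixel-domain pipeline
fused = Xi @ coeffs                 # one transform-domain application
```

Since $\Xi$ is precomputed, the per-image cost is a single linear map on the compressed representation rather than a full decompress/recompress cycle.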
\subsection{Model Conversion}
For this first experiment, we provide empirical evidence that the JPEG formulation presented in this paper is mathematically equivalent to the spatial domain network. To show this, we train 100 spatial domain models on each of three datasets and report their mean testing accuracies. We then use model conversion to transform the pretrained models to the JPEG domain and report the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks, and the exact (15 spatial frequency) ReLu formulation is used. The result of this test is given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.
\begin{table}[h]
\begin{tabular}{|r|l|l|l|}
\hline
Dataset & Spatial & JPEG & Deviation \\\hline
MNIST & 0.988 & 0.988 & 2.999e-06 \\\hline
CIFAR10 & 0.725 & 0.725 & 9e-06 \\\hline
CIFAR100 & 0.385 & 0.385 & 1e-06 \\\hline
\end{tabular}
\caption{Model conversion accuracies. Spatial and JPEG testing accuracies are the same to within floating point error.}
\label{tab:mc}
\end{table}
\subsection{ReLu Approximation Accuracy}
\label{sec:exprla}
Next, we wish to examine the impact of the ReLu approximation. We start by examining the raw error on individual $8\times8$ blocks. For this test, we take random $4\times4$ pixel blocks in the range $[-1, 1]$ and scale them to $8\times8$ using a box filter. Fully random $8\times8$ blocks do not accurately represent the statistics of real images and are known to be a worst case for the DCT transform; the $4\times4$ blocks allow for a large random sample size while still approximating real image statistics. We take 10 million such blocks and compute the average RMSE of our Approximated Spatial Masking (ASM) technique, comparing it to computing ReLu directly on the approximation (APX). This test is repeated for one to fifteen spatial frequencies. The result, shown in Figure \ref{fig:rba}, indicates that our ASM method gives a better approximation (lower RMSE) throughout the range of spatial frequencies.
\begin{figure}[h]
\caption{ReLu blocks accuracy. Our ASM method consistently gives lower error than the naive approximation method.}
\label{fig:rba}
\end{figure}
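The protocol above can be sketched as follows (our own reconstruction, not the paper's code: here \texttt{asm} estimates the nonnegativity mask from the $m$-frequency approximation and applies it to the exact block, \texttt{apx} applies ReLu to the approximation itself, frequencies are grouped by antidiagonal $u+v$, and the sample count is reduced for illustration):

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix.
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)
    return d

D = dct_matrix()
u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")

def lowfreq(block, m):
    # Reconstruction from the m lowest spatial frequency bands (u + v < m).
    return D.T @ np.where(u + v < m, D @ block @ D.T, 0.0) @ D

rng = np.random.default_rng(3)

def rmse_at(m, n_blocks=2000):
    err_asm = err_apx = 0.0
    for _ in range(n_blocks):
        small = rng.uniform(-1.0, 1.0, (4, 4))
        block = np.kron(small, np.ones((2, 2)))  # box-filter upscale to 8x8
        true_relu = np.maximum(block, 0.0)
        approx = lowfreq(block, m)
        asm = block * (approx >= 0.0)            # mask the exact values
        apx = np.maximum(approx, 0.0)            # ReLu on the approximation
        err_asm += np.sqrt(np.mean((asm - true_relu) ** 2))
        err_apx += np.sqrt(np.mean((apx - true_relu) ** 2))
    return err_asm / n_blocks, err_apx / n_blocks
```

With all 15 bands both methods are exact; for truncated bands the masking error is typically smaller than the direct approximation error, mirroring Figure \ref{fig:rba}.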
This test provides a strong motivation for the ASM method, so we move on to testing it in the model conversion setting. For this test, we again train 100 spatial domain models and then perform model conversion with the ReLu layers using from 1 to 15 spatial frequencies. We again compare our ASM method with the APX method. The result is given in Figure \ref{fig:ra}; again the ASM method outperforms the APX method.
\begin{figure}[h]
\caption{ReLu model conversion accuracy. ASM again outperforms the naive approximation. The spatial domain accuracy is given for each dataset with dashed lines.}
\label{fig:ra}
\end{figure}
As a final test, we show that if the models are trained directly in the JPEG domain, the CNN weights learn to cope with the approximation, so fewer spatial frequencies are required for good accuracy. We again compare ASM to APX in this setting.
% Pages are numbered in submission mode, and unnumbered in camera-ready
\ificcvfinal\pagestyle{empty}\fi
\setcounter{page}{4321}
\addbibresource{bibliography.bib}
\begin{document}
\title{Supplementary Material}
\maketitle
\section{Proof of the DCT Least Squares Theorem}
\section{Proof of the DCT Mean-Variance Theorem}
\section{Algorithms}
\begin{algorithm}
\caption{Direct Convolution Explosion. $K$ is an initial filter; $m, n$ are the input and output channels; $h, w$ are the image height and width; $s$ is the stride; $\star_s$ denotes discrete convolution with stride $s$.}