\section{Experiments}
We present experimental evidence for the efficacy of our method, starting with a brief discussion of the network architectures and datasets we use. We use model conversion as a sanity check, verifying that the JPEG model with the exact ReLU matches the testing accuracy of a spatial domain model. Next, we show how the ReLU approximation accuracy affects overall network performance. We conclude by showing the training and testing time advantage of our method.
\subsection{Network Architectures and Datasets}
Since we are concerned with reproducing the inference results of spatial domain networks, we choose the MNIST \cite{lecun1998mnist} and CIFAR-10/100 \cite{krizhevsky2009learning} datasets since they are easy to work with. The MNIST images are padded to $32 \times 32$ to ensure a whole number of JPEG blocks. Our network architecture, shown in Figure \ref{fig:na}, is similarly simple: the classification network consists of three residual blocks, with the final two performing downsampling so that the final feature map consists of a single JPEG block.
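The block-alignment constraint can be sketched with a small helper. Here `pad_to_block_grid` is a hypothetical illustration, not part of our implementation: padding a $28 \times 28$ MNIST digit to $32 \times 32$ yields a $4 \times 4$ grid of whole $8 \times 8$ JPEG blocks.

```python
import numpy as np

def pad_to_block_grid(img: np.ndarray, block: int = 8) -> np.ndarray:
    """Zero-pad an image so each side is a multiple of the JPEG block size.
    (Hypothetical helper for illustration, not the authors' code.)"""
    h, w = img.shape
    ph = (-h) % block
    pw = (-w) % block
    # Split the padding evenly between the two sides (extra pixel goes last).
    return np.pad(img, ((ph // 2, ph - ph // 2), (pw // 2, pw - pw // 2)))

# A 28x28 MNIST digit becomes 32x32, i.e. a 4x4 grid of 8x8 JPEG blocks.
digit = np.zeros((28, 28))
padded = pad_to_block_grid(digit)
print(padded.shape)  # (32, 32)
```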
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{figures/network.pdf}
\caption{Simple network architecture. $T$ indicates the batch size.}
\label{fig:na}
\end{figure}
\subsection{Model Conversion}
For this first experiment, we provide empirical evidence that the JPEG formulation presented in this paper is mathematically equivalent to a spatial domain network. To show this, we train 100 spatial domain models on each of the three datasets and report their mean testing accuracies. We then use model conversion to transform the pretrained models to the JPEG domain and report the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks, and the exact (15 spatial frequency) ReLU formulation is used. The results are given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.
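Why conversion can be exact is illustrated by the following NumPy/SciPy sketch (a minimal illustration, not our implementation): because the orthonormal block DCT is an invertible linear map $D$, any linear layer $W$ has an exact DCT-domain counterpart $D W D^{\top}$.

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))    # one spatial block
W = rng.standard_normal((64, 64))  # an arbitrary linear "layer" on the block

y_spatial = (W @ x.ravel()).reshape(8, 8)

# Build the 64x64 block-DCT matrix D column by column: D @ vec(x) = vec(DCT(x)).
D = np.stack([dctn(np.eye(64)[:, i].reshape(8, 8), norm='ortho').ravel()
              for i in range(64)], axis=1)

# The converted layer acts on DCT coefficients and reproduces the spatial result.
W_dct = D @ W @ D.T
y_dct = (W_dct @ dctn(x, norm='ortho').ravel()).reshape(8, 8)
print(np.allclose(idctn(y_dct, norm='ortho'), y_spatial))  # True
```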
\begin{table}[h]
\centering
\begin{tabular}{|r|l|l|l|}
\hline
Dataset & Spatial & JPEG & Deviation \\ \hline
MNIST & 0.988 & 0.988 & $3.0 \times 10^{-6}$ \\ \hline
CIFAR-10 & 0.725 & 0.725 & $9.0 \times 10^{-6}$ \\ \hline
CIFAR-100 & 0.385 & 0.385 & $1.0 \times 10^{-6}$ \\ \hline
\end{tabular}
\caption{Model conversion accuracies. Spatial and JPEG testing accuracies are the same to within floating point error.}
\label{tab:mc}
\end{table}
\subsection{ReLU Approximation Accuracy}
\label{sec:exprla}
Next, we examine the impact of the ReLU approximation. We start by examining the raw error on individual $8 \times 8$ blocks. For this test, we take random $4 \times 4$ pixel blocks in the range $[-1, 1]$ and scale them to $8 \times 8$ using a box filter. Fully random $8 \times 8$ blocks do not accurately represent the statistics of real images and are known to be a worst case for the DCT; the $4 \times 4$ blocks allow for a large random sample size while still approximating real image statistics. We take 10 million such blocks and compute the average RMSE of our Approximated Spatial Masking (ASM) technique, comparing it to computing ReLU directly on the approximation (APX). This test is repeated for one through fifteen spatial frequencies. The result, shown in Figure \ref{fig:rba}, is that our ASM method gives a better approximation (lower RMSE) throughout the range of spatial frequencies.
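The block generation and the APX baseline error measurement can be sketched as follows (a simplified illustration using SciPy; the ASM computation itself is omitted, and `truncate` assumes a simple anti-diagonal low-frequency mask rather than the exact JPEG zigzag order):

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)

def random_block():
    """Random 4x4 block in [-1, 1], box-upsampled to 8x8."""
    return np.kron(rng.uniform(-1, 1, (4, 4)), np.ones((2, 2)))

def truncate(coeffs, n):
    """Keep DCT coefficients on the first n anti-diagonals (i + j < n).
    Assumed frequency grouping for illustration only."""
    i, j = np.indices((8, 8))
    return coeffs * (i + j < n)

def apx_rmse(n, trials=1000):
    """RMSE of ReLU computed directly on an n-frequency approximation (APX)."""
    errs = []
    for _ in range(trials):
        b = random_block()
        approx = idctn(truncate(dctn(b, norm='ortho'), n), norm='ortho')
        errs.append(np.sqrt(np.mean((np.maximum(b, 0) - np.maximum(approx, 0)) ** 2)))
    return float(np.mean(errs))

# Error shrinks as more spatial frequencies are kept; at 15 the DCT is exact.
print([round(apx_rmse(n), 3) for n in (1, 4, 8, 15)])
```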
\begin{figure*}
\centering
\begin{subfigure}{0.33\textwidth}
\captionsetup{width=.8\linewidth}
\centering
\includegraphics[width=\textwidth]{plots/relu_blocks.eps}
\caption{ReLU block error. Our ASM method consistently gives lower error than the naive approximation method.}
\label{fig:rba}
\end{subfigure}%
\begin{subfigure}{0.33\textwidth}
\captionsetup{width=.8\linewidth}
\centering
\includegraphics[width=\textwidth]{plots/relu_accuracy.eps}
\caption{ReLU model conversion accuracy. ASM again outperforms the naive approximation. The spatial domain accuracy is given for each dataset with dashed lines.}
\label{fig:ra}
\end{subfigure}%
\begin{subfigure}{0.33\textwidth}
\captionsetup{width=.8\linewidth}
\centering
\includegraphics[width=\textwidth]{plots/relu_training.eps}
\caption{ReLU training accuracy. The network weights learn to correct for the ReLU approximation, allowing fewer spatial frequencies to be used while maintaining high accuracy.}
\label{fig:rt}
\end{subfigure}
\end{figure*}
This test provides strong motivation for the ASM method, so we move on to testing it in the model conversion setting. For this test, we again train 100 spatial domain models and then perform model conversion with the ReLU layers using 1 to 15 spatial frequencies, again comparing our ASM method with the APX method. The result is given in Figure \ref{fig:ra}; again, the ASM method outperforms the APX method.
As a final test, we show that if the models are trained in the JPEG domain, the CNN weights learn to cope with the approximation, so fewer spatial frequencies are required for good accuracy. We again compare ASM to APX in this setting. The result, shown in Figure \ref{fig:rt}, is that ASM again outperforms APX.
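The core ASM idea can be sketched in a few lines, under the simplifying assumption that the mask is applied in the pixel domain for clarity (the actual method applies it in the DCT domain): estimate the ReLU sign pattern from a cheap low-frequency reconstruction, then mask the full-precision block, so error arises only at sign-misclassified pixels.

```python
import numpy as np
from scipy.fft import dctn, idctn

def asm_relu(coeffs, n):
    """Approximated Spatial Masking (sketch): estimate the ReLU support mask
    from an n-frequency reconstruction, then apply the binary mask to the
    full-precision block. Mask applied spatially here for clarity; the
    actual method applies it in the DCT domain."""
    i, j = np.indices(coeffs.shape)
    approx = idctn(coeffs * (i + j < n), norm='ortho')  # cheap reconstruction
    mask = approx >= 0                                  # estimated ReLU support
    return dctn(idctn(coeffs, norm='ortho') * mask, norm='ortho')

# At 15 spatial frequencies the mask is exact, so ASM reproduces exact ReLU.
rng = np.random.default_rng(1)
b = np.kron(rng.uniform(-1, 1, (4, 4)), np.ones((2, 2)))
c = dctn(b, norm='ortho')
exact = dctn(np.maximum(b, 0), norm='ortho')
print(np.allclose(asm_relu(c, 15), exact))  # True
```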
\subsection{Efficiency of Training and Testing}
\begin{figure}[b]
\includegraphics[width=\linewidth]{plots/throughput.eps}
\caption{Throughput. The JPEG model has a more complex gradient, which limits the speed improvement during training. Inference, however, sees considerably higher throughput.}
\label{fig:tp}
\end{figure}
Finally, we report the throughput for training and testing. For each of the three datasets, we train and test both a spatial model and a JPEG model, measure the time taken, and convert it to an average throughput. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results, shown in Figure \ref{fig:tp}, show that the JPEG model outperforms the spatial model in all cases, but that the training speedup is limited, likely because of the more complex gradients of the convolution and ReLU operations. At inference time, however, performance is greatly improved over the spatial model.
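The throughput measurement can be sketched with a generic timing harness (a hypothetical helper, not our benchmark code): run a warm-up batch, then average images per second over repeated forward passes.

```python
import time
import numpy as np

def throughput(fn, batch, n_iters=50):
    """Average images/sec of fn over n_iters batches.
    (Generic timing harness for illustration, not the authors' benchmark.)"""
    fn(batch)  # warm-up pass (e.g. to trigger JIT / autotuning)
    start = time.perf_counter()
    for _ in range(n_iters):
        fn(batch)
    elapsed = time.perf_counter() - start
    return n_iters * len(batch) / elapsed

# Example: a stand-in "model" over a batch of 40 images, as in the experiment.
batch = np.zeros((40, 32, 32))
model = lambda x: np.maximum(x, 0)
print(f"{throughput(model, batch):.0f} images/sec")
```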