\section{Introduction}
The popularization of deep learning since the 2012 AlexNet \cite{krizhevsky2012imagenet} architecture has led to unprecedented gains for the field. Many applications that were once academic are now seeing widespread use of machine learning with success. Although the performance of deep neural networks far exceeds classical methods, there are still some major problems with the algorithms from a computational standpoint. Deep networks require massive amounts of data to learn effectively, especially for complex problems \cite{najafabadi2015deep}. Further, the computational and memory demands of deep networks mean that for many large problems, only large institutions with GPU clusters can afford to train from scratch, leaving the average scientist to fine tune pre-trained weights.
This problem has been addressed many times in the literature. Batch normalization \cite{ioffe2015batch} is ubiquitous in modern networks to accelerate their convergence. Residual learning \cite{he2016deep} allows for much deeper networks to learn effective mappings without overfitting. Techniques such as pruning and weight compression \cite{han2015deep} are becoming more commonplace. As problems become even larger and more complex, these techniques are increasingly being relied upon for efficient training and inference.
We tackle this problem at the level of the image representation. The JPEG image compression algorithm is the most widespread image file format. Traditionally, the first step in using JPEGs for machine learning is to decompress them. We propose to skip this step and instead reformulate the ResNet architecture to perform its operations directly on compressed images. The goal is to produce a new network that is mathematically equivalent to the spatial domain network, but which operates on compressed images by composing the compression transform into the network weights, which can be done because they are both linear maps. Because the ReLu function is non-linear, we develop an approximation technique for it. This is a general method and, to our knowledge, is the first attempt at formulating a piecewise linear function in the transform domain presented in the literature.
The contributions of this work are as follows
\begin{enumerate}
\item The general method for expressing convolutional networks in the JPEG domain
\item Concrete formulation for residual blocks to perform classification
\item A model conversion algorithm to apply pretrained spatial domain networks to JPEG images
\item Approximated Spatial Masking: the first general technique for application of piecewise linear functions in the transform domain
\end{enumerate}
By skipping the decompression step and by operating on the sparser compressed format, we show a notable increase in speed for training and inference.