...

@@ -8,5 +8,5 @@
 \maketitle
 \begin{abstract}
-We introduce a general method of performing Residual Network inference and learning in the JPEG transform domain that allows the network to consume compressed images as input. Our formulation leverages the linearity of the JPEG transform to redefine convolution and batch normalization with a tune-able numerical approximation for ReLu. The result is mathematically equivalent to the spatial domain network up to the ReLu approximation accuracy. A formulation for image classification and a model conversion algorithm for spatial domain networks are given as examples of the method. We show that the sparsity of the JPEG format allows for faster processing of the images with little to no penalty in the network accuracy.
+We introduce a general method of performing Residual Network inference and learning in the JPEG transform domain that allows the network to consume compressed images as input. Our formulation leverages the linearity of the JPEG transform to redefine convolution and batch normalization with a tune-able numerical approximation for ReLu. The result is mathematically equivalent to the spatial domain network up to the ReLu approximation accuracy. A formulation for image classification and a model conversion algorithm for spatial domain networks are given as examples of the method. We show that the sparsity of the JPEG format allows for faster processing of images with little to no penalty in the network accuracy.
 \end{abstract}
\ No newline at end of file
Spatial, ASM, APX
0.38375, 0.1978899985551834, 0.18126000463962555
0.38375, 0.25641998648643494, 0.2582799792289734
0.38375, 0.29986000061035156, 0.3050000071525574
0.38375, 0.32446998357772827, 0.33258000016212463
0.38375, 0.34317001700401306, 0.35148999094963074
0.38375, 0.35694998502731323, 0.36820995807647705
0.38375, 0.3684000074863434, 0.38540002703666687
0.38375, 0.3816799521446228, 0.3857400119304657
0.38375, 0.3818100094795227, 0.3752700090408325
0.38375, 0.38989001512527466, 0.387939989566803
0.38375, 0.3857700228691101, 0.3895300030708313
0.38375, 0.3902999758720398, 0.38687998056411743
0.38375, 0.3885999917984009, 0.38639000058174133
0.38375, 0.3839299976825714, 0.3863700032234192
0.38375, 0.3861300051212311, 0.38936999440193176
@@ -15,10 +15,10 @@ set xtics 1,2,15
 set output "relu_training.eps"
 plot "data/MNIST_relu_training.csv" using ($0+1):1 with lines linestyle spatial notitle, \
 "data/CIFAR10_relu_training.csv" using ($0+1):1 with lines linestyle spatial notitle, \
-"data/CIFAR100_relu_accuracy.csv" using ($0+1):1 with lines linestyle spatial notitle, \
+"data/CIFAR100_relu_training.csv" using ($0+1):1 with lines linestyle spatial notitle, \
 "data/MNIST_relu_training.csv" using ($0+1):3 with linespoints linestyle apx_mnist title (columnhead(3)." MNIST"), \
 "data/CIFAR10_relu_training.csv" using ($0+1):3 with linespoints linestyle apx_cifar10 title (columnhead(3)." CIFAR10"), \
-"data/CIFAR100_relu_accuracy.csv" using ($0+1):3 with linespoints linestyle apx_cifar100 title (columnhead(3)." CIFAR100"), \
+"data/CIFAR100_relu_training.csv" using ($0+1):3 with linespoints linestyle apx_cifar100 title (columnhead(3)." CIFAR100"), \
 "data/MNIST_relu_training.csv" using ($0+1):2 with linespoints linestyle asm_mnist title (columnhead(2)." MNIST"), \
 "data/CIFAR10_relu_training.csv" using ($0+1):2 with linespoints linestyle asm_cifar10 title (columnhead(2)." CIFAR10"), \
-"data/CIFAR100_relu_accuracy.csv" using ($0+1):2 with linespoints linestyle asm_cifar100 title (columnhead(2)." CIFAR100")
\ No newline at end of file
+"data/CIFAR100_relu_training.csv" using ($0+1):2 with linespoints linestyle asm_cifar100 title (columnhead(2)." CIFAR100")
\ No newline at end of file
 \section{Conclusion and Future Work}
-In this work we showed how to formulate deep residual learning in the JPEG transform domain, and we showed that it provides a notable performance benefit in terms of processing time for each image. Our method uses a model of convolutions as a linear map \cite{smith1994fast} and introduces a novel approximation technique for ReLu which, to our knowledge, is the first rigorous attempt at defining a non-linear function in the JPEG transform domain. We showed that the approximation can achieve highly performant results with little impact on the overall classification accuracy.
+In this work we showed how to formulate deep residual learning in the JPEG transform domain, and that it provides a notable performance benefit in terms of processing time per image. Our method expresses convolutions as linear maps \cite{smith1994fast} and introduces a novel approximation technique for ReLu. We showed that the approximation can achieve highly performant results with little impact on classification accuracy.
-Future work should focus on two main points. The first is efficiency of representation. Our linear maps take up more space, especially when stored in dense tensor data structures, than spatial domain convolutions. This makes it hard to scale the networks to datasets with large image sizes. Secondly, library support in commodity deep learning libraries for some of the features required by this algorithm are lacking. As of this writing, true sparse tensor support is missing in all of PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with these tensors being represented as coordinate lists which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries when compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}.
\ No newline at end of file
+Future work should focus on two main points. The first is efficiency of representation. Our linear maps take up more space than spatial domain convolutions. This makes it hard to scale the networks to datasets with large image sizes. Secondly, library support in commodity deep learning libraries for some of the features required by this algorithm is lacking. As of this writing, true sparse tensor support is missing in all of PyTorch \cite{paszke2017automatic}, TensorFlow \cite{tensorflow2015-whitepaper}, and Caffe \cite{jia2014caffe}, with these tensors being represented as coordinate lists, which are known to be highly non-performant. Additionally, the \texttt{einsum} function for evaluating multilinear expressions is not fully optimized in these libraries when compared to the speed of convolutions in libraries like CuDNN \cite{chetlur2014cudnn}.
\ No newline at end of file
 \section{Experiments}
-We give experimental evidence for the efficacy of our method, starting with a brief discussion of the architectures we use and the datasets for experimentation. We use model conversion as a sanity check, ensuring that the JPEG model with exact ReLu matches exactly the testing accuracy of a spatial domain model. Next we show how the ReLu approximation accuracy effects the overall network performance. We conclude by showing the training and testing time advantage of our method.
+We give experimental evidence for the efficacy of our method, starting with a discussion of the architectures we use and the datasets. We use model conversion as a sanity check, ensuring that the JPEG model with exact ReLu matches exactly the testing accuracy of a spatial domain model. Next we show how the ReLu approximation accuracy affects overall network performance. We conclude by showing the training and testing time advantage of our method.
 \subsection{Network Architectures and Datasets}
-Since we are concerned with reproducing the inference results of spatial domain networks, we choose the MNIST \cite{lecun1998mnist} and CIFAR-10/100 \cite{krizhevsky2009learning} datasets since they are easy to work with. The MNIST images are padded to $32 \times 32$ before being used to ensure an even number of JPEG blocks. Our network architecture is similarly simple is shown in Figure \ref{fig:na}. The classification network consists of three residual blocks with the final two performing downsampling so that the final feature map consists of a single JPEG block.
+Because we are concerned with reproducing the inference results of spatial domain networks, we choose the MNIST \cite{lecun1998mnist} and CIFAR-10/100 \cite{krizhevsky2009learning} datasets, which are easy to work with. The MNIST images are padded to $32 \times 32$ to ensure an even number of JPEG blocks. Our network architecture is shown in Figure \ref{fig:na}. The classification network consists of three residual blocks with the final two performing downsampling so that the final feature map consists of a single JPEG block.
 \begin{figure}
 \centering
-\includegraphics[width=0.5\linewidth]{figures/network.pdf}
+\includegraphics[width=\linewidth]{figures/network.pdf}
 \caption{Simple network architecture. $T$ indicates the batch size.}
 \label{fig:na}
 \end{figure}
 \subsection{Model Conversion}
-For this first experiment, we provide empirical evidence that the JPEG formulation presented in this paper is mathematically equivalent to spatial domain network. To show this, we train 100 spatial domain models on each of three datasets and give their mean testing accuracies. When then use model conversion to transform the pretrained models to the JPEG domain and give the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks and the exact (15 spatial frequency) ReLu formulation is used. The result of this test is given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.
+For this first experiment, we show empirically that the JPEG formulation is mathematically equivalent to the spatial domain network. To show this, we train 100 spatial domain models on each of the three datasets and give their mean testing accuracies. We then use model conversion to transform the pretrained models to the JPEG domain and give the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks and the exact (15 spatial frequency) ReLu formulation is used. The result of this test is given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.
 \begin{table}[h]
 \centering
@@ -34,10 +34,11 @@ For this first experiment, we provide empirical evidence that the JPEG formulati
 \subsection{ReLu Approximation Accuracy}
 \label{sec:exprla}
-Next, we examine the impact of the ReLu approximation. We start by examining the raw error on individual $8 \times 8$ blocks. For this test, we take random $4 \times 4$ pixel blocks in the range $[-1, 1]$ and scale them to $8 \times 8$ using a box filter. Fully random $8 \times 8$ blocks do not accurately represent the statistics of real images and are known to be a worst case for the DCT transform. The $4 \times 4$ blocks allow for a large random sample size while still approximating real image statistics. We take 10 million such blocks and compute the average RMSE of our Approximated Spatial Masking (ASM) technique and compare it to computing ReLu directly on the approximation (APX). This test is repeated for all one to fifteen spatial frequencies. The result, shown in Figure \ref{fig:rba} shows that our ASM method gives a better approximation (lower RMSE) through the range of spatial frequencies.
+Next, we examine the impact of the ReLu approximation. We start by examining the raw error on individual $8 \times 8$ blocks. For this test, we take random $4 \times 4$ pixel blocks in the range $[-1, 1]$ and scale them to $8 \times 8$ using a box filter. Fully random $8 \times 8$ blocks do not accurately represent the statistics of real images and are known to be a worst case for the DCT transform. The $4 \times 4$ blocks allow for a large random sample size while still approximating real image statistics. We take 10 million blocks and compute the average RMSE of our ASM technique and compare it to computing ReLu directly on the approximation (APX). This test is repeated for one through fifteen spatial frequencies. The result, shown in Figure \ref{fig:rba}, shows that our ASM method gives a better approximation (lower RMSE) throughout the range of spatial frequencies.
 \begin{figure*}
 \centering
 \caption{ReLu accuracy results.}
 \begin{subfigure}{0.33\textwidth}
 \captionsetup{width=.8\linewidth}
 \centering
@@ -61,11 +62,11 @@ Next, we examine the impact of the ReLu approximation. We start by examining the
 \end{subfigure}
 \end{figure*}
-This test provides a strong motivation for the ASM method, so we move on to testing it in the model conversion setting. For this test, we again train 100 spatial domain models and then perform model conversion with the ReLu layers ranging from 1-15 spatial frequencies. We again compare our ASM method with the APX method. The result is given in Figure \ref{fig:ra}, again the ASM method outperforms the APX method.
+This test provides a strong motivation for the ASM method, so we move on to testing it in the model conversion setting. For this test, we again train 100 spatial domain models and then perform model conversion with the ReLu layers ranging from 1 to 15 spatial frequencies. We again compare our ASM method with the APX method. The result is given in Figure \ref{fig:ra}. Again, the ASM method outperforms the APX method.
-As a final test, we show that if the models are trained in the JPEG domain, the CNN weights will actually learn to cope with the approximation and fewer spatial frequencies are required to get good accuracy. We again compare ASM to APX in this setting. The result shown in Figure \ref{fig:rt} shows that the ASM method again outperforms the APX method and that the network weights have learned to cope with the approximation.
+As a final test, we show that if the models are trained in the JPEG domain, the CNN weights will actually learn to cope with the approximation and fewer spatial frequencies are required for good accuracy. The result in Figure \ref{fig:rt} shows that the ASM method again outperforms the APX method and that the network weights have learned to cope with the approximation.
 \subsection{Efficiency of Training and Testing}
@@ -76,4 +77,4 @@ As a final test, we show that if the models are trained in the JPEG domain, the
 \label{fig:rt}
 \end{figure}
-Finally, we show the throughput for training and testing. For this we test on all three datasets by training and testing a spatial model and training and testing a JPEG model and measuring the time taken. This is then converted to an average throughput measurement. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results, shown in Figure \ref{fig:rt}, show that the JPEG model is able to outperform the spatial model in all cases, but that the performance on training is still limited. This is likely because of the more complex gradient created by the convolution and ReLu operations. At inference time, however, performance is greatly improved over the spatial model.
\ No newline at end of file
+Finally, we show the throughput for training and testing. For this, we train and test both a spatial model and a JPEG model on all three datasets and measure the time taken. This is then converted to an average throughput measurement. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results, shown in Figure \ref{fig:rt}, show that the JPEG model is able to outperform the spatial model in all cases, but that the performance on training is still limited. This is likely caused by the more complex gradient created by the convolution and ReLu operations. At inference time, however, performance is greatly improved over the spatial model.
\ No newline at end of file
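The block-level test described in the ReLu Approximation Accuracy subsection can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' code: the helper names (`dct_matrix`, `random_block`, `low_frequency_approx`, `relu_errors`, `mean_errors`) are hypothetical, the box-filter upscaling is taken to be pixel replication, a frequency budget `phi` keeps coefficients with `alpha + beta <= phi` (so `phi = 14` keeps all fifteen diagonal bands), APX is read as ReLu applied to the low-frequency reconstruction, and ASM is read as masking the exact pixels wherever that reconstruction is positive. The sample count is far smaller than the 10 million blocks used in the text.

```python
import math
import random

N = 8  # JPEG block size


def dct_matrix(n=N):
    """Orthonormal DCT-II matrix; row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(k * (2 * t + 1) * math.pi / (2 * n))
                     for t in range(n)])
    return rows


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


D = dct_matrix()
DT = [list(col) for col in zip(*D)]  # transpose of D


def random_block(rng):
    """Random 4x4 block in [-1, 1], upscaled to 8x8 by pixel replication (assumed box filter)."""
    small = [[rng.uniform(-1.0, 1.0) for _ in range(N // 2)]
             for _ in range(N // 2)]
    return [[small[i // 2][j // 2] for j in range(N)] for i in range(N)]


def low_frequency_approx(block, phi):
    """Reconstruct the block from the DCT coefficients with alpha + beta <= phi."""
    coeffs = matmul(matmul(D, block), DT)          # forward 2D DCT
    kept = [[c if a + b <= phi else 0.0 for b, c in enumerate(row)]
            for a, row in enumerate(coeffs)]
    return matmul(matmul(DT, kept), D)             # inverse 2D DCT


def rmse(a, b):
    """Root mean squared error between two N x N blocks."""
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return math.sqrt(total / (N * N))


def relu_errors(block, phi):
    """Return (ASM RMSE, APX RMSE) measured against the exact spatial ReLu."""
    approx = low_frequency_approx(block, phi)
    exact = [[max(x, 0.0) for x in row] for row in block]
    apx = [[max(a, 0.0) for a in row] for row in approx]
    asm = [[x if a > 0.0 else 0.0 for x, a in zip(rx, ra)]
           for rx, ra in zip(block, approx)]
    return rmse(asm, exact), rmse(apx, exact)


def mean_errors(phi, samples=200, seed=0):
    """Average both errors over a small number of random blocks."""
    rng = random.Random(seed)
    asm_total = apx_total = 0.0
    for _ in range(samples):
        asm_err, apx_err = relu_errors(random_block(rng), phi)
        asm_total += asm_err
        apx_total += apx_err
    return asm_total / samples, apx_total / samples


if __name__ == "__main__":
    for phi in (0, 4, 9, 14):
        asm, apx = mean_errors(phi)
        print(f"phi={phi:2d}  ASM RMSE={asm:.4f}  APX RMSE={apx:.4f}")
```

With every coefficient kept (`phi = 14`) the reconstruction is exact up to rounding, so both errors should vanish; smaller budgets expose the difference between the two approximations.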
@@ -4,7 +4,7 @@ The popularization of deep learning since the 2012 AlexNet \cite{krizhevsky2012i
 This problem has been addressed many times in the literature. Batch normalization \cite{ioffe2015batch} is ubiquitous in modern networks to accelerate their convergence. Residual learning \cite{he2016deep} allows for much deeper networks to learn effective mappings without overfitting. Techniques such as pruning and weight compression \cite{han2015deep} are becoming more commonplace. As problems become even larger and more complex, these techniques are increasingly being relied upon for efficient training and inference.
-We tackle this problem at the level of the image representation. The JPEG image compression algorithm is the most widespread image file format. Traditionally, the first step in using JPEGs for machine learning is to decompress them. We propose to skip this step and instead reformulate the ResNet architecture to perform its operations directly on compressed images. The goal is to produce a new network that is mathematically equivalent to the spatial domain network, but which operates on compressed images by composing the compression transform into the network weights, which can be done because they are both linear maps. Because of the ReLu function is non-linear, we develop an approximation technique for it. This is a general method and, to our knowledge, is the first attempt at formulating a piecewise linear function in the transform domain presented in the literature.
+We tackle this problem at the level of the image representation. JPEG is the most widespread image file format. Traditionally, the first step in using JPEGs for machine learning is to decompress them. We propose to skip this step and instead reformulate the ResNet architecture to perform its operations directly on compressed images. The goal is to produce a new network that is mathematically equivalent to the spatial domain network, but which operates on compressed images by incorporating the compression transform into the network weights, which can be done because they are both linear maps. Because the ReLu function is non-linear, we develop an approximation technique for it. This is a general method and, to our knowledge, is the first attempt at formulating a piecewise linear function in the transform domain.
 The contributions of this work are as follows
 \begin{enumerate}
@@ -18,70 +18,158 @@
 % Pages are numbered in submission mode, and unnumbered in camera-ready
 \ificcvfinal\pagestyle{empty}\fi
 \setcounter{page}{4321}
 \addbibresource{bibliography.bib}
 \DeclareCaptionFormat{algor}{%
 \hrulefill\par\offinterlineskip\vskip1pt%
 \textbf{#1#2}#3\offinterlineskip\hrulefill}
 \DeclareCaptionStyle{algori}{singlelinecheck=off,format=algor,labelsep=space}
 \captionsetup[algorithm]{style=algori}
 \begin{document}
 \title{Supplementary Material}
 \maketitle
-\section{Proof of the DCT Least Squares Theorem}
+\section{Proof of the DCT Least Squares Approximation Theorem}
+\begin{theorem}[DCT Least Squares Approximation Theorem]
+Given a set of $N$ samples of a signal $X = \{x_0, \ldots, x_{N-1}\}$, let $Y = \{y_0, \ldots, y_{N-1}\}$ be the DCT coefficients of $X$. Then, for any $1 \leq m \leq N$, the approximation
+\begin{equation}
+p_m(t) = \frac{1}{\sqrt{N}}y_0 + \sqrt{\frac{2}{N}}\sum_{k=1}^{m-1} y_k\cos\left(\frac{k(2t + 1)\pi}{2N}\right)
+\label{eq:dct1d}
+\end{equation}
+of $X$ minimizes the least squared error
+\begin{equation}
+e_m = \sum_{i=0}^{N-1} (p_m(i) - x_i)^2
+\end{equation}
+\label{thm:dctls}
+\end{theorem}
+\begin{proof}
+First consider that since Equation \ref{eq:dct1d} represents the Discrete Cosine Transform, which is a linear map, we can rewrite it as
+\begin{equation}
+D^T_my = x
+\end{equation}
+where $D_m$ is formed from the first $m$ rows of the DCT matrix, $y$ is a column vector of the DCT coefficients, and $x$ is a column vector of the original samples. To solve for the least squares solution, we use the normal equations, that is, we solve
+\begin{equation}
+D_mD^T_my = D_mx
+\end{equation}
+and since the DCT is an orthonormal transformation, the rows of $D_m$ are orthonormal, so $D_mD^T_m = I$. Therefore
+\begin{equation}
+y = D_mx
+\end{equation}
+Since $D_mx$ is exactly the first $m$ DCT coefficients of $x$, the least squares solution must use the first $m$ DCT coefficients.
+\end{proof}
+\section{Proof of the DCT Mean-Variance Theorem}
+\begin{theorem}[DCT Mean-Variance Theorem]
+Given a set of samples of a signal $X$ such that $\e[X] = 0$, let $Y$ be the DCT coefficients of $X$. Then
+\begin{equation}
+\var[X] = \e[Y^2]
+\end{equation}
+\end{theorem}
+\begin{proof}
+Start by considering $\var[X]$. We can rewrite this as
+\begin{equation}
+\var[X] = \e[X^2] - \e[X]^2
+\end{equation}
+Since we are given $\e[X] = 0$, this simplifies to
+\begin{equation}
+\var[X] = \e[X^2]
+\end{equation}
+Next, we express the DCT as a linear map such that $X = DY$ and rewrite the previous equation as
+\begin{equation}
+\var[X] = \e[(DY)^2]
+\end{equation}
+Distributing the squaring operation gives
+\begin{equation}
+\e[(DY)^2] = \e[(D^TD)Y^2]
+\end{equation}
+Since $D$ is orthogonal this simplifies to
+\begin{equation}
+\e[(D^TD)Y^2] = \e[(D^{-1}D)Y^2] = \e[Y^2]
+\end{equation}
+\end{proof}
 \section{Algorithms}
-\begin{algorithm}
-\caption{Direct Convolution Explosion. $K$ is an initial filter, $m, n$ are the input and output channels, $h, w$ are the image height and width, $s$ is the stride, $\star_s$ denotes the discrete convolution with stride $s$}
-\label{alg:dce}
-\begin{algorithmic}
-\Function{Explode}{$K, m, n, h, w, s$}
-\State $d_j \gets \mathbf{shape}(\widetilde{J})$
-\State $d_b \gets (d_j[0], d_j[1], d_j[2], 1, h, w)$
-\State $\widehat{J} \gets \mathbf{reshape}(\widetilde{J},d_b)$
-\State $\widehat{C} \gets \widehat{J} \star_s K$
-\State $d_c \gets (m, n, d_j[0], d_j[1], d_j[2], h/s, h/s)$
-\State $\widetilde{C} \gets \mathbf{reshape}(\widehat{C}, d_c)$
-\State $\mathbf{return} \; \widetilde{C}J$
-\EndFunction
-\end{algorithmic}
-\end{algorithm}
-\begin{algorithm}
-\caption{Automated Spatial Masking for ReLu. $F$ is a DCT domain block, $\phi$ is the desired maximum spatial frequencies, $N$ is the block size.}
-\label{alg:asmr}
-\begin{algorithmic}
-\Function{ReLu}{$F, \phi, N$}
-\State $M \gets$ \Call{ANNM}{$F, \phi, N$}
-\State $\mathbf{return}\;$ \Call{ApplyMask}{$F, M$}
-\EndFunction
-\Function{ANNM}{$F, \phi, N$}
-\State $I \gets \mathbf{zeros}(N, N)$
-\For{$i \in [0, N)$}
-\For{$j \in [0, N)$}
-\For{$\alpha \in [0, N)$}
-\For{$\beta \in [0, N)$}
-\If{$\alpha + \beta \leq \phi$}
-\State $I_{ij} \gets I_{ij} + F_{ij}D^{\alpha\beta}_{ij}$
+We conclude by outlining in pseudocode the algorithms for the three layer operations described in the paper. Algorithm \ref{alg:dce} gives the code for convolution explosion, Algorithm \ref{alg:asmr} gives the code for the ASM ReLu approximation, and Algorithm \ref{alg:bn} gives the code for Batch Normalization.
+\captionof{algorithm}{Convolution Explosion. $K$ is an initial filter, $m, n$ are the input and output channels, $h, w$ are the image height and width, $s$ is the stride, $\star_s$ denotes the discrete convolution with stride $s$}
+\label{alg:dce}
+\begin{algorithmic}
+\Function{Explode}{$K, m, n, h, w, s$}
+\State $d_j \gets \mathbf{shape}(\widetilde{J})$
+\State $d_b \gets (d_j[0], d_j[1], d_j[2], 1, h, w)$
+\State $\widehat{J} \gets \mathbf{reshape}(\widetilde{J},d_b)$
+\State $\widehat{C} \gets \widehat{J} \star_s K$
+\State $d_c \gets (m, n, d_j[0], d_j[1], d_j[2], h/s, w/s)$
+\State $\widetilde{C} \gets \mathbf{reshape}(\widehat{C}, d_c)$
+\State $\mathbf{return} \; \widetilde{C}J$
+\EndFunction
+\end{algorithmic}
+\captionof{algorithm}{Approximated Spatial Masking for ReLu. $F$ is a DCT domain block, $\phi$ is the desired maximum number of spatial frequencies, $N$ is the block size.}
+\label{alg:asmr}
+\begin{algorithmic}
+\Function{ReLu}{$F, \phi, N$}
+\State $M \gets$ \Call{ANNM}{$F, \phi, N$}
+\State $\mathbf{return}\;$ \Call{ApplyMask}{$F, M$}
+\EndFunction
+\Function{ANNM}{$F, \phi, N$}
+\State $I \gets \mathbf{zeros}(N, N)$
+\For{$i \in [0, N)$}
+\For{$j \in [0, N)$}
+\For{$\alpha \in [0, N)$}
+\For{$\beta \in [0, N)$}
+\If{$\alpha + \beta \leq \phi$}
+\State $I_{ij} \gets I_{ij} + F_{\alpha\beta}D^{\alpha\beta}_{ij}$
+\EndIf
+\EndFor
+\EndFor
+\EndFor
+\EndFor
+\State $M \gets \mathbf{zeros}(N, N)$
+\State $M[I > 0] \gets 1$
+\State $\mathbf{return} \; M$
+\EndFunction
+\Function{ApplyMask}{$F, M$}
+\State $\mathbf{return} \; H^{\alpha\beta ij}_{\alpha'\beta'}F_{\alpha\beta}M_{ij}$
+\EndFunction
+\end{algorithmic}
+\captionof{algorithm}{Batch Normalization. $F$ is a batch of JPEG blocks (dimensions $N \times 64$), $S$ is the inverse quantization matrix, $m$ is the momentum for updating running statistics, $t$ is a flag that denotes training or testing mode. The parameters $\gamma$ and $\beta$ are stored externally to the function. $\widehat{}\;$ is used to denote a batch statistic and $\tilde{}\;$ is used to denote a running statistic.}
+\label{alg:bn}
+\begin{algorithmic}
+\Function{BatchNorm}{$F$,$S$,$m$,$t$}
+\If{$t$}
+\State $\mu \gets \mathbf{mean}(F[:, 0])$
+\State $\widehat{\mu} \gets F[:, 0]$
+\State $F[:, 0] \gets 0$
+\State $D_g \gets F_kS_k$
+\State $\widehat{\sigma^2} \gets \mathbf{mean}(F^2, 1)$
+\State $\sigma^2 \gets \mathbf{mean}(\widehat{\sigma^2} + \widehat{\mu}^2) - \mu^2$
+\State $\widetilde{\mu} \gets \widetilde{\mu}(1 - m) + \mu m$
+\State $\widetilde{\sigma^2} \gets \widetilde{\sigma^2}(1 - m) + \sigma^2 m$
+\State $F[:, 0] \gets F[:, 0] - \mu$
+\State $F \gets \frac{\gamma F}{\sigma}$
+\State $F[:, 0] \gets F[:, 0] + \beta$
+\Else
+\State $F[:, 0] \gets F[:, 0] - \widetilde{\mu}$
+\State $F \gets \frac{\gamma F}{\widetilde{\sigma}}$
+\State $F[:, 0] \gets F[:, 0] + \beta$
 \EndIf
-\EndFor
-\EndFor
-\EndFor
-\EndFor
-\State $M \gets \mathbf{zeros}(N, N)$
-\State $M[I > 0] \gets 1$
-\State $\mathbf{return} \; M$
-\EndFunction
-\Function{ApplyMask}{$F, M$}
-\State $\mathbf{return} \; H^{\alpha\beta ij}_{\alpha'\beta'}F_{\alpha\beta}M_{ij}$
-\EndFunction
-\end{algorithmic}
-\end{algorithm}
+\State $\mathbf{return} \; F$
+\EndFunction
+\end{algorithmic}
 \end{document}
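The step in the Mean-Variance proof that distributes the squaring operation can be spelled out with vector notation. The following is a sketch under assumptions that are not stated in the source: expectations are read as averages over the $N$ entries of a block, $x$ and $y$ are column vectors holding the samples of $X$ and the coefficients of $Y$, and $x = Dy$ with $D^TD = I$. It relies on the `\var` and `\e` macros used above and on `align*` from `amsmath`.

```latex
\begin{align*}
\var[X] = \e[X^2] &= \frac{1}{N} x^T x \\
                  &= \frac{1}{N} (Dy)^T (Dy) \\
                  &= \frac{1}{N} y^T (D^T D) y \\
                  &= \frac{1}{N} y^T y = \e[Y^2]
\end{align*}
```

Read this way, the theorem is an instance of the fact that an orthonormal transform preserves the Euclidean norm, which is why the batch variance can be computed from the squared coefficients without leaving the transform domain.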