...
 
Commits (9)
@@ -20,19 +20,21 @@ jpeg_model = models.JpegResNet(spatial_model, n_freqs=6).to(device)
 optimizer = optim.Adam(jpeg_model.parameters())
-t0 = time.perf_counter()
+t0 = time.time()
 models.train(jpeg_model, device, dataset[0], optimizer, 0)
-t1 = time.perf_counter()
+torch.cuda.synchronize()
+t1 = time.time()
 training_time = t1 - t0
-t0 = time.perf_counter()
 jpeg_model.explode_all()
+t0 = time.time()
 models.test(jpeg_model, device, dataset[1])
-t1 = time.perf_counter()
+torch.cuda.synchronize()
+t1 = time.time()
 testing_time = t1 - t0
 with open('{}_jpeg_throughput.csv'.format(args.dataset), 'w') as f:
     f.write('Training, Testing\n')
-    f.write('{}, {}\n'.format(training_time / len(dataset[0]), testing_time / len(dataset[1])))
+    f.write('{}, {}\n'.format(len(dataset[0]) / training_time, len(dataset[1]) / testing_time))
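For context on this hunk: CUDA kernels launch asynchronously, so the device must be synchronized before reading the wall clock, and the CSV now records images per second rather than seconds per image. A minimal sketch of the pattern, assuming a CUDA device; run_epoch and n_images are illustrative stand-ins for models.train/models.test and len(dataset[i]):

import time
import torch

def timed_throughput(run_epoch, n_images):
    t0 = time.time()
    run_epoch()
    torch.cuda.synchronize()  # wait for all queued GPU work to finish
    elapsed = time.time() - t0
    return n_images / elapsed  # images per second, as written to the CSV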
@@ -20,19 +20,19 @@ spatial_model = models.SpatialResNet(dataset_info['channels'], dataset_info['cla
 optimizer = optim.Adam(spatial_model.parameters())
-t0 = time.perf_counter()
+t0 = time.time()
 models.train(spatial_model, device, dataset[0], optimizer, 0, do_decode=True)
 torch.cuda.synchronize()
-t1 = time.perf_counter()
+t1 = time.time()
 training_time = t1 - t0
-t0 = time.perf_counter()
+t0 = time.time()
 models.test(spatial_model, device, dataset[1], do_decode=True)
 torch.cuda.synchronize()
-t1 = time.perf_counter()
+t1 = time.time()
 testing_time = t1 - t0
 with open('{}_spatial_throughput.csv'.format(args.dataset), 'w') as f:
     f.write('Training, Testing\n')
-    f.write('{}, {}\n'.format(training_time / len(dataset[0]), testing_time / len(dataset[1])))
+    f.write('{}, {}\n'.format(len(dataset[0]) / training_time, len(dataset[1]) / testing_time))
Spatial, ASM, APX
0.72776, 0.48462995886802673, 0.4566900134086609
0.72776, 0.5558299422264099, 0.5400300621986389
0.72776, 0.6149600148200989, 0.6055999994277954
0.72776, 0.6595499515533447, 0.6516900062561035
0.72776, 0.6862999796867371, 0.6746500134468079
0.72776, 0.694159984588623, 0.6993299722671509
0.72776, 0.7129400372505188, 0.7222698926925659
0.72776, 0.719819962978363, 0.7291499972343445
0.72776, 0.7211200594902039, 0.7297700047492981
0.72776, 0.7231199741363525, 0.7259500026702881
0.72776, 0.7296000123023987, 0.728630006313324
0.72776, 0.7300399541854858, 0.7317399978637695
0.72776, 0.7276500463485718, 0.7322199940681458
0.72776, 0.7306400537490845, 0.730009913444519
0.72776, 0.7236000299453735, 0.7270099520683289
Spatial, ASM, APX
0.9875400000000001, 0.9659301042556763, 0.9311500787734985
0.9875400000000001, 0.9693500399589539, 0.9617300033569336
0.9875400000000001, 0.980239987373352, 0.9740599393844604
0.9875400000000001, 0.9879900217056274, 0.9875600934028625
0.9875400000000001, 0.9879299998283386, 0.9904500842094421
0.9875400000000001, 0.9882300496101379, 0.9921500086784363
0.9875400000000001, 0.9884198904037476, 0.9900400042533875
0.9875400000000001, 0.9873700141906738, 0.9901800155639648
0.9875400000000001, 0.9889499545097351, 0.9898198843002319
0.9875400000000001, 0.9887899160385132, 0.9870001077651978
0.9875400000000001, 0.9892199635505676, 0.988290011882782
0.9875400000000001, 0.9879299998283386, 0.9864799380302429
0.9875400000000001, 0.9885400533676147, 0.9889300465583801
0.9875400000000001, 0.9855000376701355, 0.9899399876594543
0.9875400000000001, 0.9884600639343262, 0.9882300496101379
,JPEG Training, JPEG Testing, Spatial Training, Spatial Testing
MNIST, 4.98943192227498, 19.2987381822201, 4.83365493836866, 5.01349337609419
CIFAR10, 4.70592797135485, 17.8355480269352, 4.65057019553071, 4.88444863330153
CIFAR100, 4.70259286114027, 17.8396317186929, 4.62396722578372, 4.79050114439679
\ No newline at end of file
@@ -3,20 +3,22 @@
 load "common.gp"
 load "relu_styles.gp"
 set size 1.0, 1.2
 set ylabel 'Average Accuracy (%)'
+set ytics 0.2
 set xlabel 'Number of Spatial Frequencies'
-set xrange ['1':'15']
-set xtics 1
+set xrange [1:15]
+set xtics 1,2,15
 set output "relu_accuracy.eps"
 plot "data/MNIST_relu_accuracy.csv" using ($0+1):1 with lines linestyle spatial notitle, \
 "data/CIFAR10_relu_accuracy.csv" using ($0+1):1 with lines linestyle spatial notitle, \
 "data/CIFAR100_relu_accuracy.csv" using ($0+1):1 with lines linestyle spatial notitle, \
-"data/MNIST_relu_accuracy.csv" using ($0+1):2 with linespoints linestyle asm_mnist title (columnhead(2)." MNIST"), \
-"data/CIFAR10_relu_accuracy.csv" using ($0+1):2 with linespoints linestyle asm_cifar10 title (columnhead(2)." CIFAR10"), \
-"data/CIFAR100_relu_accuracy.csv" using ($0+1):2 with linespoints linestyle asm_cifar100 title (columnhead(2)." CIFAR100"), \
 "data/MNIST_relu_accuracy.csv" using ($0+1):3 with linespoints linestyle apx_mnist title (columnhead(3)." MNIST"), \
 "data/CIFAR10_relu_accuracy.csv" using ($0+1):3 with linespoints linestyle apx_cifar10 title (columnhead(3)." CIFAR10"), \
-"data/CIFAR100_relu_accuracy.csv" using ($0+1):3 with linespoints linestyle apx_cifar100 title (columnhead(3)." CIFAR100")
+"data/CIFAR100_relu_accuracy.csv" using ($0+1):3 with linespoints linestyle apx_cifar100 title (columnhead(3)." CIFAR100"), \
+"data/MNIST_relu_accuracy.csv" using ($0+1):2 with linespoints linestyle asm_mnist title (columnhead(2)." MNIST"), \
+"data/CIFAR10_relu_accuracy.csv" using ($0+1):2 with linespoints linestyle asm_cifar10 title (columnhead(2)." CIFAR10"), \
+"data/CIFAR100_relu_accuracy.csv" using ($0+1):2 with linespoints linestyle asm_cifar100 title (columnhead(2)." CIFAR100")
\ No newline at end of file
@@ -3,11 +3,16 @@
 load "common.gp"
 load "relu_styles.gp"
 set size 1.0, 1.1
 set key horizontal
 set ylabel 'Average RMSE'
 set ytics 0.1
 set xlabel 'Number of Spatial Frequencies'
-set xrange ['1':'15']
-set xtics 1
+set xrange [1:15]
+set xtics 1,2,15
 set output "relu_blocks.eps"
 plot "data/relu_blocks.csv" using ($0+1):2 with linespoints linestyle apx_block title columnhead, \
......
 set size 1.0, 1.0
 # method colors
-apx_color = 0.85
-asm_color = 0.35
+apx_color = "#e69f00"
+asm_color = "#009e73"
 # dataset point types
 mnist_point = 7
@@ -13,25 +15,25 @@ set style line 1 linewidth 8 dashtype 2 linetype rgb "black"
 spatial = 1
 # mnist lines
-set style line 2 linewidth 8 pointsize 4 pointtype mnist_point palette frac asm_color
-set style line 3 linewidth 8 pointsize 4 pointtype mnist_point palette frac apx_color
+set style line 2 linewidth 8 pointsize 4 pointtype mnist_point linecolor rgb asm_color
+set style line 3 linewidth 8 pointsize 4 pointtype mnist_point linecolor rgb apx_color
 asm_mnist = 2
 apx_mnist = 3
 # cifar10 lines
-set style line 4 linewidth 8 pointsize 5 pointtype cifar10_point palette frac asm_color
-set style line 5 linewidth 8 pointsize 5 pointtype cifar10_point palette frac apx_color
+set style line 4 linewidth 8 pointsize 5 pointtype cifar10_point linecolor rgb asm_color
+set style line 5 linewidth 8 pointsize 5 pointtype cifar10_point linecolor rgb apx_color
 asm_cifar10 = 4
 apx_cifar10 = 5
 # cifar100 lines
-set style line 6 linewidth 8 pointsize 4 pointtype cifar100_point palette frac asm_color
-set style line 7 linewidth 8 pointsize 4 pointtype cifar100_point palette frac apx_color
+set style line 6 linewidth 8 pointsize 4 pointtype cifar100_point linecolor rgb asm_color
+set style line 7 linewidth 8 pointsize 4 pointtype cifar100_point linecolor rgb apx_color
 asm_cifar100 = 6
 apx_cifar100 = 7
 # block lines
-set style line 8 linewidth 8 pointsize 4 pointtype block_point palette frac asm_color
-set style line 9 linewidth 8 pointsize 4 pointtype block_point palette frac apx_color
+set style line 8 linewidth 8 pointsize 4 pointtype block_point linecolor rgb asm_color
+set style line 9 linewidth 8 pointsize 4 pointtype block_point linecolor rgb apx_color
 asm_block = 8
 apx_block = 9
\ No newline at end of file
#!/usr/bin/gnuplot -c
load "common.gp"
load "relu_styles.gp"
set size 1.0, 1.2
set ylabel 'Average Accuracy (%)'
set ytics 0.2
set xlabel 'Number of Spatial Frequencies'
set xrange [1:15]
set xtics 1,2,15
set output "relu_training.eps"
plot "data/MNIST_relu_training.csv" using ($0+1):1 with lines linestyle spatial notitle, \
"data/CIFAR10_relu_training.csv" using ($0+1):1 with lines linestyle spatial notitle, \
"data/CIFAR100_relu_accuracy.csv" using ($0+1):1 with lines linestyle spatial notitle, \
"data/MNIST_relu_training.csv" using ($0+1):3 with linespoints linestyle apx_mnist title (columnhead(3)." MNIST"), \
"data/CIFAR10_relu_training.csv" using ($0+1):3 with linespoints linestyle apx_cifar10 title (columnhead(3)." CIFAR10"), \
"data/CIFAR100_relu_accuracy.csv" using ($0+1):3 with linespoints linestyle apx_cifar100 title (columnhead(3)." CIFAR100"), \
"data/MNIST_relu_training.csv" using ($0+1):2 with linespoints linestyle asm_mnist title (columnhead(2)." MNIST"), \
"data/CIFAR10_relu_training.csv" using ($0+1):2 with linespoints linestyle asm_cifar10 title (columnhead(2)." CIFAR10"), \
"data/CIFAR100_relu_accuracy.csv" using ($0+1):2 with linespoints linestyle asm_cifar100 title (columnhead(2)." CIFAR100")
\ No newline at end of file
#!/usr/bin/gnuplot -c
load "common.gp"
set style fill solid
set yrange [0:20]
set ylabel 'Throughput (Images/Sec)'
set style data histogram
set style histogram cluster gap 1
set boxwidth 0.9
set xtics scale 0
set output "throughput.eps"
plot "data/throughput.csv" using 2:xtic(1), \
"data/throughput.csv" using 3, \
"data/throughput.csv" using 4, \
"data/throughput.csv" using 5
@@ -40,32 +40,32 @@ Theorem \ref{thm:dctls} states that a reconstruction using the $m$ lowest spatia
 A key observation of the JPEG algorithm, and the foundation of most compressed domain processing methods \cite{chang1992video, chang1993new, natarajan1995fast, shen1995inner, shen1996direct, shen1998block, smith1993algorithms, smith1994fast}, is that Steps 1-4 of the JPEG compression algorithm are linear maps, so they can be composed, along with other linear operations, into a single linear map which performs the operations on the compressed representation. Step 5, the rounding step, cannot be undone, and Step 6, the entropy coding, is nonlinear and therefore must be undone. We define the JPEG Transform Domain as the output of Step 4 of the JPEG encoding algorithm. Inputs to the algorithms described here will be JPEGs after reversing the entropy coding.
-Formally, we model a single plane image as the type (0, 2) tensor $I \in V^* \otimes V^*$ for some vector space $V$ and its dual $V^*$. Note that we take all tensors as a tensor product combination of $V$ and $V^*$ without loss of generality. In real images, the dimensions have physical meaning (\eg width and height of the image) and will be of different sizes. The basis of $V$ is always the standard orthonormal basis, this is important as it allows the free raising and lowering of indices without the use of a metric tensor.
-We define the JPEG transform $J \in V \otimes V \otimes V^* \otimes V^* \otimes V^*$ $J$ represents a linear map $J: V^* \otimes V^* \rightarrow V^* \otimes V^* \otimes V^*$ which is computed as (in Einstein notation)
+Formally, we model a single plane image as the type (0, 2) tensor $I \in H^* \otimes W^*$ where $H$ and $W$ are vector spaces and $*$ denotes the dual space. We always use the standard orthonormal basis for these vector spaces; this is important as it allows the free raising and lowering of indices without the use of a metric tensor.
+We define the JPEG transform $J \in H \otimes W \otimes X^* \otimes Y^* \otimes K^*$. $J$ represents a linear map $J: H^* \otimes W^* \rightarrow X^* \otimes Y^* \otimes K^*$ which is computed as (in Einstein notation)
 \begin{equation}
-I'_{xyk} = J^{sr}_{xyk}I_{sr}
+I'_{xyk} = J^{hw}_{xyk}I_{hw}
 \end{equation}
-and we say that $I'$ is the representation of $I$ in the JPEG transform domain. In the above equation, the indices $s,r$ give the pixel position, the indices $x,y$ give the block position, and the index $k$ gives the offset into the block.
+and we say that $I'$ is the representation of $I$ in the JPEG transform domain. In the above equation, the indices $h,w$ give the pixel position, the indices $x,y$ give the block position, and the index $k$ gives the offset into the block.
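As a concrete illustration (an assumption-laden sketch, not the paper's code), if $J$ were materialized as a dense array this contraction is a single einsum; the shapes and random stand-in values below are only for demonstration:

import numpy as np

H, W = 16, 16                                 # a 2x2 grid of 8x8 blocks
J = np.random.rand(H, W, H // 8, W // 8, 64)  # stand-in for the JPEG tensor
I = np.random.rand(H, W)                      # single plane image
I_prime = np.einsum('hwxyk,hw->xyk', J, I)    # I'_{xyk} = J^{hw}_{xyk} I_{hw}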
-The form of $J$ is constructed from the JPEG compression steps listed in the previous section. Let the linear map $B: V^* \otimes V^* \rightarrow V^* \otimes V^* \otimes V^* \otimes V^*$ be defined as
+The form of $J$ is constructed from the JPEG compression steps listed in the previous section. Let the linear map $B: H^* \otimes W^* \rightarrow X^* \otimes Y^* \otimes I^* \otimes J^*$ be defined as
 \begin{equation}
-B^{sr}_{xyij} = \left\{ \begin{array}{lr} 1 & \text{$s,r$ belongs in block $x,y$ at offset $i,j$} \\ 0 & \text{otherwise} \end{array} \right.
+B^{hw}_{xyij} = \left\{ \begin{array}{lr} 1 & \text{$h,w$ belongs in block $x,y$ at offset $i,j$} \\ 0 & \text{otherwise} \end{array} \right.
 \end{equation}
 then $B$ can be used to break the image represented by $I$ into blocks of a given size such that the first two indices $x,y$ index the block position and the last two indices $i,j$ index the offset into the block.
-Next. let the linear map $D: V^* \otimes V^* \rightarrow V^* \otimes V^*$ be defined as
+Next, let the linear map $D: I^* \otimes J^* \rightarrow A^* \otimes B^*$ be defined as
 \begin{align}
 D^{ij}_{\alpha\beta} = \frac{1}{4}A(\alpha)A(\beta)\cos\left(\frac{(2i+1)\alpha\pi}{16}\right)\cos\left(\frac{(2j+1)\beta\pi}{16}\right)
 \end{align}
-then $D$ represents the 2D discrete forward (and inverse) DCT. Let $Z: V^* \otimes V^* \rightarrow V^*$ be defined as
+then $D$ represents the 2D discrete forward (and inverse) DCT, where $A(\alpha) = \frac{1}{\sqrt{2}}$ for $\alpha = 0$ and $1$ otherwise. Let $Z: A^* \otimes B^* \rightarrow \Gamma^*$ be defined as
 \begin{equation}
 Z^{\alpha\beta}_\gamma = \left\{ \begin{array}{lr} 1 & \text{$\alpha, \beta$ is at $\gamma$ under zigzag ordering} \\ 0 & \text{otherwise} \end{array} \right.
 \end{equation}
-then $Z$ creates the zigzag ordered vectors. Finally, let $S: V^* \rightarrow V^*$ be
+then $Z$ creates the zigzag ordered vectors. Finally, let $S: \Gamma^* \rightarrow K^*$ be
 \begin{equation}
 S^\gamma_k = \frac{1}{q_k}
@@ -75,7 +75,7 @@ where $q_k$ is a quantization coefficient, $S$ can be used to scale the vector e
 With linear maps for each step of the JPEG transform, we can then apply them to each other to create the $J$ tensor that was described at the beginning of this section
 \begin{equation}
-J^{sr}_{xyk} = B^{sr}_{xyij}D^{ij}_{\alpha\beta}Z^{\alpha\beta}_{\gamma}S^\gamma_k
+J^{hw}_{xyk} = B^{hw}_{xyij}D^{ij}_{\alpha\beta}Z^{\alpha\beta}_{\gamma}S^\gamma_k
 \end{equation}
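To make the construction concrete, the following numpy sketch builds dense versions of $B$, $D$, $Z$, and $S$ for $8 \times 8$ blocks and composes them into $J$. The helper names, the all-ones quantization table, and the dense representation are illustrative assumptions; a real implementation would exploit the sparsity of these tensors.

import numpy as np

BLK = 8  # JPEG block size

def make_B(h, w):
    # B[h, w, x, y, i, j] = 1 when pixel (h, w) lies in block (x, y) at offset (i, j)
    B = np.zeros((h, w, h // BLK, w // BLK, BLK, BLK))
    for s in range(h):
        for r in range(w):
            B[s, r, s // BLK, r // BLK, s % BLK, r % BLK] = 1
    return B

def make_D():
    # D[i, j, a, b]: 2D DCT basis with A(0) = 1/sqrt(2) and A(a) = 1 otherwise
    A = lambda a: 1 / np.sqrt(2) if a == 0 else 1.0
    D = np.zeros((BLK,) * 4)
    for i in range(BLK):
        for j in range(BLK):
            for a in range(BLK):
                for b in range(BLK):
                    D[i, j, a, b] = (0.25 * A(a) * A(b)
                        * np.cos((2 * i + 1) * a * np.pi / 16)
                        * np.cos((2 * j + 1) * b * np.pi / 16))
    return D

def zigzag():
    # Standard JPEG zigzag scan order over an 8x8 block
    order = []
    for s in range(2 * BLK - 1):
        ij = (range(min(s, BLK - 1), max(0, s - BLK + 1) - 1, -1) if s % 2 == 0
              else range(max(0, s - BLK + 1), min(s, BLK - 1) + 1))
        order.extend((i, s - i) for i in ij)
    return order

Z = np.zeros((BLK, BLK, BLK * BLK))
for g, (a, b) in enumerate(zigzag()):
    Z[a, b, g] = 1                      # Z^{ab}_g
q = np.ones(BLK * BLK)                  # placeholder quantization table (lossless)
S = np.diag(1.0 / q)                    # S^gamma_k = 1/q_k on the diagonal
# Compose the steps: J^{hw}_{xyk} = B^{hw}_{xyij} D^{ij}_{ab} Z^{ab}_g S^g_k
J = np.einsum('hwxyij,ijab,abg,gk->hwxyk', make_B(16, 16), make_D(), Z, S)

The decompression tensor $\widetilde{J}$ defined next uses the same component maps with the scaling inverted.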
 The inverse mapping also exists as a tensor $\widetilde{J}$ which can be defined using the same linear maps with the exception of $S$. Let $\widetilde{S}$ be
@@ -86,14 +86,14 @@ The inverse mapping also exists as a tensor $\widetilde{J}$ which can be defined
 Then
 \begin{equation}
-\widetilde{J}^{xyk}_{sr} = B_{sr}^{xyij}D_{ij}^{\alpha\beta}Z_{\alpha\beta}^{\gamma}\widetilde{S}^k_\gamma
+\widetilde{J}^{xyk}_{hw} = B_{hw}^{xyij}D_{ij}^{\alpha\beta}Z_{\alpha\beta}^{\gamma}\widetilde{S}^k_\gamma
 \end{equation}
-noting that, for all tensors other than $\widetilde{S}$, we have freely raised and lowered indices without the use of a metric tensor on $V$ since we consider only the standard orthonormal basis, as stated earlier.
+noting that, for all tensors other than $\widetilde{S}$, we have freely raised and lowered indices without the use of a metric tensor since we consider only the standard orthonormal basis, as stated earlier.
-Next consider a linear map $C: V^* \otimes V^* \rightarrow V^* \otimes V^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we would first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: V^* \otimes V^* \otimes V^* \rightarrow V^* \otimes V^* \otimes V^*$ as
+Next consider a linear map $C: H^* \otimes W^* \rightarrow H^* \otimes W^*$ which performs an arbitrary pixel manipulation on an image plane $I$. To apply this mapping to a JPEG image $I'$, we would first decompress the image, apply $C$ to the result, then compress that result to get the final JPEG. Since compressing is an application of $J$ and decompressing is an application of $\widetilde{J}$, we can form a new linear map $\Xi: X^* \otimes Y^* \otimes K^* \rightarrow X^* \otimes Y^* \otimes K^*$ as
 \begin{equation}
 \label{eq:stoj}
-\Xi^{xyk}_{x'y'k'} = \widetilde{J}^{xyk}_{sr}C^{sr}_{s'r'}J^{s'r'}_{x'y'k'}
+\Xi^{xyk}_{x'y'k'} = \widetilde{J}^{xyk}_{hw}C^{hw}_{h'w'}J^{h'w'}_{x'y'k'}
 \end{equation}
 which applies $C$ in the JPEG transform domain. There are two important points to note about $\Xi$. The first is that, although it encapsulates decompression, application of $C$, and recompression, it uses far fewer operations than performing these processes separately, since the coefficients are multiplied out. The second is that it is mathematically equivalent to performing $C$ on the decompressed image and compressing the result; it is not an approximation.
\ No newline at end of file
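Continuing the sketch above (random stand-ins replace the real $J$, $\widetilde{J}$, and $C$, whose construction is outlined earlier), the composition in Eq. \ref{eq:stoj} and its application are each a single contraction:

import numpy as np

X, Y, K, H, W = 2, 2, 64, 16, 16
J = np.random.rand(H, W, X, Y, K)        # stand-ins; see the earlier sketch
J_tilde = np.random.rand(X, Y, K, H, W)  # for how these would be built
C = np.random.rand(H, W, H, W)           # arbitrary pixel-domain linear map

# Fold decompression, C, and recompression into one tensor
Xi = np.einsum('xykhw,hwab,abXYK->xykXYK', J_tilde, C, J)

# Applying Xi to a transform-domain image is then a single contraction,
# with no intermediate decompressed image ever materialized
I_prime = np.random.rand(X, Y, K)
out = np.einsum('xykXYK,XYK->xyk', Xi, I_prime)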
@@ -18,6 +18,7 @@ Since we are concerned with reproducing the inference results of spatial domain
 For this first experiment, we provide empirical evidence that the JPEG formulation presented in this paper is mathematically equivalent to the spatial domain network. To show this, we train 100 spatial domain models on each of three datasets and give their mean testing accuracies. We then use model conversion to transform the pretrained models to the JPEG domain and give the mean testing accuracies of the JPEG models. The images are losslessly JPEG compressed for input to the JPEG networks, and the exact (15 spatial frequency) ReLu formulation is used. The result of this test is given in Table \ref{tab:mc}. Since the accuracy difference between the networks is extremely small, the deviation is also included.
 \begin{table}[h]
+\centering
 \begin{tabular}{|r|l|l|l|}
 \hline
 Dataset & Spatial & JPEG & Deviation \\ \hline
@@ -33,24 +34,46 @@ For this first experiment, we provide empirical evidence that the JPEG formulati
 \subsection{ReLu Approximation Accuracy}
 \label{sec:exprla}
-Next, we with to examine the impact of the ReLu approximation. We start by examining the raw error on individual $8 \times 8$ blocks. For this test, we take random $4 \times 4$ pixel blocks in the range $[-1, 1]$ and scale them to $8 \times 8$ using a box filter. Fully random $8 \times 8$ blocks do not accurately represent the statistics of real images and are known to be a worst case for the DCT transform. The $4 \times 4$ blocks allow for a large random sample size while still approximating real image statistics. We take 10 million such blocks and compute the average RMSE of our Approximated Spatial Masking (ASM) technique and compare it to computing ReLu directly on the approximation (APX). This test is repeated for all one to fifteen spatial frequencies. The result, shown in Figure \ref{fig:rba} shows that our ASM method gives a better approximation (lower RMSE) through the range of spatial frequencies.
+Next, we examine the impact of the ReLu approximation. We start by examining the raw error on individual $8 \times 8$ blocks. For this test, we take random $4 \times 4$ pixel blocks in the range $[-1, 1]$ and scale them to $8 \times 8$ using a box filter. Fully random $8 \times 8$ blocks do not accurately represent the statistics of real images and are known to be a worst case for the DCT transform. The $4 \times 4$ blocks allow for a large random sample size while still approximating real image statistics. We take 10 million such blocks and compute the average RMSE of our Approximated Spatial Masking (ASM) technique and compare it to computing ReLu directly on the approximation (APX). This test is repeated for one to fifteen spatial frequencies. The result, shown in Figure \ref{fig:rba}, shows that our ASM method gives a better approximation (lower RMSE) across the range of spatial frequencies.
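A hedged sketch of this block-generation protocol follows; asm_relu and apx_relu are placeholders for the two approximation methods, and the sample count is reduced from the paper's 10 million for memory reasons:

import numpy as np

rng = np.random.default_rng(0)
small = rng.uniform(-1, 1, size=(100_000, 4, 4))
# Box-filter upscale: each 4x4 entry becomes a constant 2x2 patch
blocks = small.repeat(2, axis=1).repeat(2, axis=2)   # shape (N, 8, 8)
exact = np.maximum(blocks, 0)                        # true ReLu

def rmse(approx):
    return np.sqrt(np.mean((approx - exact) ** 2))
# For n = 1..15 spatial frequencies, one would then compare
# rmse(asm_relu(blocks, n)) against rmse(apx_relu(blocks, n)).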
-\begin{figure}
-\includegraphics[width=\linewidth]{plots/relu_blocks.eps}
-\caption{ReLu blocks accuracy. Our ASM method consistently gives lower error than the naive approximation method.}
-\label{fig:rba}
-\end{figure}
+\begin{figure*}
+\centering
+\begin{subfigure}{0.33\textwidth}
+\captionsetup{width=.8\linewidth}
+\centering
+\includegraphics[width=\textwidth]{plots/relu_blocks.eps}
+\caption{ReLu blocks error. Our ASM method consistently gives lower error than the naive approximation method.}
+\label{fig:rba}
+\end{subfigure}%
+\begin{subfigure}{0.33\textwidth}
+\captionsetup{width=.8\linewidth}
+\centering
+\includegraphics[width=\textwidth]{plots/relu_accuracy.eps}
+\caption{ReLu model conversion accuracy. ASM again outperforms the naive approximation. The spatial domain accuracy is given for each dataset with dashed lines.}
+\label{fig:ra}
+\end{subfigure}%
+\begin{subfigure}{0.33\textwidth}
+\captionsetup{width=.8\linewidth}
+\centering
+\includegraphics[width=\textwidth]{plots/relu_training.eps}
+\caption{ReLu training accuracy. The network weights have learned to correct for the ReLu approximation, allowing fewer spatial frequencies to be used for high accuracy.}
+\label{fig:rt}
+\end{subfigure}
+\end{figure*}
 This test provides a strong motivation for the ASM method, so we move on to testing it in the model conversion setting. For this test, we again train 100 spatial domain models and then perform model conversion with the ReLu layers ranging from 1-15 spatial frequencies. We again compare our ASM method with the APX method. The result is given in Figure \ref{fig:ra}; again, the ASM method outperforms the APX method.
-\begin{figure}
-\includegraphics[width=\linewidth]{plots/relu_accuracy.eps}
-\caption{ReLu model conversion accuracy. ASM again outperforms the naive approximation. The spatial domain accuracy is given for each dataset with dashed lines.}
-\label{fig:ra}
-\end{figure}
-As a final test, we show that if the models are trained in the JPEG domain, the CNN weights will actually learn to cope with the approximation and fewer spatial frequencies are required to get good accuracy. We again compare ASM to APX in this setting.
+As a final test, we show that if the models are trained in the JPEG domain, the CNN weights will actually learn to cope with the approximation, and fewer spatial frequencies are required to get good accuracy. We again compare ASM to APX in this setting. The result, shown in Figure \ref{fig:rt}, confirms that the ASM method again outperforms the APX method and that the network weights have learned to cope with the approximation.
 \subsection{Efficiency of Training and Testing}
-\TODO simple test here, show averaged timing results for training and testing both datasets, then show images/sec for inference for both models. Try to compute number of operations on average by measuring sparsity (???)
\ No newline at end of file
+\begin{figure}[b]
+\includegraphics[width=\linewidth]{plots/throughput.eps}
+\caption{Throughput. The JPEG model has a more complex gradient which limits speed improvement during training. Inference, however, sees considerably higher throughput.}
+\label{fig:tp}
+\end{figure}
+Finally, we show the throughput for training and testing. For this, we train and test both a spatial model and a JPEG model on all three datasets and measure the time taken, which is then converted to an average throughput. The experiment is performed on an NVIDIA Pascal GPU with a batch size of 40 images. The results, shown in Figure \ref{fig:tp}, show that the JPEG model outperforms the spatial model in all cases, but that the gain during training is limited, likely because of the more complex gradient created by the convolution and ReLu operations. At inference time, however, throughput is greatly improved over the spatial model.
\ No newline at end of file
 \section{JPEG Domain Residual Networks}
-The ResNet architecture, generally, consists of blocks of four basic operations: Convolution (potentially strided), ReLu, Batch Normalization, and Component-wise addition, with the blocks terminating in a global average pooling operation \cite{he2016deep} before a fully connected layer performs the final classification. Our goal will be to develop JPEG domain equivalents to these five operations. Network activations are given as a single tensor holding a batch of multi-channel images, that is $I \in V^* \otimes V^* \otimes V^* \otimes V^*$.
+The ResNet architecture, generally, consists of blocks of four basic operations: Convolution (potentially strided), ReLu, Batch Normalization, and Component-wise addition, with the blocks terminating in a global average pooling operation \cite{he2016deep} before a fully connected layer performs the final classification. Our goal will be to develop JPEG domain equivalents to these five operations. Network activations are given as a single tensor holding a batch of multi-channel images, that is $I \in N^* \otimes C^* \otimes H^* \otimes W^*$.
 \subsection{Convolution}
-The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}. The convolution operation in the spatial domain is a shorthand notation for a linear map $C: V^* \otimes V^* \otimes V^* \otimes V^* \rightarrow V^* \otimes V^* \otimes V^* \otimes V^*$. Since the same operation is applied to each image in the batch, we can represent $C$ with a type (3, 3) tensor. The entries of this tensor give the coefficient for a given pixel in a given input channel for each pixel in each output channel. This notation can describe any multichannel linear pixel manipulation. We now develop the algorithm for representing discrete convolutional filters using this data structure.
+The convolution operation follows directly from the discussion in Section \ref{sec:backjlm}. The convolution operation in the spatial domain is a shorthand notation for a linear map $C: N^* \otimes C^* \otimes H^* \otimes W^* \rightarrow N^* \otimes C^* \otimes H^* \otimes W^*$. Since the same operation is applied to each image in the batch, we can represent $C$ with a type (3, 3) tensor. The entries of this tensor give the coefficient for a given pixel in a given input channel for each pixel in each output channel. This notation can describe any multichannel linear pixel manipulation. We now develop the algorithm for representing discrete convolutional filters using this data structure.
-A naive algorithm can simply copy randomly initialized convolution weights into this larger structure following the formula for a 2D discrete cross-correlation and then apply the JPEG compression and decompression tensors to the result, but this is difficult to parallelize and incurs additional memory overhead to store the spatial domain operation. A more efficient algorithm would produce the JPEG domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Note that since $\widetilde{J} \in V \otimes V \otimes V \otimes V^* \otimes V^*$ the last two indices of $\widetilde{J}$ form single channel image under our image model (\eg the last two indices are in $V^* \otimes V^*$). If the convolution can be applied to this "image", then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor $\widehat{J} \in V \otimes V^* \otimes V^*$
+A naive algorithm can simply copy randomly initialized convolution weights into this larger structure following the formula for a 2D discrete cross-correlation and then apply the JPEG compression and decompression tensors to the result, but this is difficult to parallelize and incurs additional memory overhead to store the spatial domain operation. A more efficient algorithm would produce the JPEG domain operation directly and be easy to express as a compute kernel for a GPU. Start by considering the JPEG decompression tensor $\widetilde{J}$. Note that since $\widetilde{J} \in X \otimes Y \otimes K \otimes H^* \otimes W^*$, the last two indices of $\widetilde{J}$ form a single channel image under our image model (\eg the last two indices are in $H^* \otimes W^*$). If the convolution can be applied to this "image", then the resulting map would decompress and convolve simultaneously. We can formulate a new tensor $\widehat{J} \in N \otimes H^* \otimes W^*$
 by reshaping $\widetilde{J}$ and treating it as a batch of images. Then, given the initialized convolution filter $K$, computing
 \begin{equation}
 \widehat{C}^b = \widehat{J}^b \star K
 \end{equation}
 gives us the desired map. After reshaping $\widehat{C}$ back to the original shape of $\widetilde{J}$ to give $\widetilde{C}$, the full compressed domain operation can be expressed as
 \begin{equation}
-\Xi^{mxyk}_{nx'y'k'} = \widetilde{C}^{mxyk}_{nsr}J^{sr}_{x'y'k'}
+\Xi^{cxyk}_{c'x'y'k'} = \widetilde{C}^{cxyk}_{c'hw}J^{hw}_{x'y'k'}
 \end{equation}
 where $c$ and $c'$ index the input and output channels of the image respectively. This algorithm skips the overhead of computing the spatial domain map explicitly and depends only on the batched convolution operation, which is available in all GPU accelerated deep learning libraries. The algorithm is shown in the supplementary material.
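A possible PyTorch rendering of this algorithm for a single input channel is sketched below; the function and tensor names are illustrative, and the odd-kernel "same" padding is an assumption, not the paper's implementation:

import torch
import torch.nn.functional as F

def jpeg_domain_conv(J_tilde, J, filt):
    # J_tilde: (X, Y, K, H, W) decompression tensor; J: (H, W, X, Y, K)
    # compression tensor; filt: (C_out, 1, kh, kw) initialized filters
    X, Y, K, H, W = J_tilde.shape
    # Treat the last two indices of J_tilde as a batch of single-channel
    # "images" of shape (H, W)
    batch = J_tilde.reshape(X * Y * K, 1, H, W)
    # One batched cross-correlation convolves every such image at once
    c_hat = F.conv2d(batch, filt, padding=filt.shape[-1] // 2)
    # Reshape back toward the shape of J_tilde, with a channel index added
    c_tilde = c_hat.reshape(X, Y, K, filt.shape[0], H, W)
    # Contract with J to get the full JPEG domain operation Xi
    return torch.einsum('xykchw,hwabd->cxykabd', c_tilde, J)

# Usage with random stand-ins (2x2 grid of 8x8 blocks, four 3x3 filters):
J_tilde = torch.rand(2, 2, 64, 16, 16)
J = torch.rand(16, 16, 2, 2, 64)
Xi = jpeg_domain_conv(J_tilde, J, torch.rand(4, 1, 3, 3))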
@@ -43,7 +43,7 @@ What we can do to solve the first problem is to look at the ranges that the line
 The final problem is that we now have a mask in the spatial domain, but the original image is in the DCT domain. There is a well-known algorithm for pixelwise multiplication of two DCT images \cite{smith1993algorithms}, but it would require the mask to also be in the DCT domain. Fortunately, there is a straightforward solution that comes as a result of the multilinear analysis given in Section \ref{sec:backjlm}. Consider the bilinear map
 \begin{equation}
-H: V^* \otimes V^* \times V^* \otimes V^* \rightarrow V^* \otimes V^*
+H: A^* \otimes B^* \times I^* \otimes J^* \rightarrow A^* \otimes B^*
 \end{equation}
 that takes a DCT block $F$ and a mask $M$ and produces the masked DCT block by pixelwise multiplication. Our task will be to derive the form of $H$. We proceed by construction. Naively, the steps of such an algorithm would be
 \begin{enumerate}
......
@@ -5,11 +5,11 @@ File=supplement.pdf
 [Session]
 Bookmarks=@Invalid()
 CurrentFile=supplement.tex
-File0\Col=19
+File0\Col=49
 File0\EditorGroup=0
 File0\FileName=supplement.tex
-File0\FirstLine=15
+File0\FirstLine=6
 File0\FoldedLines=
-File0\Line=34
+File0\Line=15
 FileVersion=1
 MasterFile=supplement.tex