End-to-end Optimized Image Compression

Johannes Ballé, Valero Laparra & Eero P. Simoncelli

This website accompanies our ICLR-2017 article, End-to-end Optimized Image Compression, available on arXiv.org

We've developed a transform coder, constructed from three stages of linear–nonlinear transformation. Each stage of the analysis (encoding) transform consists of a subsampled convolution with 128 filters (192 for RGB models, and 256 at high bit rates), whose responses are then divided by a weighted L2-norm (the square root of a weighted sum of squares, plus a constant) of all other filter responses at the same spatial location. This local normalization nonlinearity is inspired by local gain control behaviors observed in biological visual systems (see article for references). Note that it is not the same as "batch normalization" (which is spatially global, applied only during training, and typically includes mean subtraction) or "instance normalization" (which is also spatially global).

Convolution filters are square, with sizes {9,5,5}, and results are subsampled by factors of {4,2,2} in both dimensions at the {1st,2nd,3rd} stages, respectively. The transformed values are uniformly quantized and converted to a bit stream using an arithmetic entropy coder. The synthesis (decoding) transform mirrors the analysis transform in reverse order (weighted L2-norms are used as multipliers rather than divisors, and convolution is performed after upsampling), with distinct parameters (filter weights and normalization parameters).
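One analysis stage described above (subsampled convolution followed by divisive normalization) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation; all parameter names and shape conventions are assumptions.

```python
import numpy as np

def analysis_stage(x, kernels, beta, gamma, stride):
    """One analysis stage: subsampled convolution, then divisive
    normalization. Illustrative sketch only -- parameter names and
    shapes are assumptions, not the authors' exact conventions.

    x       : (H, W, C_in) input
    kernels : (k, k, C_in, C_out) square filters
    beta    : (C_out,) additive constants
    gamma   : (C_out, C_out) normalization weights
    stride  : subsampling factor
    """
    k = kernels.shape[0]
    rows = range(0, x.shape[0] - k + 1, stride)
    cols = range(0, x.shape[1] - k + 1, stride)
    # Subsampled ("strided") convolution, written as explicit correlation.
    u = np.array([[np.tensordot(x[i:i + k, j:j + k, :], kernels, axes=3)
                   for j in cols] for i in rows])
    # Each response is divided by a weighted L2-norm (square root of a
    # weighted sum of squares, plus a constant) of all filter responses
    # at the same spatial location.
    denom = np.sqrt(beta + np.einsum("hwi,ji->hwj", u ** 2, gamma))
    return u / denom
```

With a 16×16 input, 5×5 filters, and stride 2, this produces a 6×6 map per output channel.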

All parameters (convolution kernels, normalization weights, and constants for all analysis and synthesis stages) are optimized to minimize a weighted sum of bit rate (entropy) and distortion (mean squared error of the decoded image), using stochastic gradient descent over a large set of training images from the ImageNet database. This is performed for different values of the weight (λ), yielding coder parameters at different points along the rate–distortion tradeoff. To make the objective function continuous, we substitute additive uniform noise for the quantization step during the optimization (see article).
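The objective can be sketched as a toy NumPy function. The Laplacian prior and the interface below are placeholders for illustration (the article fits a more flexible parametric density to the code values); only the structure of the loss, with uniform noise substituted for rounding, follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def rate_estimate(y_tilde, prior_pdf):
    """Rate proxy: negative log-likelihood (in bits) of the noisy code
    values under a density model of the code distribution."""
    return -np.sum(np.log2(prior_pdf(y_tilde)))

def rd_loss(y, x, decode, lam):
    """Weighted sum of rate and distortion. The quantizer is replaced
    by additive uniform noise on [-1/2, 1/2), which makes the loss a
    continuous function of the transform parameters."""
    y_tilde = y + rng.uniform(-0.5, 0.5, size=y.shape)  # noise proxy for rounding
    laplace = lambda v: 0.5 * np.exp(-np.abs(v)) + 1e-9  # toy prior, not the article's
    R = rate_estimate(y_tilde, laplace)
    D = np.mean((decode(y_tilde) - x) ** 2)              # MSE distortion
    return R + lam * D
```

Larger λ weights distortion more heavily, giving higher-quality (higher-rate) operating points, consistent with the λ subfolders described below.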

This system shares many features with deep convolutional networks, but note that: (1) it contains only three stages; (2) the nonlinearities implement gain control computed jointly across channels, rather than pointwise rectification or sigmoidal functions; (3) the parameters are optimized for a combination of bit rate and reconstruction error, rather than for recognition error. This is an unsupervised learning process (it requires a large set of representative images, but they are unlabelled and uncategorized) and is closely related to the training of variational autoencoders (see article).

Example images, compressed at many different rates, are provided below, along with results obtained with JPEG and JPEG 2000. Although our coder was optimized for mean squared error, we find the compressed images more natural in appearance than those compressed with JPEG or JPEG 2000, both of which suffer from the severe artifacts commonly seen in linear transform coding methods. We also provide numerical comparisons (both PSNR and MS-SSIM, as a function of bit rate), which demonstrate an improvement over JPEG and JPEG 2000 for most images and bit rates.

None of the test images shown below were included in the training set, and all compressed images and numerical values (bit rate R, PSNR, and MS-SSIM) are computed using the learned transformations, uniform scalar quantization, and the entropy-coded bit stream. We compare our method to JPEG with 4:2:0 chroma subsampling, and to the OpenJPEG implementation of JPEG 2000 with the default "multiple component transform". For evaluating PSNR, we use the JPEG-defined conversion matrix to convert between RGB and Y'CbCr. For evaluating MS-SSIM, we use only the luma component.
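For reference, the RGB-to-Y'CbCr conversion and a basic PSNR computation can be sketched as below. The matrix uses the standard JPEG/JFIF full-range coefficients; whether this matches the exact matrix used for the published numbers is an assumption.

```python
import numpy as np

# JPEG (JFIF) full-range RGB -> Y'CbCr conversion matrix and offset.
# Assumed to correspond to the "JPEG-defined conversion matrix" in the text.
M = np.array([[ 0.299,     0.587,     0.114   ],
              [-0.168736, -0.331264,  0.5     ],
              [ 0.5,      -0.418688, -0.081312]])
OFFSET = np.array([0.0, 128.0, 128.0])

def rgb_to_ycbcr(rgb):
    """Convert an (H, W, 3) RGB image (0..255) to Y'CbCr."""
    return rgb.astype(np.float64) @ M.T + OFFSET

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two 8-bit images."""
    mse = np.mean((a - b) ** 2)
    return np.inf if mse == 0 else 10 * np.log10(peak ** 2 / mse)
```

For example, pure white maps to Y' = 255 with neutral chroma (Cb = Cr = 128), and two 8-bit images differing everywhere by one code value have a PSNR of about 48.13 dB.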

Source code and pre-trained model parameters

Matlab source code for the analysis/synthesis stages (for grayscale images only): [ compress.m, uncompress.m ]. Test code is provided at the bottom of the compress function. Both functions rely on the matlabPyrTools toolbox, which provides convolution and display code.

Note: these functions differ slightly from the original Python implementation. In particular, boundary handling is slightly different, and the CABAC arithmetic entropy coder is not included (the "compress" function returns the quantized code values). Nevertheless, entropy estimates of bit rate, as well as the PSNR of the decoded images, are very close to those provided in the article and in the examples given below.
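An entropy estimate of the bit rate, of the kind referred to above, can be computed directly from the quantized code values. This is a generic empirical-entropy sketch, not the code used for the published numbers:

```python
import numpy as np

def entropy_bits_per_symbol(q):
    """Empirical entropy (bits per code value) of an array of quantized
    code values -- an estimate of the rate an ideal entropy coder
    (e.g. the omitted CABAC coder) would achieve."""
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
```

Multiplying by the number of code values and dividing by the number of image pixels gives an estimated rate in bits per pixel.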

To use the Matlab code, you will also need to download the pre-trained model parameters (the current implementation only handles grayscale, but parameters for RGB are also provided). Each folder contains subfolders corresponding to the values of λ, each specifying a point along the rate–distortion curve (larger values for higher quality). Within each subfolder, there are two files containing the parameters for each stage of the analysis/synthesis transforms: one in Matlab format, and one as a Python pickled dictionary. Each file contains the parameters as defined in equations (1) to (6) in the article.

Note: the last analysis stage (analysis-03) and the first synthesis stage (synthesis-00) contain only the bias parameters (c), which serve to recenter the code space values in preparation for integer quantization (i.e., the "round" function).
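The recentering described in the note can be pictured as follows. The sign convention and function names here are assumptions for illustration, not the exact conventions of the parameter files:

```python
import numpy as np

def quantize(y, c):
    """Recenter code values with the bias c, then apply the 'round'
    integer quantizer. The sign convention is an assumption."""
    return np.round(y + c)

def dequantize(q, c):
    """Undo the recentering on the decoder side."""
    return q - c
```

Round-trip error under this scheme is at most half a quantization step per code value.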

Python code, including TensorFlow local normalization library, and training code

[8/2018] Local normalization nonlinearities have been added as an extension to TensorFlow: https://tensorflow.github.io/compression/docs/api_docs/python/tfc/GDN.html.
Compression code is here: https://github.com/tensorflow/compression.

Commonly used test images

Click on image to see compression results.

Kodak image set

Downloaded from here. Click on image to see compression results. Note: We removed 8 pixels from the borders of the original images, since these contained artifacts.

Additional example images (mostly our own)

Click on image to see compression results.