Reduce bandwidth usage of PNG transfers over HTTP with this one weird trick

Summary

Basically the idea is: If the browser says it supports Content-Encoding: br, don't compress PNG IDAT chunks (more precisely, store them all with DEFLATE Non-compressed blocks) and compress the whole file with Brotli instead.

This was a one-day weekend hobby project; it has nothing to do with my employer, and it's probably safe to consider this project "abandoned as soon as it was put on the web". I have no idea how this impacts things like browser RAM usage. I have, in general, not really properly evaluated this in any way.

Explanation

PNG files consist almost entirely of one big DEFLATE-compressed stream (see RFC 2083, section 5), plus some metadata. This DEFLATE-compressed stream contains all the scanlines (in other words, rows of pixels) that make up the image, with each scanline preprocessed using one of five different filters; the filters don't change the size of the data, they just re-encode it (see RFC 2083, section 6) in a way that is intended to make it more easily compressible with DEFLATE.
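To make the filtering step a bit more concrete, here is a minimal illustrative sketch (mine, not part of the tool below) of one of those filters, the "Sub" filter from RFC 2083, section 6; the function name and parameters are made up for illustration:

/* Sketch: apply the PNG "Sub" filter (filter type 1) to one scanline.
 * The output is one filter-type byte followed by the filtered bytes, so the
 * scanline stays the same size; the data just (hopefully) compresses better.
 * "bpp" is the number of bytes per complete pixel, rounded up to at least 1. */
#include <stddef.h>
#include <stdint.h>

void filter_sub(const uint8_t *raw, uint8_t *out, size_t scanline_bytes, size_t bpp) {
    out[0] = 1;  /* filter type byte: 1 = Sub */
    for (size_t x = 0; x < scanline_bytes; x++) {
        uint8_t left = (x >= bpp) ? raw[x - bpp] : 0;  /* byte bpp positions to the left, or 0 */
        out[x + 1] = (uint8_t)(raw[x] - left);         /* difference modulo 256 */
    }
}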

HTTP can reduce the size of response bodies using the Content-Encoding mechanism. There are two main compression algorithms used with that mechanism: gzip, which is based on DEFLATE (the same compression PNG uses internally), and br (Brotli), a newer algorithm that generally compresses better.

The usual advice is that you shouldn't be using this for things like PNG images because they have already been compressed; but actually, since a PNG image is almost entirely just a DEFLATE-compressed file, we can rewrite the DEFLATE stream inside the PNG file to just store data verbatim without compression and then apply Brotli compression at the HTTP layer.
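To give a rough idea of what "store data verbatim" costs at the DEFLATE level: a non-compressed ("stored") block as defined in RFC 1951, section 3.2.4 has about 5 bytes of framing and can hold at most 65535 literal bytes, so the rewritten IDAT stream is only marginally larger than the raw filtered scanline data. Here is a minimal sketch of such a block (the tool below doesn't do this by hand, it just asks zlib for Z_NO_COMPRESSION):

#include <stdint.h>
#include <vector>

/* Sketch: emit one DEFLATE stored block (BTYPE=00), assuming the block header
 * starts on a byte boundary. A full zlib stream additionally wraps the blocks
 * in a 2-byte header and a 4-byte Adler-32 trailer. */
std::vector<uint8_t> stored_block(const uint8_t *data, uint16_t len, bool final_block) {
    std::vector<uint8_t> out;
    out.push_back(final_block ? 1 : 0);       /* BFINAL bit + BTYPE=00; rest of the byte is padding */
    out.push_back(len & 0xff);                /* LEN (little-endian) */
    out.push_back(len >> 8);
    out.push_back(~len & 0xff);               /* NLEN = one's complement of LEN */
    out.push_back((~len >> 8) & 0xff);
    out.insert(out.end(), data, data + len);  /* the literal data, byte-aligned */
    return out;
}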

In terms of HTTP bandwidth usage, this is almost equivalent to swapping out the compression mechanism inside PNG (lots of handwaving here), except that this method just works in browsers today, without requiring any custom JavaScript hackery. (You might want to add some server-side hackery though so that this hack is only used for browsers that support Brotli while shipping normal PNGs to other browsers? Or something like that? IDK.)
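For the server-side part, one possible (hypothetical, untested) setup is to pre-generate both variants offline and let the web server pick one based on the Accept-Encoding request header. For example, nginx built with the third-party ngx_brotli module can do this with brotli_static, which serves an existing .br sibling file with Content-Encoding: br to clients that advertise br support, and falls back to the plain file otherwise:

# Hypothetical layout, assuming nginx is built with the ngx_brotli module:
#   /var/www/img/foo.png     - normal PNG (DEFLATE-compressed IDAT), for non-Brotli clients
#   /var/www/img/foo.png.br  - brotli(uncompressed-IDAT variant of foo.png), generated offline
location /img/ {
    brotli_static on;  # if the client sends "Accept-Encoding: ... br ...", serve foo.png.br
                       # with "Content-Encoding: br"; otherwise serve foo.png as-is
}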

Really crappy evaluation

Wikipedia logo

The Wikipedia logo (from https://en.wikipedia.org/static/images/project-logos/enwiki.png) is pretty well-optimized already; pngcrush and ZopfliPNG at default settings would both actually produce bigger files:

$ pngcrush enwiki.png
  Recompressing IDAT chunks in enwiki.png to pngout.png
  Total length of data found in critical chunks            =   8463
  Best pngcrush method        =   7 (ws 15 fm 0 zl 9 zs 0) =   8923
  CPU time decode 0.001856, encode 0.013436, other 0.001170, total 0.018536 sec
$ zopflipng --always_zopflify enwiki.png enwiki.zopfli.png
Optimizing enwiki.png
Input size: 8578 (8K)
Result size: 8689 (8K). Percentage of original: 101.294%
Original was smaller
But if we take the original image, undo the DEFLATE compression, and add brotli around it:
$ ../uncompressed_png enwiki.png
bits per (uncompressed) pixel: 8
bytes per scanline (without prefix): 135
inflated IDAT size: 21080
$ brotli enwiki.png.uncompressed
$ ls -l enwiki.png{,.uncompressed.br}
-rw-r--r-- 1 jann jann 8578  3. Sep 2020  enwiki.png
-rw-r--r-- 1 jann jann 8381  3. Mai 04:13 enwiki.png.uncompressed.br
Then we can reduce its size by ~2%! (live demo)

Okay, sure, that's not super impressive. The next example looks a bit better.

Webcomic strip

Here I'm testing with the image from this webcomic.

$ pngcrush 2013-02-07-0453-cassandra.png
  Recompressing IDAT chunks in 2013-02-07-0453-cassandra.png to pngout.png
  Total length of data found in critical chunks            =   59448
  Best pngcrush method        =   4 (ws 15 fm 0 zl 9 zs 1) =   57873
  CPU time decode 0.008674, encode 0.203350, other 0.001556, total 0.217217 sec
$ zopflipng 2013-02-07-0453-cassandra.png 2013-02-07-0453-cassandra-zopfli.png
Optimizing 2013-02-07-0453-cassandra.png
Input size: 59469 (58K)
Result size: 54276 (53K). Percentage of original: 91.268%
Result is smaller
$ ../uncompressed_png 2013-02-07-0453-cassandra.png
bits per (uncompressed) pixel: 8
bytes per scanline (without prefix): 980
inflated IDAT size: 322749
$ brotli 2013-02-07-0453-cassandra.png.uncompressed -o 2013-02-07-0453-cassandra.png.br
$ ../uncompressed_png pngout.png
bits per (uncompressed) pixel: 8
bytes per scanline (without prefix): 980
inflated IDAT size: 322749
$ brotli pngout.png.uncompressed -o pngout.png.br
$ ../uncompressed_png 2013-02-07-0453-cassandra-zopfli.png
bits per (uncompressed) pixel: 4
bytes per scanline (without prefix): 490
inflated IDAT size: 161539
$ brotli 2013-02-07-0453-cassandra-zopfli.png.uncompressed -o 2013-02-07-0453-cassandra-zopfli.png.br
$ ls -l --sort=size *.png *.br
-rw-r--r-- 1 jann jann 59469  4. Feb 2013  2013-02-07-0453-cassandra.png
-rw-r--r-- 1 jann jann 57894  3. Mai 04:56 pngout.png
-rw-r--r-- 1 jann jann 54276  3. Mai 04:56 2013-02-07-0453-cassandra-zopfli.png
-rw-r--r-- 1 jann jann 51461  3. Mai 05:06 2013-02-07-0453-cassandra-zopfli.png.br
-rw-r--r-- 1 jann jann 50024  3. Mai 04:58 2013-02-07-0453-cassandra.png.br
-rw-r--r-- 1 jann jann 50024  3. Mai 04:59 pngout.png.br
That's a ~7.8% size reduction from what ZopfliPNG gave us.

Crazier and useless idea: double-DEFLATE for Huffman table deduplication?

If we didn't have Brotli, we could maybe abuse the two layers of DEFLATE to build a shoddy imitation of switchable Huffman dictionaries: use the inner DEFLATE layer for normal Huffman coding and LZ77, but ensure that the inner DEFLATE blocks are aligned to whole bytes (by padding with empty stored blocks?), and then use the outer DEFLATE layer to deduplicate the inner layer's Huffman tables with LZ77. That way you could maybe switch Huffman codes for every scanline without paying the full cost of a Huffman dictionary on every scanline.

I haven't tested or evaluated this double-DEFLATE idea in any way, just thought I'd write it down here... It probably doesn't matter in practice since most places support Brotli by now anyway.

Code

This is the C++ code I used to test this - it takes a PNG image and rewrites the DEFLATE stream inside the IDAT chunk(s) so that it stores the scanline data uncompressed:
/* uncompressed_png.cc from https://thejh.net/written-stuff/http-brotli-png-compression */
/* compile with "clang++ -O3 -o uncompressed_png uncompressed_png.cc -lz" or something like that */
#include <vector>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>
#include <assert.h>
#include <fcntl.h>
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <limits.h>
#include <arpa/inet.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include "zlib.h"

#define CT_IHDR (('I'<<24)|('H'<<16)|('D'<<8)|('R'<<0))
#define CT_PLTE (('P'<<24)|('L'<<16)|('T'<<8)|('E'<<0))
#define CT_IDAT (('I'<<24)|('D'<<16)|('A'<<8)|('T'<<0))
#define CT_IEND (('I'<<24)|('E'<<16)|('N'<<8)|('D'<<0))

#define COLORT_PALETTE 1U
#define COLORT_COLOR 2U
#define COLORT_ALPHA 4U

/* based on https://zlib.net/zpipe.c, with modifications */
void inflate_to_buffer(unsigned char *input, size_t input_len,
                       unsigned char *output, size_t output_len) {
  int ret;
  z_stream strm;

  /* allocate inflate state */
  strm.zalloc = Z_NULL;
  strm.zfree = Z_NULL;
  strm.opaque = Z_NULL;
  strm.avail_in = input_len;
  strm.next_in = input;
  strm.avail_out = output_len;
  strm.next_out = output;
  ret = inflateInit(&strm);
  if (ret != Z_OK)
    errx(1, "inflateInit failed");

  ret = inflate(&strm, Z_FINISH);
  assert(ret != Z_STREAM_ERROR); /* state not clobbered */
  switch (ret) {
  case Z_NEED_DICT:
  case Z_DATA_ERROR:
  case Z_MEM_ERROR:
    errx(1, "inflate failed");
  }

  /* clean up and return */
  (void)inflateEnd(&strm);
  if (ret != Z_STREAM_END)
    errx(1, "inflate failed");
  if (strm.avail_out != 0)
    errx(1, "inflated to fewer bytes than expected");
}

unsigned char *deflate_to_buffer(unsigned char *input, size_t input_len,
                                 size_t *output_len) {
  int ret;
  z_stream strm;

  /* allocate inflate state */
  strm.zalloc = Z_NULL;
  strm.zfree = Z_NULL;
  strm.opaque = Z_NULL;
  strm.avail_in = input_len;
  strm.next_in = input;
  strm.avail_out = 0;
  strm.next_out = Z_NULL;
  ret = deflateInit(&strm, Z_NO_COMPRESSION);
  if (ret != Z_OK)
    errx(1, "deflateInit failed");

  uLong buf_size = deflateBound(&strm, input_len);
  unsigned char *buf = (unsigned char *)malloc(buf_size);
  strm.avail_out = buf_size;
  strm.next_out = buf;
  if (strm.next_out == nullptr)
    errx(1, "malloc fail");

  ret = deflate(&strm, Z_FINISH);
  assert(ret != Z_STREAM_ERROR); /* state not clobbered */

  /* clean up and return */
  (void)deflateEnd(&strm);
  if (ret != Z_STREAM_END)
    errx(1, "deflate failed");
  *output_len = buf_size - strm.avail_out;
  return buf;
}

/* code based on sample code from RFC 2083 */

/* Table of CRCs of all 8-bit messages. */
static uint32_t crc_table[256];

/* Make the table for a fast CRC. */
static void make_crc_table(void) {
  for (uint32_t n = 0; n < 256; n++) {
    uint32_t c = n;
    for (int k = 0; k < 8; k++) {
      if (c & 1)
        c = 0xedb88320L ^ (c >> 1);
      else
        c = c >> 1;
    }
    crc_table[n] = c;
  }
}

#define INIT_CRC 0xffffffffU

static uint32_t update_crc(uint32_t crc, unsigned char *buf, size_t len) {
  uint32_t c = crc;
  for (size_t n = 0; n < len; n++) {
    c = crc_table[(c ^ buf[n]) & 0xff] ^ (c >> 8);
  }
  return c;
}

static void write_all(int fd, void *buf, size_t len) {
  while (len) {
    ssize_t res = write(fd, buf, len);
    if (res == -1)
      err(1, "write to fd %d", fd);
    if (res == 0)
      errx(1, "fd %d hit EOF on write", fd);
    buf = (char*)buf + res;
    len -= res;
  }
}

int main(int argc, char **argv) {
  make_crc_table();

  if (argc != 2)
    errx(1, "usage");
  int fd = open(argv[1], O_RDONLY);
  if (fd == -1)
    err(1, "open %s", argv[1]);
  struct stat st;
  if (fstat(fd, &st))
    err(1, "fstat");
  if (st.st_size < 8)
    errx(1, "input too small for signature");
  unsigned char *orig_file = (unsigned char *)mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  if (orig_file == MAP_FAILED)
    err(1, "mmap");

  if (orig_file[0] != 137 || orig_file[1] != 80 || orig_file[2] != 78 ||
      orig_file[3] != 71 || orig_file[4] != 13 || orig_file[5] != 10 ||
      orig_file[6] != 26 || orig_file[7] != 10)
    errx(1, "not a PNG file");

  uint32_t hdr_width;
  uint32_t hdr_height;
  uint8_t hdr_depth;
  uint8_t hdr_color_type;
  uint8_t hdr_compression_method;
  uint8_t hdr_filter_method;
  uint8_t hdr_interlace_method;

  size_t off = 8;
  size_t idat_start = 0, idat_end = 0;
  bool seen_hdr = false;
  bool seen_end = false;
  bool seen_idat = false;
  std::vector<struct iovec> idat_chunks;
  uint32_t idat_total_len = 0;
  while (off < st.st_size) {
    if (st.st_size - off < 12)
      errx(1, "incomplete chunk header?");
    uint32_t chunk_length = ntohl(*(uint32_t*)(orig_file + off    ));
    uint32_t chunk_type   = ntohl(*(uint32_t*)(orig_file + off + 4));
    if (st.st_size - (off + 12) < chunk_length)
      errx(1, "chunk truncated (chunk_length=0x%x)", chunk_length);
    unsigned char *chunk_data = orig_file + off + 8;

    uint32_t test_crc = update_crc(INIT_CRC, orig_file + off + 4, 4);
    test_crc = update_crc(test_crc, chunk_data, chunk_length);
    test_crc = ~test_crc;
    uint32_t actual_crc = ntohl(*(uint32_t*)(chunk_data+chunk_length));
    if (actual_crc != test_crc)
      errx(1, "CRC mismatch");

    if (!seen_hdr) {
      seen_hdr = true;
      if (chunk_type != CT_IHDR)
        errx(1, "first chunk isn't IHDR");
      if (chunk_length != 13)
        errx(1, "first chunk length is bogus");
      hdr_width = ntohl(*(uint32_t*)(chunk_data + 0));
      hdr_height = ntohl(*(uint32_t*)(chunk_data + 4));
      hdr_depth = chunk_data[8];
      hdr_color_type = chunk_data[9];
      hdr_compression_method = chunk_data[10];
      hdr_filter_method = chunk_data[11];
      hdr_interlace_method = chunk_data[12];

      if (hdr_width == 0 || hdr_height == 0 ||
          hdr_width > INT32_MAX || hdr_height > INT32_MAX)
        errx(1, "invalid dimensions");
      if (hdr_depth != 1 && hdr_depth != 2 && hdr_depth != 4 &&
          hdr_depth != 8 && hdr_depth != 16)
        errx(1, "invalid depth");
      switch (hdr_color_type) {
      case 0: /* grayscale sample pixels */
        if (hdr_depth != 1 && hdr_depth != 2 && hdr_depth != 4 &&
            hdr_depth != 8 && hdr_depth != 16)
          errx(1, "invalid depth for grayscale");
        break;
      case COLORT_COLOR: /* RGB triple pixels */
      case COLORT_ALPHA: /* grayscale+alpha pixels */
      case (COLORT_ALPHA|COLORT_COLOR): /* RGB triple + alpha pixels */
        if (hdr_depth != 8 && hdr_depth != 16)
          errx(1, "invalid depth for RGB / grayscale+alpha / RGB+alpha");
        break;
      case (COLORT_COLOR|COLORT_PALETTE): /* pixels are palette indices */
        if (hdr_depth != 1 && hdr_depth != 2 &&
            hdr_depth != 4 && hdr_depth != 8)
          errx(1, "invalid depth for RGB palette");
        break;
      default:
        errx(1, "invalid color type");
      }
      if (hdr_compression_method != 0)
        errx(1, "unknown compression method");
      if (hdr_filter_method != 0)
        errx(1, "unknown filter method");
      if (hdr_interlace_method != 0 && hdr_interlace_method != 1)
        errx(1, "unknown interlace method");
    } else {
      if (chunk_type == CT_IHDR)
        errx(1, "IHDR after initial chunk");
    }

    if (seen_end)
      errx(1, "chunk after IEND");
    if (chunk_type == CT_IEND)
      seen_end = true;

    if (chunk_type == CT_IDAT) {
      if (!seen_idat) {
        idat_start = off;
        seen_idat = true;
      }
      if (idat_end)
        errx(1, "idat chunks are non-adjacent");
      idat_chunks.push_back((struct iovec){.iov_base = chunk_data, .iov_len = chunk_length});
      if (UINT32_MAX - idat_total_len < chunk_length)
        errx(1, "files bigger than 4GiB are not supported");
      idat_total_len += chunk_length;
    } else {
      if (seen_idat && idat_end == 0)
        idat_end = off;
    }

    off += chunk_length + 12;
  }
  if (!seen_hdr)
    errx(1, "no chunks in file");
  if (!seen_end)
    errx(1, "no IEND, truncated file?");

  uint32_t pixel_bits = (hdr_color_type & COLORT_PALETTE) ?
      hdr_depth :
      (hdr_depth * (
        ((hdr_color_type & COLORT_COLOR) ? 3 : 1) +
        ((hdr_color_type & COLORT_ALPHA) ? 1 : 0)
      ));
  printf("bits per (uncompressed) pixel: %u\n", pixel_bits);
  if (pixel_bits * (uint64_t)hdr_width > UINT32_MAX)
    errx(1, "scanline size overflows u32");
  uint32_t scanline_bits = pixel_bits * hdr_width;
  uint32_t scanline_bytes = (scanline_bits + 7) / 8; // round up
  printf("bytes per scanline (without prefix): %u\n", scanline_bytes);
  if ((scanline_bytes+1) * (uint64_t)hdr_height > UINT32_MAX)
    errx(1, "raw image overflows u32");
  uint32_t inflated_idat_size = (scanline_bytes+1) * hdr_height;
  printf("inflated IDAT size: %u\n", inflated_idat_size);

  unsigned char *idat_data = (unsigned char *)malloc(idat_total_len);
  if (idat_data == nullptr)
    errx(1, "malloc");
  size_t idat_data_written = 0;
  for (struct iovec iov : idat_chunks) {
    memcpy(idat_data + idat_data_written, iov.iov_base, iov.iov_len);
    idat_data_written += iov.iov_len;
  }
  assert(idat_data_written == idat_total_len);

  unsigned char *inflated_idat = (unsigned char *)malloc(inflated_idat_size);
  if (inflated_idat == nullptr)
    errx(1, "malloc");
  inflate_to_buffer(idat_data, idat_total_len, inflated_idat, inflated_idat_size);

  char uidat_name[PATH_MAX];
  snprintf(uidat_name, sizeof(uidat_name), "%s.uncompressed", argv[1]);
  int uidat_fd = open(uidat_name, O_WRONLY|O_TRUNC|O_CREAT, 0666);
  if (uidat_fd == -1)
    err(1, "open %s for writing", uidat_name);

  write_all(uidat_fd, orig_file, idat_start);

  size_t repacked_idat_size;
  unsigned char *repacked_idat_body = deflate_to_buffer(inflated_idat, inflated_idat_size, &repacked_idat_size);
  struct {
    uint32_t length;
    uint32_t type;
  } __attribute__((packed)) new_idat_head = {
    .length = htonl(repacked_idat_size),
    .type = htonl(CT_IDAT),
  };
  write_all(uidat_fd, &new_idat_head, sizeof(new_idat_head));
  uint32_t crc = update_crc(INIT_CRC, (unsigned char *)&new_idat_head.type, sizeof(new_idat_head.type));
  write_all(uidat_fd, repacked_idat_body, repacked_idat_size);
  crc = update_crc(crc, repacked_idat_body, repacked_idat_size);
  uint32_t nbo_crc = htonl(~crc);
  write_all(uidat_fd, (unsigned char *)&nbo_crc, sizeof(nbo_crc));

  write_all(uidat_fd, orig_file+idat_end, st.st_size-idat_end);

  close(uidat_fd);
  return 0;
}