We previously reserved an LZ4 CompressionKind and plan to implement it in the Presto reader and writer. Unlike Snappy, the LZ4 block format does not record the uncompressed length. Thus, when reading, we need to allocate an output buffer of the full compressionBlockSize for every stream. This can waste a lot of memory when there are many streams and many open readers.
We propose to prefix each LZ4 block with its uncompressed size. I see a few ways of doing it: 1) Variable-length integer, the same as Snappy. 2) Fixed 3-byte integer, little-endian. 3) Fixed 4-byte integer, little-endian. Option #1 is more complicated, uses more CPU to decode, and probably doesn't save much space: a varint carries 7 bits per byte, so any length of 16kB or more already takes 3 bytes. Option #2 restricts the maximum uncompressed size to 16MB minus 1 byte. That is ridiculously large for a per-stream buffer, and current writers cap the buffer size at a reasonable 256kB, so it shouldn't be a problem in practice, but it's worth calling out here. Option #3 is flexible but in practice will waste a byte per block. My vote is for option #2.
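
To make option #2 concrete, here is a minimal sketch of what writing and reading the 3-byte little-endian prefix could look like. The class and method names (Lz4LengthPrefix, writeLength, readLength, varintSize) are made up for illustration and are not existing Presto or ORC APIs. varintSize is only there to compare against option #1, assuming the usual 7-bits-per-byte varint encoding.

```java
// Hypothetical helper, not part of Presto: sketches the option #2 prefix.
final class Lz4LengthPrefix {
    // Number of bytes the fixed prefix occupies in front of each LZ4 block.
    static final int HEADER_SIZE = 3;
    // Largest uncompressed length that fits in 3 bytes: 16MB minus 1 byte.
    static final int MAX_LENGTH = (1 << 24) - 1;

    private Lz4LengthPrefix() {}

    // Writes the uncompressed length as 3 bytes, least significant byte first.
    static void writeLength(byte[] out, int offset, int length) {
        if (length < 0 || length > MAX_LENGTH) {
            throw new IllegalArgumentException("length does not fit in 3 bytes: " + length);
        }
        out[offset] = (byte) length;
        out[offset + 1] = (byte) (length >>> 8);
        out[offset + 2] = (byte) (length >>> 16);
    }

    // Reads the 3-byte little-endian length so the reader can size its output buffer exactly.
    static int readLength(byte[] in, int offset) {
        return (in[offset] & 0xFF)
                | ((in[offset + 1] & 0xFF) << 8)
                | ((in[offset + 2] & 0xFF) << 16);
    }

    // For comparison with option #1: bytes needed by a 7-bits-per-byte varint.
    static int varintSize(int value) {
        int size = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            size++;
        }
        return size;
    }
}
```

With this, a reader would call readLength on the first 3 bytes of a block and allocate exactly that many bytes instead of the full compressionBlockSize. For the 256kB cap mentioned above, varintSize(262144) is also 3, so the varint only saves a byte for blocks smaller than 16kB.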
