This is an automated email from the ASF dual-hosted git repository. blue pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push: new 709e25e PARQUET-1290: clarify run lengths for RLE encoding (#96) 709e25e is described below commit 709e25e9abe05f648dd96d99518141791ba94101 Author: Tim Armstrong <tim.g.armstr...@gmail.com> AuthorDate: Mon May 7 09:51:05 2018 -0700 PARQUET-1290: clarify run lengths for RLE encoding (#96) --- Encodings.md | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/Encodings.md b/Encodings.md index f3b8d50..9358b13 100644 --- a/Encodings.md +++ b/Encodings.md @@ -72,12 +72,14 @@ length := length of the <encoded-data> in bytes stored as 4 bytes little endian encoded-data := <run>* run := <bit-packed-run> | <rle-run> bit-packed-run := <bit-packed-header> <bit-packed-values> -bit-packed-header := varint-encode(<bit-pack-count> << 1 | 1) +bit-packed-header := varint-encode(<bit-pack-scaled-run-len> << 1 | 1) // we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8 -bit-pack-count := (number of values in this run) / 8 +bit-pack-scaled-run-len := (bit-packed-run-len) / 8 +bit-packed-run-len := *see 3 below* bit-packed-values := *see 1 below* rle-run := <rle-header> <repeated-value> -rle-header := varint-encode( (number of times repeated) << 1) +rle-header := varint-encode( (rle-run-len) << 1) +rle-run-len := *see 3 below* repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width) ``` @@ -107,6 +109,13 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex 2. varint-encode() is ULEB-128 encoding, see https://en.wikipedia.org/wiki/LEB128 +3. bit-packed-run-len and rle-run-len must be in the range \[1, 2<sup>31</sup> - 1\]. + This means that a Parquet implementation can always store the run length in a signed + 32-bit integer. This length restriction was not part of the Parquet 2.5.0 and earlier + specifications, but longer runs were not readable by the most common Parquet + implementations so, in practice, were not safe for Parquet writers to emit. + + Note that the RLE encoding method is only supported for the following types of data: -- To stop receiving notification emails like this one, please contact b...@apache.org.