GitHub user willtemperley added a comment to the discussion: Why does the
Parquet encoding list show both 'PLAIN' and 'BIT_PACKED'? Is BIT_PACKED
actually used?
You have to look in the actual column data to find this - it's not in the
footer FileMetadata.
A column chunk consists of a list of pages, with an optional dictionary page
first. Each page, including the dictionary page, is preceded by an
uncompressed PageHeader in Thrift format.
For data pages (PageHeader.type == DATA_PAGE), the PageHeader contains a
DataPageHeader [1], in which you can find the encodings for the rep/def
levels.
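As a rough illustration of the page walk described above, here is a minimal Python sketch that decodes PageHeaders with a hand-rolled Thrift compact-protocol reader and reports the level encodings of each v1 data page. It is not how parquet-java or Arrow do it; the helper names (`read_varint`, `read_struct`, `level_encodings`) and the `sample` bytes are made up for this example. It assumes the usual Parquet field ids (PageHeader: 3 = compressed_page_size, 5 = data_page_header; DataPageHeader: 2, 3, 4 = encoding, definition_level_encoding, repetition_level_encoding) and does not handle Thrift maps, which PageHeader does not use.

```python
import struct

# Encoding enum values from parquet.thrift.
ENCODINGS = {0: "PLAIN", 2: "PLAIN_DICTIONARY", 3: "RLE", 4: "BIT_PACKED",
             5: "DELTA_BINARY_PACKED", 6: "DELTA_LENGTH_BYTE_ARRAY",
             7: "DELTA_BYTE_ARRAY", 8: "RLE_DICTIONARY", 9: "BYTE_STREAM_SPLIT"}

def read_varint(buf, pos):
    """Decode an unsigned varint; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def unzigzag(n):
    """Undo zigzag encoding of a signed integer."""
    return (n >> 1) ^ -(n & 1)

def read_value(buf, pos, ctype):
    """Decode one value of the given compact-protocol type id."""
    if ctype in (1, 2):            # bool stored in the field header itself
        return ctype == 1, pos
    if ctype == 3:                 # byte
        return int.from_bytes(buf[pos:pos + 1], "little", signed=True), pos + 1
    if ctype in (4, 5, 6):         # i16 / i32 / i64 as zigzag varints
        n, pos = read_varint(buf, pos)
        return unzigzag(n), pos
    if ctype == 7:                 # double, 8 bytes little-endian
        return struct.unpack("<d", buf[pos:pos + 8])[0], pos + 8
    if ctype == 8:                 # binary / string, length-prefixed
        n, pos = read_varint(buf, pos)
        return bytes(buf[pos:pos + n]), pos + n
    if ctype in (9, 10):           # list / set
        head = buf[pos]
        pos += 1
        size, etype = head >> 4, head & 0x0F
        if size == 15:             # long form: real size follows as a varint
            size, pos = read_varint(buf, pos)
        items = []
        for _ in range(size):
            if etype in (1, 2):    # list booleans occupy one byte each
                item, pos = buf[pos] == 1, pos + 1
            else:
                item, pos = read_value(buf, pos, etype)
            items.append(item)
        return items, pos
    if ctype == 12:                # nested struct
        return read_struct(buf, pos)
    raise ValueError(f"unsupported compact type {ctype}")

def read_struct(buf, pos):
    """Decode a struct into {field_id: value}; return (fields, new_pos)."""
    fields, last_id = {}, 0
    while True:
        head = buf[pos]
        pos += 1
        if head == 0:              # STOP marker ends the struct
            return fields, pos
        delta, ctype = head >> 4, head & 0x0F
        if delta:
            fid = last_id + delta
        else:                      # explicit field id follows
            n, pos = read_varint(buf, pos)
            fid = unzigzag(n)
        fields[fid], pos = read_value(buf, pos, ctype)
        last_id = fid

def level_encodings(chunk):
    """Yield (data, def, rep) encoding names for each v1 data page."""
    pos = 0
    while pos < len(chunk):
        header, pos = read_struct(chunk, pos)
        if 5 in header:            # PageHeader.data_page_header
            dph = header[5]
            yield tuple(ENCODINGS.get(dph[i], dph[i]) for i in (2, 3, 4))
        pos += header[3]           # skip compressed_page_size bytes of body

# Hand-built example: one v1 data page header followed by a 10-byte body.
sample = bytes.fromhex("15 00 15 14 15 14 2c 15 06 15 00 15 08 15 06 00 00") + bytes(10)
print(list(level_encodings(sample)))  # [('PLAIN', 'BIT_PACKED', 'RLE')]
```

To try this on a real file you would pass in the bytes of a single column chunk, starting at its first page. The start offset and length are recorded in the footer's ColumnMetaData (dictionary_page_offset or data_page_offset, and total_compressed_size).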
I've noticed that in Overture Maps Parquet files, which are also written with
parquet-mr (aka parquet-java), BIT_PACKED is stated as the encoding for the
definition levels when the definition levels are empty. This might be an
encoding artifact, but I haven't confirmed it. Since
definition_level_encoding is a required field, the writer has to put some
value there even when no levels are stored, so this is certainly a
possibility.
[1] From the Parquet Thrift definition (parquet.thrift):
```thrift
/** Data page header */
struct DataPageHeader {
  /**
   * Number of values, including NULLs, in this data page.
   *
   * If a OffsetIndex is present, a page must begin at a row
   * boundary (repetition_level = 0). Otherwise, pages may begin
   * within a row (repetition_level > 0).
   **/
  1: required i32 num_values
  /** Encoding used for this data page **/
  2: required Encoding encoding
  /** Encoding used for definition levels **/
  3: required Encoding definition_level_encoding;
  /** Encoding used for repetition levels **/
  4: required Encoding repetition_level_encoding;
  /** Optional statistics for the data in this page **/
  5: optional Statistics statistics;
}
```
GitHub link:
https://github.com/apache/arrow/discussions/47113#discussioncomment-14314874