GitHub user willtemperley added a comment to the discussion: Why does the 
Parquet encoding list show both 'PLAIN' and 'BIT_PACKED'? Is BIT_PACKED 
actually used?

You have to look in the actual column data to find this - it's not in the 
footer FileMetadata.

A column chunk includes a list of pages, with an optional dictionary page 
first. Each of these, including the dictionary page, is preceded by an 
uncompressed PageHeader in Thrift format.

PageHeaders for data pages (PageHeader.type == DATA_PAGE) contain a 
DataPageHeader [1], in which you can find the encodings for the rep/def 
levels.
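
To make that concrete, here is a rough sketch in plain Python of decoding a single PageHeader and pulling out the level encodings. It is not taken from any library: `read_struct` and `describe_page_header` are hypothetical helper names, the decoder only handles the Thrift compact-protocol types that occur in a PageHeader, and the sample bytes at the end are hand-built for illustration rather than read from a real file.

```python
import io

# Thrift compact-protocol type codes (only those that occur in a PageHeader)
STOP, TRUE, FALSE, I16, I32, I64, BINARY, STRUCT = 0, 1, 2, 4, 5, 6, 8, 12

# Enum values from parquet.thrift
PAGE_TYPES = {0: "DATA_PAGE", 1: "INDEX_PAGE", 2: "DICTIONARY_PAGE", 3: "DATA_PAGE_V2"}
ENCODINGS = {0: "PLAIN", 2: "PLAIN_DICTIONARY", 3: "RLE", 4: "BIT_PACKED",
             5: "DELTA_BINARY_PACKED", 6: "DELTA_LENGTH_BYTE_ARRAY",
             7: "DELTA_BYTE_ARRAY", 8: "RLE_DICTIONARY", 9: "BYTE_STREAM_SPLIT"}


def read_varint(buf):
    """Read an unsigned LEB128 varint."""
    result = shift = 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result
        shift += 7


def read_zigzag(buf):
    """Read a zigzag-encoded signed integer."""
    n = read_varint(buf)
    return (n >> 1) ^ -(n & 1)


def read_struct(buf):
    """Decode one compact-protocol struct into {field_id: value}."""
    fields, last_id = {}, 0
    while True:
        header = buf.read(1)[0]
        if header == STOP:
            return fields
        ttype, delta = header & 0x0F, header >> 4
        # A non-zero delta is relative to the previous field id;
        # otherwise the field id follows as a zigzag varint.
        field_id = last_id + delta if delta else read_zigzag(buf)
        last_id = field_id
        if ttype in (TRUE, FALSE):
            value = ttype == TRUE  # bool fields carry their value in the type
        elif ttype in (I16, I32, I64):
            value = read_zigzag(buf)
        elif ttype == BINARY:
            value = buf.read(read_varint(buf))
        elif ttype == STRUCT:
            value = read_struct(buf)
        else:
            raise NotImplementedError(f"thrift type {ttype} not handled in this sketch")
        fields[field_id] = value


def describe_page_header(raw):
    """Summarise a PageHeader; level encodings only exist for DATA_PAGE."""
    hdr = read_struct(io.BytesIO(raw))
    info = {"type": PAGE_TYPES.get(hdr[1], hdr[1]), "compressed_page_size": hdr[3]}
    dph = hdr.get(5)  # PageHeader field 5: data_page_header
    if dph is not None:
        info["num_values"] = dph[1]
        info["encoding"] = ENCODINGS.get(dph[2], dph[2])
        info["definition_level_encoding"] = ENCODINGS.get(dph[3], dph[3])
        info["repetition_level_encoding"] = ENCODINGS.get(dph[4], dph[4])
    return info


# Hand-built header: DATA_PAGE, 10 bytes, 5 PLAIN values, BIT_PACKED levels.
sample = bytes.fromhex("1500 1514 1514 2c 150a 1500 1508 1508 00 00")
print(describe_page_header(sample))
```

For a real file you would seek to the column chunk's first page (the dictionary_page_offset if there is one, otherwise the data_page_offset, both from the footer's ColumnMetaData), decode a header there, skip compressed_page_size bytes, and repeat for the next page.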

I've noticed that in Overture Maps Parquet files, which are also written with 
parquet-mr (aka parquet-java), BIT_PACKED is stated as the encoding for the 
definition levels when the definition levels are empty. This might be an 
encoding artifact, but I haven't confirmed it. Since 
definition_level_encoding is a required metadata field, the writer has to put 
some value there even when there are no levels to encode, so this is 
certainly a possibility.

[1] From the Parquet Thrift definition (parquet.thrift):
```thrift
/** Data page header */
struct DataPageHeader {
  /**
   * Number of values, including NULLs, in this data page.
   *
   * If a OffsetIndex is present, a page must begin at a row
   * boundary (repetition_level = 0). Otherwise, pages may begin
   * within a row (repetition_level > 0).
   **/
  1: required i32 num_values

  /** Encoding used for this data page **/
  2: required Encoding encoding

  /** Encoding used for definition levels **/
  3: required Encoding definition_level_encoding;

  /** Encoding used for repetition levels **/
  4: required Encoding repetition_level_encoding;

  /** Optional statistics for the data in this page **/
  5: optional Statistics statistics;
}
```

GitHub link: 
https://github.com/apache/arrow/discussions/47113#discussioncomment-14314874
