GitHub user HighpassStudio added a comment to the discussion: Will order be 
preserved when writing/reading a parquet file with ordered dictionaries?

This kind of question makes me think about how much behavior we rely on from 
the underlying format vs how data is packaged.

Will categories always be read back in the same order?
- With PyArrow reading a file written by PyArrow, often yes, because the stored 
Arrow schema metadata lets the reader reconstruct the column more faithfully. 
But I would not treat this as a universal Parquet guarantee, especially across 
different readers/writers.

If the file has multiple row groups, will each row group dictionary be the same?
- No. In Parquet, dictionary pages are per column chunk / per row group, so 
the dictionaries may differ between row groups, and each may omit categories 
that are absent from its row group.

If you need a rock-solid guarantee of category order across files/readers, try 
storing the category list explicitly in metadata or a companion schema artifact.
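
As a rough illustration of the "companion schema artifact" idea, here is a 
minimal sketch that keeps the canonical category order in a JSON sidecar file 
next to the data file. The sidecar naming convention and the helper names 
(sidecar_path, save_category_order, load_category_order, encode) are 
hypothetical, chosen only for this example, and no Parquet data is actually 
written; the data path is just a placeholder.

```python
import json
import tempfile
from pathlib import Path


def sidecar_path(data_path):
    """Hypothetical convention: sizes.parquet -> sizes.parquet.categories.json"""
    data_path = Path(data_path)
    return data_path.with_name(data_path.name + ".categories.json")


def save_category_order(data_path, column_categories):
    """Write {column: [categories in canonical order]} next to the data file."""
    path = sidecar_path(data_path)
    path.write_text(json.dumps(column_categories, indent=2), encoding="utf-8")
    return path


def load_category_order(data_path):
    """Read the mapping back; JSON arrays keep their element order."""
    return json.loads(sidecar_path(data_path).read_text(encoding="utf-8"))


def encode(values, categories):
    """Map values to integer codes using the canonical order.

    Raises KeyError if a value is not in the category list.
    """
    index = {category: code for code, category in enumerate(categories)}
    return [index[value] for value in values]


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder path: only the sidecar is written in this sketch.
    data_file = Path(tmp) / "sizes.parquet"
    save_category_order(data_file, {"size": ["low", "medium", "high"]})

    order = load_category_order(data_file)["size"]
    codes = encode(["high", "low", "medium"], order)
    print(order, codes)
```

A reader that does not trust the dictionary order it gets back from a file can 
re-map values against the stored list, so the order no longer depends on which 
writer or reader produced the dictionaries.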

One thing I’ve run into is that once data leaves formats like Parquet and gets 
bundled into archives (zip/tar/etc.), we lose a lot of these guarantees around 
selective reads and ordering. You often end up decompressing everything just to 
access one piece.

Are people just avoiding archives entirely in these workflows, or is there a 
pattern for preserving efficient access once data is packaged?

GitHub link: 
https://github.com/apache/arrow/discussions/49508#discussioncomment-16233430
