GitHub user nlgranger edited a discussion: How to find the best write options for reading file bytes?

# TLDR

I am struggling to find an optimal configuration for the `ParquetWriter` when the
final goal is to read random rows from the resulting dataset.

The data is:
- one column of file names as strings
- one column containing the bytes of an image file, so up to a few hundred kiB.
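
For reference, here is a minimal sketch of the assumed layout in pyarrow; the column names `filename` and `image_bytes` are just placeholders:

```python
import pyarrow as pa

# Placeholder schema: one string column for file names and one binary column
# holding the raw, already-compressed image bytes (up to a few hundred kiB each).
schema = pa.schema([
    ("filename", pa.string()),
    ("image_bytes", pa.binary()),
])

table = pa.table(
    {
        "filename": ["img_0001.jpg", "img_0002.jpg"],
        "image_bytes": [b"\xff\xd8...", b"\xff\xd8..."],  # dummy bytes
    },
    schema=schema,
)
```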

# What I tried

I noticed that datasets distributed as Parquet files are typically ill-suited for
fast random row reads, notably a few I have tested from Hugging Face.

Going through the `ParquetWriter` options, here are some that might need
adjusting (a writer configuration sketch follows the list):
- **Row group size:** for random access there is no need to make it too big, but
making it too small can also hurt. I assume a small group size slows down locating
a row in the list of row group statistics.
- **Page size:** since the pyarrow reader does not support the page-level index,
what is the point of having multiple pages per row group?
- **Compression:** this is black magic; sometimes enabling it helps, sometimes it
hurts performance. It seems better to disable it and store already-compressed
data in the row. Also, should the columns used for filtering never be
compressed?
- **Sorting columns:** does sorting have any effect on performance in practice?
- **Bloom filters:** are they supported in pyarrow?
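
To make the discussion concrete, here is a hedged sketch of how these knobs can be set with pyarrow, reusing the placeholder schema above. The values are guesses to benchmark rather than recommendations, and `sorting_columns` / `SortingColumn` assume a recent pyarrow (15.0 or later); it only declares a sort order in the metadata, it does not sort the data.

```python
import pyarrow.parquet as pq

writer = pq.ParquetWriter(
    "images.parquet",
    schema,
    # Per-column compression: a light codec for the small string column,
    # nothing for the already-compressed image bytes.
    compression={"filename": "zstd", "image_bytes": "none"},
    # One large page per column chunk, since the page index is not used here.
    data_page_size=64 * 1024 * 1024,
    # Statistics only where they can help prune row groups.
    write_statistics=["filename"],
    # Declares (does not apply) a sort order on column 0 (filename); pyarrow >= 15.
    sorting_columns=[pq.SortingColumn(0)],
)

# Small-ish row groups so a single random row only drags in a limited amount of data.
writer.write_table(table, row_group_size=1024)
writer.close()
```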

**Could you share some recommendations or guidelines to optimize random row
reads?**

*(Also, why are `Dataset.take()` and `Table.take()` so damn slow?)*
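
For context, the access pattern I am after looks roughly like the sketch below: map a global row index to the row group that contains it, read only that group, and slice the row out, instead of calling `take()` on the whole table. File and column names match the placeholder schema above.

```python
import pyarrow.parquet as pq

def read_row(pf: pq.ParquetFile, index: int) -> dict:
    """Return the row at global position `index`, touching a single row group."""
    remaining = index
    for group in range(pf.metadata.num_row_groups):
        n = pf.metadata.row_group(group).num_rows
        if remaining < n:
            chunk = pf.read_row_group(group, columns=["filename", "image_bytes"])
            return chunk.slice(remaining, 1).to_pylist()[0]
        remaining -= n
    raise IndexError(index)

pf = pq.ParquetFile("images.parquet")
row = read_row(pf, 1)  # e.g. the second row of the file
```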

# References:

- https://github.com/waymo-research/waymo-open-dataset/issues/856
- https://huggingface.co/docs/hub/en/datasets-streaming#efficient-random-access

GitHub link: https://github.com/apache/arrow/discussions/48940
