GitHub user nlgranger edited a discussion: How to find the best write options for random row reads?

I am struggling to find an optimal configuration for the `ParquetWriter` when the final goal is to read random rows from the resulting dataset. The data from my use-case is fairly "big": each row contains the bytes of an image file in one of its columns, so a row can weigh up to a few hundred kiB.
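
For concreteness, the schema looks roughly like this (the column names are illustrative, not the real ones):

```python
import pyarrow as pa

# Illustrative schema: small metadata columns plus one large binary payload
# holding the already-encoded image bytes (up to a few hundred kiB per row).
schema = pa.schema([
    ("sample_id", pa.int64()),
    ("label", pa.string()),
    ("image", pa.binary()),
])
```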

I noticed that datasets distributed as Parquet files are typically ill suited for fast random row reads, notably a few I have tested from Hugging Face. But when fiddling with encoding parameters (compression, row group size, etc.), it appears possible to achieve fast random access. However, there is no guide on how to achieve this **consistently**.

Going through the options of the `ParquetWriter`, here are some that might need adjusting (a sketch of what I currently pass is shown after this list):
- **Row group size:** for random access there is no need to make it too big, but making it too small can also hurt. I assume a very small row group size slows down the search for a row in the list of row group statistics.
- **Page size:** since the pyarrow reader does not support the page-level index, what is the point of having multiple pages per row group?
- **Compression:** this is black magic; sometimes enabling it helps, sometimes it breaks performance. It seems better to disable it and store already-compressed data in the rows. Also, should the columns used for filtering never be compressed?
- **Sorting columns:** does sorting have any effect on performance in practice?
- **Bloom filters:** are they supported?
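
For reference, this is roughly what I write today; the values are guesses on my part, which is exactly what I am asking about:

```python
import pyarrow.parquet as pq

# Rough sketch of my current writer configuration (values are guesses).
writer = pq.ParquetWriter(
    "dataset.parquet",
    schema,
    # Keep the already-compressed image bytes uncompressed, compress the rest.
    compression={"sample_id": "none", "label": "zstd", "image": "none"},
    use_dictionary=False,            # dictionary encoding seems pointless for unique blobs
    write_statistics=["sample_id"],  # statistics only on the column used for lookups
    data_page_size=1 << 20,          # 1 MiB pages, unclear if this matters here
    # sorting_columns=[pq.SortingColumn(0)],  # does declaring the sort order help at all?
)

for table in tables:                 # `tables` = an iterable of pyarrow Tables (not shown)
    writer.write_table(table, row_group_size=128)  # smallish row groups for random access
writer.close()
```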

Could you share some recommendations or guidelines to optimize for random row reads?

Also, why are `Dataset.take()` and `Table.take()` so damn slow?
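
For comparison, this is the kind of manual lookup I benchmark `take()` against: a sketch that locates the row group holding a given row index from the file metadata and reads only that group.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("dataset.parquet")

def read_row(index):
    """Read a single row by walking the row-group sizes in the file metadata."""
    remaining = index
    for g in range(pf.metadata.num_row_groups):
        n = pf.metadata.row_group(g).num_rows
        if remaining < n:
            # Read just this row group and slice out the one row we want.
            return pf.read_row_group(g).slice(remaining, 1)
        remaining -= n
    raise IndexError(index)

row = read_row(12345)
```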

References:
- https://github.com/waymo-research/waymo-open-dataset/issues/856
- https://huggingface.co/docs/hub/en/datasets-streaming#efficient-random-access

GitHub link: https://github.com/apache/arrow/discussions/48940
