GitHub user nlgranger edited a discussion: How to find the best write options for random row reads?
I am struggling to find the optimal configuration for the `ParquetWriter` when the final goal is to read random rows from the resulting dataset. The data from my use case are fairly "big": one column holds the bytes of an image file, so up to a few hundred kiB per row. I have noticed that datasets distributed as Parquet files are typically ill-suited for fast random row reads, notably a few I have tested from Hugging Face. Yet when fiddling with encoding parameters (compression, row group size, etc.), it appears possible to achieve fast random access. However, there is no guide on how to achieve this **consistently**.

Going through the options of the `ParquetWriter`, here are some that might need adjusting:

- **Row group size:** for random access there is no need to make it too big, but too small can also be bad. I assume a small group size slows down the search for a row in the list of row group statistics.
- **Page size:** since the pyarrow reader does not support the page-level index, what is the point of having multiple pages per row group?
- **Compression:** this is black magic; sometimes enabling it works, sometimes it breaks performance. It seems better to disable it and store already-compressed data in the rows. Also, should the columns used for filtering never be compressed?
- **Sorting columns:** does sorting have any effect on performance in practice?
- **Bloom filters:** are they supported?

Could you share some recommendations or guidelines to optimize random row reads? Also, why are `Dataset.take()` and `Table.take()` so damn slow? Two rough sketches of what I mean (the writer options and a manual row lookup) are appended after the references.

References:
- https://github.com/waymo-research/waymo-open-dataset/issues/856
- https://huggingface.co/docs/hub/en/datasets-streaming#efficient-random-access

GitHub link: https://github.com/apache/arrow/discussions/48940
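For concreteness, here is a minimal sketch of the kind of writer configuration I am asking about (the file name, column names, sizes, and the per-column codec split are placeholders; `sorting_columns` and `SortingColumn` need a reasonably recent pyarrow):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table: a small key column used for filtering plus a large binary column
# holding already-compressed image bytes (placeholder payloads here).
table = pa.table({
    "sample_id": pa.array(range(256), type=pa.int64()),
    "image": pa.array([b"\x00" * 100_000] * 256, type=pa.binary()),
})

writer = pq.ParquetWriter(
    "samples.parquet",
    table.schema,
    # Compress only the small key column; leave the already-compressed
    # payload column uncompressed (per-column codecs are accepted).
    compression={"sample_id": "zstd", "image": "none"},
    # Fewer, larger data pages per column chunk; many small pages mostly add
    # overhead when the page index is not used for pruning.
    data_page_size=1 << 20,
    # Declare the sort order so readers can trust row-group statistics on
    # "sample_id" (column index 0); the table must actually be sorted.
    sorting_columns=[pq.SortingColumn(0)],
    write_statistics=True,
)
# Smallish row groups bound how much data a single random read has to decode.
writer.write_table(table, row_group_size=64)
writer.close()
```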

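And to illustrate what I mean by a fast random read, a rough sketch of fetching one row by locating its row group from the footer metadata instead of calling `take()` (`read_row` is a hypothetical helper, not an existing API; it assumes a file like the one written above):

```python
import bisect

import pyarrow.parquet as pq


def read_row(path, index, columns=None):
    """Return a 1-row table for the given row position, decoding only its row group."""
    pf = pq.ParquetFile(path)
    md = pf.metadata
    # Cumulative row counts so the target row group can be found by bisection.
    ends, total = [], 0
    for rg in range(md.num_row_groups):
        total += md.row_group(rg).num_rows
        ends.append(total)
    rg = bisect.bisect_right(ends, index)
    start = ends[rg - 1] if rg > 0 else 0
    # Only this row group is read and decompressed.
    return pf.read_row_group(rg, columns=columns).slice(index - start, 1)


row = read_row("samples.parquet", 200, columns=["image"])
```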