Dear all,

I am learning the pyarrow API and the Arrow technology, so first of all I would like to thank you for your work.
From my understanding, pyarrow.Array and pyarrow.RecordBatch are write-once structures: we cannot append data to them. 1/ Is that correct?

I wrote a little script to write data into a Parquet file. The data is a 2D list (a list of rows, each of which is a list of column values, e.g. [['a','b','c'], ['d','e','f']]). The script is here: https://gist.github.com/bioinfornatics/c82398fa22339d34f41b3580c988c308 To achieve this, I kept all intermediate pyarrow structures in memory in order to create a table (a schema and a list of pyarrow arrays). 2/ Is it possible to reach the same goal with a stream, so as not to waste memory and to be able to handle terabytes of data? (A sketch of what I have in mind is below.)

I read these interesting articles: https://www.dremio.com/tuning-parquet/ and https://parquet.apache.org/documentation/latest/, which recommend large row groups (512 MB - 1 GB). 3/ How can I manage row groups so that each one is approximately 1 GB in size? (See the second sketch below.)

4/ When using pyarrow, should the data end up on disk as a Parquet file, or does pyarrow provide its own generic file format as a common data layer? (Third sketch below.)
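For question 2, here is a rough sketch of what I imagine: writing chunk by chunk with pyarrow.parquet.ParquetWriter so that only one chunk of rows is in memory at a time. read_rows() is just a hypothetical stand-in for my real data source, and the chunk size is arbitrary. Is this the idiomatic way?

    import pyarrow as pa
    import pyarrow.parquet as pq

    def read_rows():
        # hypothetical stand-in for my real terabyte-scale row source
        yield ['a', 'b', 'c']
        yield ['d', 'e', 'f']

    schema = pa.schema([('c1', pa.string()),
                        ('c2', pa.string()),
                        ('c3', pa.string())])

    def rows_to_table(rows):
        # transpose the buffered rows into columns and build one small table
        columns = list(zip(*rows))
        return pa.Table.from_arrays(
            [pa.array(list(col), type=pa.string()) for col in columns],
            schema=schema)

    with pq.ParquetWriter('out.parquet', schema) as writer:
        buffer = []
        for row in read_rows():
            buffer.append(row)
            if len(buffer) >= 100000:   # flush a chunk, then free it
                writer.write_table(rows_to_table(buffer))
                buffer = []
        if buffer:                      # flush the last partial chunk
            writer.write_table(rows_to_table(buffer))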
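For question 3, my understanding is that each write_table() call produces at least one row group, and the row_group_size argument caps the number of rows per group, so the only lever seems to be estimating how many rows amount to roughly 1 GB. A second sketch (the 100-bytes-per-row figure is a made-up estimate for my data, not a measured value):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_arrays(
        [pa.array(['a', 'd']), pa.array(['b', 'e']), pa.array(['c', 'f'])],
        names=['c1', 'c2', 'c3'])

    target_bytes = 1 << 30       # aim for ~1 GB per row group
    approx_row_bytes = 100       # guessed average size of one encoded row
    rows_per_group = target_bytes // approx_row_bytes

    pq.write_table(table, 'out.parquet', row_group_size=rows_per_group)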
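For question 4, I also found that pyarrow has its own on-disk format (the Arrow IPC file format). Is that meant as the common data layer, with Parquet for long-term storage? A third sketch, assuming the current API where the file writer lives under pyarrow.ipc:

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays(
        [pa.array(['a', 'd']), pa.array(['b', 'e']), pa.array(['c', 'f'])],
        names=['c1', 'c2', 'c3'])

    # write an Arrow IPC file
    with pa.OSFile('data.arrow', 'wb') as sink:
        with pa.ipc.new_file(sink, batch.schema) as writer:
            writer.write_batch(batch)

    # read it back, memory-mapped so it is not copied into RAM
    with pa.memory_map('data.arrow', 'rb') as source:
        table = pa.ipc.open_file(source).read_all()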
Thanks a lot for your help and for your work on Arrow.

Best regards,
Jonathan