Dear all,

I am learning the pyarrow API and the Arrow technology, so first of all I would like to thank you for your work.
From my understanding, pyarrow.Array and pyarrow.RecordBatch are write-once structures: we cannot append data to them. 1/ Is that correct?

I wrote a little script to write data into a Parquet file. The data is a 2D list (a list of rows, each of which is a list of column values, e.g. [['a','b','c'], ['d','e','f']]). The script is here: https://gist.github.com/bioinfornatics/c82398fa22339d34f41b3580c988c308 To achieve this, I kept all intermediate pyarrow structures in memory in order to create a table (a schema and a list of pyarrow arrays). 2/ Is it possible to reach the same goal with a stream, so as not to waste memory and to be able to handle terabytes of data? (A sketch of what I have in mind is below.)

I read these interesting articles: https://www.dremio.com/tuning-parquet/ and https://parquet.apache.org/documentation/latest/, which recommend large row groups (512 MB - 1 GB). 3/ How can I manage row groups so that each one is approximately 1 GB in size? (See the second sketch below.)

4/ When using pyarrow, should the data end up on disk as a Parquet file, or does pyarrow provide its own generic file format as a common data layer? (Third sketch below.)
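For question 2, here is a rough sketch of what I imagine: writing chunk by chunk with pyarrow.parquet.ParquetWriter so that only one chunk of rows is in memory at a time. read_rows() is just a hypothetical stand-in for my real data source, and the chunk size is arbitrary. Is this the idiomatic way?

    import pyarrow as pa
    import pyarrow.parquet as pq

    def read_rows():
        # hypothetical stand-in for my real terabyte-scale row source
        yield ['a', 'b', 'c']
        yield ['d', 'e', 'f']

    schema = pa.schema([('c1', pa.string()),
                        ('c2', pa.string()),
                        ('c3', pa.string())])

    def rows_to_table(rows):
        # transpose the buffered rows into columns and build one small table
        columns = list(zip(*rows))
        return pa.Table.from_arrays(
            [pa.array(list(col), type=pa.string()) for col in columns],
            schema=schema)

    with pq.ParquetWriter('out.parquet', schema) as writer:
        buffer = []
        for row in read_rows():
            buffer.append(row)
            if len(buffer) >= 100000:   # flush a chunk, then free it
                writer.write_table(rows_to_table(buffer))
                buffer = []
        if buffer:                      # flush the last partial chunk
            writer.write_table(rows_to_table(buffer))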
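For question 3, my understanding is that each write_table() call produces at least one row group, and the row_group_size argument caps the number of rows per group, so the only lever seems to be estimating how many rows amount to roughly 1 GB. A second sketch (the 100-bytes-per-row figure is a made-up estimate for my data, not a measured value):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_arrays(
        [pa.array(['a', 'd']), pa.array(['b', 'e']), pa.array(['c', 'f'])],
        names=['c1', 'c2', 'c3'])

    target_bytes = 1 << 30       # aim for ~1 GB per row group
    approx_row_bytes = 100       # guessed average size of one encoded row
    rows_per_group = target_bytes // approx_row_bytes

    pq.write_table(table, 'out.parquet', row_group_size=rows_per_group)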
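For question 4, I also found that pyarrow has its own on-disk format (the Arrow IPC file format). Is that meant as the common data layer, with Parquet for long-term storage? A third sketch, assuming the current API where the file writer lives under pyarrow.ipc:

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays(
        [pa.array(['a', 'd']), pa.array(['b', 'e']), pa.array(['c', 'f'])],
        names=['c1', 'c2', 'c3'])

    # write an Arrow IPC file
    with pa.OSFile('data.arrow', 'wb') as sink:
        with pa.ipc.new_file(sink, batch.schema) as writer:
            writer.write_batch(batch)

    # read it back, memory-mapped so it is not copied into RAM
    with pa.memory_map('data.arrow', 'rb') as source:
        table = pa.ipc.open_file(source).read_all()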
Thanks a lot for your help and for your work on Arrow.

Best regards,
Jonathan