Hi,

I am experimenting with:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pq.read_table(source, memory_map=True)
    mem_bytes = pa.total_allocated_bytes()

I have observed that mem_bytes is about the size of the Parquet file on disk. If I remove the assignment and execute

    pq.read_table(source, memory_map=True)
    mem_bytes = pa.total_allocated_bytes()

then mem_bytes is 0.

Environment: Ubuntu 16, Python 2.7.17, pyarrow 0.16.0 installed with pip. The Parquet file was made by putting 4 numpy arrays of doubles into an Arrow table and saving that table to Parquet with the write_table function.

My goal is to read the Parquet file into a memory-mapped table and then process it one record batch at a time, with:

    batches = table.to_batches()
    for batch in batches:
        # do something with the batch, then save it to disk
        pass

At present I am able to load a Parquet file into an Arrow table, split it into batches, add columns, and then write each RecordBatch to a Parquet file, but the read_table function seems to load all of the data into memory.

Is there a way to load a Parquet file into an in-memory table one record batch at a time? Or to just stream RecordBatches from a Parquet file without loading all of its content into memory?

Thanks in advance,
Filippo Medri
