A Parquet file has to be deserialized into Arrow format -- memory mapping is not possible.
A goal of the Datasets framework, in active development, is to provide a batch-based iterator interface which will enable processing files (e.g. converting them into Arrow IPC files, which can be memory mapped) in a memory-constrained fashion. In the meantime, one possible workaround is sketched after your quoted message below.

On Thu, Feb 27, 2020 at 4:14 PM filippo medri <filippo.me...@gmail.com> wrote:
>
> Hi,
> experimenting with:
>
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pq.read_table(source, memory_map=True)
> mem_bytes = pa.total_allocated_bytes()
>
> I have observed that mem_bytes is about the size of the parquet file on disk.
> If I remove the assignment and execute
>
> pq.read_table(source, memory_map=True)
> mem_bytes = pa.total_allocated_bytes()
>
> mem_bytes is 0.
>
> The environment is Ubuntu 16, Python 2.7.17, pyarrow 0.16.0 installed with
> pip install. The parquet file is made by saving 4 numpy arrays of doubles
> to an arrow table and then writing it to parquet with the write_table
> function.
>
> My goal is to read the parquet file into a memory-mapped table and then
> read it a record batch at a time, with:
>
> batches = table.to_batches()
> for batch in batches:
>     # do something with the batch, then save it to disk
>
> At present I am able to load a parquet file into an arrow table, split it
> into batches, add columns, and then write each RecordBatch to a parquet
> file, but the read_table function seems to be loading all data into memory.
>
> Is there a way to load a parquet file into an in-memory table a record
> batch at a time? Or just stream RecordBatches from a parquet file without
> loading all the content into memory?
>
> Thanks in advance,
> Filippo Medri
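
Here is that workaround sketch (untested; 'example.parquet' and 'example.arrow' are placeholder paths). The idea is to read the Parquet file one row group at a time with pq.ParquetFile, write the resulting record batches into an Arrow IPC file, then memory map that file and iterate over its batches, so peak memory is bounded by roughly one row group rather than the whole file:

import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder paths -- substitute your own files.
parquet_path = 'example.parquet'
ipc_path = 'example.arrow'

pf = pq.ParquetFile(parquet_path)
schema = pf.schema.to_arrow_schema()

# Convert the Parquet file into an Arrow IPC file one row group at a
# time, so only a single row group is decoded into memory at once.
with pa.OSFile(ipc_path, 'wb') as sink:
    writer = pa.RecordBatchFileWriter(sink, schema)
    for i in range(pf.num_row_groups):
        row_group = pf.read_row_group(i)
        for batch in row_group.to_batches():
            writer.write_batch(batch)
    writer.close()

# The IPC file can be memory mapped and iterated batch by batch
# without materializing the whole table.
source = pa.memory_map(ipc_path, 'r')
reader = pa.ipc.open_file(source)
for i in range(reader.num_record_batches):
    batch = reader.get_batch(i)
    # do something with the batch

Note that this only helps if the Parquet file actually contains multiple row groups; if it was written as one large row group, you would need to rewrite it with a smaller row_group_size in write_table first.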