A Parquet file has to be deserialized into Arrow format -- memory mapping is not possible.
A goal of the Datasets framework, in active development, is to provide a batch-based iterator interface which will enable processing files (e.g. converting them into Arrow IPC files, which can be memory mapped) in a memory-constrained fashion. In the meantime, one possible workaround is sketched after your quoted message below.

On Thu, Feb 27, 2020 at 4:14 PM filippo medri <filippo.me...@gmail.com> wrote:
>
> Hi,
> experimenting with:
>
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pq.read_table(source, memory_map=True)
> mem_bytes = pa.total_allocated_bytes()
>
> I have observed that mem_bytes is about the size of the parquet file on disk.
> If I remove the assignment and execute
>
> pq.read_table(source, memory_map=True)
> mem_bytes = pa.total_allocated_bytes()
>
> mem_bytes is 0.
>
> The environment is Ubuntu 16, Python 2.7.17, pyarrow 0.16.0 installed with
> pip install. The parquet file is made by saving 4 numpy arrays of doubles
> to an arrow table and then writing it to parquet with the write_table
> function.
>
> My goal is to read the parquet file into a memory-mapped table and then
> read it a record batch at a time, with:
>
> batches = table.to_batches()
> for batch in batches:
>     # do something with the batch, then save it to disk
>
> At present I am able to load a parquet file into an arrow table, split it
> into batches, add columns, and then write each RecordBatch to a parquet
> file, but the read_table function seems to be loading all data into memory.
>
> Is there a way to load a parquet file into an in-memory table a record
> batch at a time? Or just stream RecordBatches from a parquet file without
> loading all the content into memory?
>
> Thanks in advance,
> Filippo Medri
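
Here is that workaround sketch (untested; 'example.parquet' and 'example.arrow' are placeholder paths). The idea is to read the Parquet file one row group at a time with pq.ParquetFile, write the resulting record batches into an Arrow IPC file, then memory map that file and iterate over its batches, so peak memory is bounded by roughly one row group rather than the whole file:

import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder paths -- substitute your own files.
parquet_path = 'example.parquet'
ipc_path = 'example.arrow'

pf = pq.ParquetFile(parquet_path)
schema = pf.schema.to_arrow_schema()

# Convert the Parquet file into an Arrow IPC file one row group at a
# time, so only a single row group is decoded into memory at once.
with pa.OSFile(ipc_path, 'wb') as sink:
    writer = pa.RecordBatchFileWriter(sink, schema)
    for i in range(pf.num_row_groups):
        row_group = pf.read_row_group(i)
        for batch in row_group.to_batches():
            writer.write_batch(batch)
    writer.close()

# The IPC file can be memory mapped and iterated batch by batch
# without materializing the whole table.
source = pa.memory_map(ipc_path, 'r')
reader = pa.ipc.open_file(source)
for i in range(reader.num_record_batches):
    batch = reader.get_batch(i)
    # do something with the batch

Note that this only helps if the Parquet file actually contains multiple row groups; if it was written as one large row group, you would need to rewrite it with a smaller row_group_size in write_table first.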