For the record, as I'm fairly sure I've stated elsewhere, I don't agree with toggling memory mapping at the filesystem level. If a filesystem supports memory mapping, then IMHO a consumer of the filesystem should be able to request a memory map.
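
What I have in mind is the consumer opting in per read rather than per filesystem. For a single uncompressed Feather (V2) file, which is just the Arrow IPC file format on disk, that already looks something like the sketch below (the file name is hypothetical):

    import pyarrow as pa

    # The caller, not the filesystem, decides to map the file. The resulting
    # table's buffers reference the mapping instead of freshly allocated memory.
    source = pa.memory_map('demo/part-0.feather', 'r')  # hypothetical file name
    t = pa.ipc.open_file(source).read_all()
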
On Thu, Apr 30, 2020 at 2:27 AM Joris Van den Bossche <[email protected]> wrote:
>
> Hi Dan,
>
> Currently, the memory mapping in the Datasets API is controlled by the
> filesystem. So to enable memory mapping for feather, you can do:
>
> import pyarrow.dataset as ds
> from pyarrow.fs import LocalFileSystem
>
> fs = LocalFileSystem(use_mmap=True)
> t = ds.dataset('demo', format='feather', filesystem=fs).to_table()
>
> Can you check whether that works for you?
> We should document this better (and there is actually also some discussion
> about the best API for this, see
> https://issues.apache.org/jira/browse/ARROW-8156,
> https://issues.apache.org/jira/browse/ARROW-8307)
>
> Joris
>
> On Thu, 30 Apr 2020 at 01:58, Daniel Nugent <[email protected]> wrote:
>>
>> Hi,
>>
>> I'm trying to use the 0.17 Dataset API to map in an Arrow table in the
>> uncompressed Feather format (ultimately hoping to work with data larger
>> than memory). It seems to read all the constituent files into memory
>> before creating the Arrow table object, though.
>>
>> When I use the FeatherDataset API, it does appear to map the files, and
>> the Table is created from the mapped data.
>>
>> Any hints at what I'm doing wrong? I didn't see any options relating to
>> memory mapping for the general datasets.
>>
>> Here's the code for the plain Dataset API call:
>>
>> from pyarrow.dataset import dataset as ds
>> t = ds('demo', format='feather').to_table()
>>
>> Here's the code for reading using the FeatherDataset API:
>>
>> from pyarrow.feather import FeatherDataset as ds
>> from pathlib import Path
>> t = ds(list(Path('demo').iterdir())).read_table()
>>
>> Thanks!
>>
>> -Dan Nugent
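
P.S. For anyone following along: one way to check that the use_mmap=True suggestion above is actually mapping rather than copying is to watch the Arrow memory pool, since file-backed buffers are not counted there. A small sketch, assuming the same local 'demo' directory of uncompressed Feather files:

    import pyarrow as pa
    import pyarrow.dataset as ds
    from pyarrow.fs import LocalFileSystem

    # Read the dataset through a memory-mapping local filesystem and report how
    # much the default memory pool allocated; mapped buffers should keep it small.
    before = pa.total_allocated_bytes()
    t = ds.dataset('demo', format='feather',
                   filesystem=LocalFileSystem(use_mmap=True)).to_table()
    print('pool bytes allocated:', pa.total_allocated_bytes() - before)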
