Re: 'Plain' Dataset Python API doesn't memory map?

Daniel Nugent Sun, 03 May 2020 00:11:18 -0700

Thanks Joris. That did the trick.


-Dan Nugent
On Apr 30, 2020, 10:01 -0400, Wes McKinney <[email protected]>, wrote:
> For the record (I'm fairly sure I've stated this elsewhere), I don't
> agree with toggling memory mapping at the filesystem level. If a
> filesystem supports memory mapping, then a consumer of the filesystem
> should IMHO be able to request a memory map.
>
> On Thu, Apr 30, 2020 at 2:27 AM Joris Van den Bossche
> <[email protected]> wrote:
> >
> > Hi Dan,
> >
> > Currently, the memory mapping in the Datasets API is controlled by the 
> > filesystem. So to enable memory mapping for feather, you can do:
> >
> > import pyarrow.dataset as ds
> > from pyarrow.fs import LocalFileSystem
> >
> > fs = LocalFileSystem(use_mmap=True)
> > t = ds.dataset('demo', format='feather', filesystem=fs).to_table()
> >
> > Can you check whether that works for you?
> > We should document this better (there is also some discussion about the
> > best API for this; see
> > https://issues.apache.org/jira/browse/ARROW-8156 and
> > https://issues.apache.org/jira/browse/ARROW-8307).
> >
> > Joris
> >
> > On Thu, 30 Apr 2020 at 01:58, Daniel Nugent <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > I'm trying to use the 0.17 dataset API to map in an arrow table in the 
> > > uncompressed feather format (ultimately hoping to work with data larger 
> > > than memory). It seems like it reads all the constituent files into 
> > > memory before creating the Arrow table object though.
> > >
> > > > When I use the FeatherDataset API, it does appear to memory-map the
> > > > files, and the Table is created from the mapped data.
> > >
> > > > Any hints at what I'm doing wrong? I didn't see any options relating to
> > > > memory mapping in the general datasets API.
> > >
> > > Here's the code for the plain dataset api call:
> > >
> > > from pyarrow.dataset import dataset as ds
> > > > t = ds('demo', format='feather').to_table()
> > >
> > > Here's the code for reading using the FeatherDataset api:
> > >
> > > from pyarrow.feather import FeatherDataset as ds
> > > from pathlib import Path
> > > t = ds(list(Path('demo').iterdir())).read_table()
> > >
> > > Thanks!
> > >
> > > -Dan Nugent
