For the record, as I'm fairly sure I've stated elsewhere, I don't agree with toggling memory mapping at the filesystem level. If a filesystem supports memory mapping, then IMHO a consumer of the filesystem should be able to request a memory map.
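
What I have in mind is the consumer opting in per read rather than per filesystem. For a single uncompressed Feather (V2) file, which is just the Arrow IPC file format on disk, that already looks something like the sketch below (the file name is hypothetical):

    import pyarrow as pa

    # The caller, not the filesystem, decides to map the file. The resulting
    # table's buffers reference the mapping instead of freshly allocated memory.
    source = pa.memory_map('demo/part-0.feather', 'r')  # hypothetical file name
    t = pa.ipc.open_file(source).read_all()
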
On Thu, Apr 30, 2020 at 2:27 AM Joris Van den Bossche <[email protected]> wrote:
>
> Hi Dan,
>
> Currently, the memory mapping in the Datasets API is controlled by the
> filesystem. So to enable memory mapping for feather, you can do:
>
> import pyarrow.dataset as ds
> from pyarrow.fs import LocalFileSystem
>
> fs = LocalFileSystem(use_mmap=True)
> t = ds.dataset('demo', format='feather', filesystem=fs).to_table()
>
> Can you check whether that works for you?
> We should document this better (and there is actually also some discussion
> about the best API for this, see
> https://issues.apache.org/jira/browse/ARROW-8156,
> https://issues.apache.org/jira/browse/ARROW-8307)
>
> Joris
>
> On Thu, 30 Apr 2020 at 01:58, Daniel Nugent <[email protected]> wrote:
>>
>> Hi,
>>
>> I'm trying to use the 0.17 Dataset API to map in an Arrow table in the
>> uncompressed Feather format (ultimately hoping to work with data larger
>> than memory). It seems to read all the constituent files into memory
>> before creating the Arrow table object, though.
>>
>> When I use the FeatherDataset API, it does appear to map the files, and
>> the Table is created from the mapped data.
>>
>> Any hints at what I'm doing wrong? I didn't see any options relating to
>> memory mapping for the general datasets.
>>
>> Here's the code for the plain Dataset API call:
>>
>> from pyarrow.dataset import dataset as ds
>> t = ds('demo', format='feather').to_table()
>>
>> Here's the code for reading using the FeatherDataset API:
>>
>> from pyarrow.feather import FeatherDataset as ds
>> from pathlib import Path
>> t = ds(list(Path('demo').iterdir())).read_table()
>>
>> Thanks!
>>
>> -Dan Nugent
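
P.S. For anyone following along: one way to check that the use_mmap=True suggestion above is actually mapping rather than copying is to watch the Arrow memory pool, since file-backed buffers are not counted there. A small sketch, assuming the same local 'demo' directory of uncompressed Feather files:

    import pyarrow as pa
    import pyarrow.dataset as ds
    from pyarrow.fs import LocalFileSystem

    # Read the dataset through a memory-mapping local filesystem and report how
    # much the default memory pool allocated; mapped buffers should keep it small.
    before = pa.total_allocated_bytes()
    t = ds.dataset('demo', format='feather',
                   filesystem=LocalFileSystem(use_mmap=True)).to_table()
    print('pool bytes allocated:', pa.total_allocated_bytes() - before)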
