pyarrow dataset API

Chang She Tue, 01 Nov 2022 16:25:51 -0700

Hi there,

The pyarrow dataset API is marked experimental, so I'm curious whether y'all
have made any decisions on it for upcoming releases. Specifically, any
thoughts on making the Scanner and things like FileSystemDataset part of the
"public API" (i.e., putting declarations in the _dataset.pxd)? It would make
it a lot easier for new data formats to be built on top of the Arrow
platform.

For example, Lance supports efficient partial reads from S3 for limit/offset
(via additional ScanOptions), but currently it's difficult to expose the
scanner to the rest of Arrow. Instead, we subclass Dataset and return a
custom scanner we created. Our Dataset subclass *should* be a
FileSystemDataset subclass, but FileSystemDataset is not "public API" either.

Happy to discuss additional details. For reference: github.com/eto-ai/lance
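
To make the limit/offset case concrete, here is a minimal sketch in plain
Python (not Lance's or Arrow's actual code) of why a scanner benefits from
knowing offset and limit up front: given the row count of each row group, it
can work out which groups to touch and skip the rest entirely. The function
name plan_partial_read and the row-group sizes are hypothetical, made up for
illustration.

```python
from typing import List, Tuple


def plan_partial_read(
    group_sizes: List[int], offset: int, limit: int
) -> List[Tuple[int, int, int]]:
    """Return (group_index, start_row, num_rows) for each row group to read.

    Hypothetical helper: groups that fall entirely before the offset or
    after the limit is satisfied never appear in the plan, so they would
    not need to be fetched from storage at all.
    """
    if offset < 0 or limit < 0:
        raise ValueError("offset and limit must be non-negative")

    plan = []
    remaining = limit
    group_start = 0  # global index of the first row in the current group
    for index, size in enumerate(group_sizes):
        if remaining == 0:
            break  # limit satisfied, skip all later groups
        group_end = group_start + size
        if group_end > offset:
            # Row within this group where reading starts; 0 unless the
            # offset lands inside this group.
            start = max(offset - group_start, 0)
            count = min(size - start, remaining)
            if count > 0:
                plan.append((index, start, count))
                remaining -= count
        group_start = group_end
    return plan


# Three hypothetical row groups of 100 rows each; skip 150 rows, take 100.
print(plan_partial_read([100, 100, 100], offset=150, limit=100))
# prints [(1, 50, 50), (2, 0, 50)]
```

In this example the first row group is never read. A scanner that only sees
limit/offset after the fact would have to read and discard it instead.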


Thanks!

Chang
