pyarrow filesystem interface for Azure Data Lake gen2

Robin Kåveland Hansen Thu, 03 Sep 2020 01:45:58 -0700

Hi,

We use Azure Data Lake gen2 heavily at work, and with 1.0 including
pyarrow.fs.PyFileSystem it wasn't that hard to add filesystem support
for it. My employer was happy to let me release it, so I'm getting it
out there.


For now, I've published it as a PyPI package:
https://pypi.org/project/pyarrowfs-adlgen2/

If you're not familiar with Azure Data Lake gen2, it's essentially the
same thing as S3 or Azure Blob Storage, but with real file system
support, meaning operations such as renaming directories or listing
directory contents are practically instant. Directory renames are
atomic, unlike with blob storage, where if some blob rename operations
fail, you may be left with only some files being "moved".
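
To make the difference concrete, here is a toy sketch in plain Python. It
makes no Azure calls: the dicts merely stand in for the two kinds of
storage, and the function names and the simulated failure are hypothetical,
chosen only for illustration.

```python
# Toy model only: plain dicts stand in for the two storage services.

def rename_prefix_flat(store, src, dst, fail_on=None):
    """Blob-style "directory rename": move every object under the prefix, one by one."""
    # Materialize the key list first so the dict can be mutated while looping.
    for key in sorted(k for k in store if k.startswith(src + "/")):
        if key == fail_on:
            # Simulate one of the per-blob operations failing midway.
            raise IOError(f"simulated failure while moving {key}")
        store[dst + key[len(src):]] = store.pop(key)


def rename_dir_hierarchical(tree, src, dst):
    """Hierarchical rename: a single operation on the directory entry itself."""
    tree[dst] = tree.pop(src)


# Flat namespace: "directories" are just shared key prefixes.
flat = {"raw/a.parquet": b"1", "raw/b.parquet": b"2", "raw/c.parquet": b"3"}
try:
    rename_prefix_flat(flat, "raw", "done", fail_on="raw/b.parquet")
except IOError:
    pass  # the first file has already moved, the other two have not

# Hierarchical namespace: the directory is a real entry that owns its files.
tree = {"raw": {"a.parquet": b"1", "b.parquet": b"2", "c.parquet": b"3"}}
rename_dir_hierarchical(tree, "raw", "done")
```

After the simulated failure, `flat` holds a mix of `done/` and `raw/` keys,
which is the half-moved state described above. The hierarchical rename
either happens or it doesn't, and its cost doesn't depend on how many
files the directory contains.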

If there's any interest in including this in pyarrow, I'd be happy to
do the work needed to make it fit there, but I'm also OK maintaining
it myself.

I couldn't get this working well for writing datasets, but I think
there's work in progress on supporting pyarrow.fs everywhere in the
parquet codebase?

-- 
Kind regards,
Robin Kåveland
