Hi! Hadoop has built-in support for several so-called HDFS-compatible file systems, including AWS S3, Azure Blob Storage, Azure Data Lake Storage Gen1 and Azure Data Lake Storage Gen2. Using these with hdfs commands requires a little bit of setup in core-site.xml, one of the simplest possible examples being:
  <property>
    <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
    <value>YOUR ACCESS KEY</value>
  </property>

At that point, you can issue commands like:

  hdfs dfs -ls wasbs://[email protected]/

I currently use Spark to access a number of Azure storage accounts, so I already have core-site.xml set up, and I thought I could leverage pyarrow.fs.HadoopFileSystem to interact with these file systems directly instead of having to copy things to local storage first. I'm working with Hive-partitioned datasets, so there's an annoying amount of double work in downloading only the necessary partitions.

Creating a pyarrow.fs.HadoopFileSystem works fine, but it fails with an exception like:

  IllegalArgumentException: Wrong FS: wasbs://..., expected: hdfs://localhost:port

whenever it is given one of the configured paths that isn't under fs.defaultFS. Is there any way of making this work? This validation seems to happen on the Java side of the connection, so maybe there's nothing that can be done in Arrow?

The other option I looked at was extending pyarrow.fs.FileSystem with a class built on the Azure Storage SDK, but after reading the pyarrow code that seems non-trivial, since the filesystem object is passed back to C++ under the hood. I'm also seeing some type checking that suggests you're not supposed to extend this API.

That leaves the option of doing this in C++ using an SDK like https://github.com/Azure/azure-storage-cpplite, which is unfortunately a lot more involved than I was hoping for when I started tumbling down this particular rabbit hole.

--
Kind regards,
Robin Kåveland
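
P.S. In case it's useful, here's roughly what I'm running; the account, container and paths below are just placeholders:

    import pyarrow.fs as pafs
    import pyarrow.dataset as ds

    # host="default" makes libhdfs pick up fs.defaultFS from core-site.xml
    hdfs = pafs.HadoopFileSystem(host="default")

    # Listing something under fs.defaultFS works fine
    print(hdfs.get_file_info(pafs.FileSelector("/user/robin", recursive=True)))

    # This is where it blows up with
    # IllegalArgumentException: Wrong FS: wasbs://..., expected: hdfs://localhost:port
    dataset = ds.dataset(
        "wasbs://[email protected]/path/to/partitioned/dataset",
        filesystem=hdfs,
        partitioning="hive",
    )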
