Hi Robin, I'm not an expert in this area and a lot has changed since I last looked into this, but I believe there was an old PR that aimed to provide a Python implementation [1]; as you noted, it was closed in favor of targeting a C++ implementation. It sounds like you may want more dataset-like functionality, but does the example in the documentation for reading from Azure work for you [2]? I think there are similar APIs for parsing other file types.
Hope this helps.

-Micah

[1] https://github.com/apache/arrow/pull/4121
[2] https://arrow.apache.org/docs/python/parquet.html#reading-a-parquet-file-from-azure-blob-storage

On Fri, May 1, 2020 at 4:49 AM Robin Kåveland Hansen <[email protected]> wrote:
> Hi!
>
> Hadoop has built-in support for several so-called HDFS-compatible file
> systems, including AWS S3, Azure Blob Storage, Azure Data Lake Storage
> and Azure Data Lake Storage gen2. Using these with hdfs commands
> requires a little bit of setup in core-site.xml; one of the simplest
> possible examples is:
>
>   <property>
>     <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
>     <value>YOUR ACCESS KEY</value>
>   </property>
>
> At that point, you can issue commands like:
>
>   hdfs dfs -ls wasbs://[email protected]
>
> I currently use Spark to access a bunch of Azure storage accounts, so I
> already have core-site.xml set up, and I thought I could leverage
> pyarrow.fs.HadoopFileSystem to interact directly with these file
> systems instead of having to put things on local storage first. I'm
> working with hive-partitioned datasets, so there's an annoying amount
> of "double work" in downloading only the necessary partitions.
>
> Creating a pyarrow.fs.HadoopFileSystem works fine, but it fails with an
> exception like:
>
>   IllegalArgumentException: Wrong FS: wasbs://..., expected:
>   hdfs://localhost:port
>
> whenever it is given one of the configured paths that isn't
> fs.defaultFS.
>
> Is there any way of making this work? It looks like this validation is
> happening on the Java side of the connection, so maybe there's nothing
> that can be done in Arrow?
>
> The other option I checked out was to extend pyarrow.fs.FileSystem with
> a class built on the Azure Storage SDK, but after reading the pyarrow
> code, that seems non-trivial, since the object is passed back to C++
> under the hood. I'm also seeing some type checking that seems to
> indicate that you're not supposed to extend this API.
>
> That leaves the option of doing this in C++ using some SDK like
> https://github.com/Azure/azure-storage-cpplite which is unfortunately a
> lot more involved for me than I was hoping for when I started tumbling
> down this particular rabbit hole.
>
> --
> Kind regards,
> Robin Kåveland
