I just commented about this in https://issues.apache.org/jira/browse/ARROW-2034
Our preferred path forward would almost certainly be to build a C++
implementation of the arrow::filesystem::FileSystem interface that deals
with Azure, and then that would be straightforward to hook up with the
Datasets API.

On Wed, May 6, 2020 at 2:58 AM Robin Kåveland Hansen <[email protected]> wrote:
>
> Hi,
>
> You're right, I want dataset functionality. I'm able to read individual
> files into memory and pass them to Arrow just fine, like the example
> from the documentation.
>
> On 3 May 2020 at 00:12:48, Micah Kornfield ([email protected]) wrote:
>
> Hi Robin,
> I'm not an expert in this area and there has been a lot of change since I
> looked into this, but there was an old PR that looked to do a Python
> implementation [1]; as you noted, this was closed in favor of targeting
> a C++ implementation. It sounds like you may want more dataset-like
> functionality, but does the example given for reading from Azure in the
> documentation work for you [2]? I think there are similar APIs for parsing
> other file types.
>
> Hope this helps.
>
> -Micah
>
> [1] https://github.com/apache/arrow/pull/4121
> [2] https://arrow.apache.org/docs/python/parquet.html#reading-a-parquet-file-from-azure-blob-storage
>
> On Fri, May 1, 2020 at 4:49 AM Robin Kåveland Hansen <[email protected]> wrote:
>>
>> Hi!
>>
>> Hadoop has built-in support for several so-called HDFS-compatible file
>> systems, including AWS S3, Azure Blob Storage, Azure Data Lake Storage
>> and Azure Data Lake Storage Gen2. Using these with hdfs commands requires
>> a little bit of setup in core-site.xml, one of the simplest possible
>> examples being:
>>
>>   <property>
>>     <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
>>     <value>YOUR ACCESS KEY</value>
>>   </property>
>>
>> At that point, you can issue commands like:
>>
>>   hdfs dfs -ls wasbs://[email protected]
>>
>> I currently use Spark to access a bunch of Azure storage accounts, so I
>> already have core-site.xml set up and thought to leverage
>> pyarrow.fs.HadoopFileSystem to interact directly with these file systems
>> instead of having to put things on local storage first. I'm working with
>> Hive-partitioned datasets, so there's an annoying amount of "double work"
>> in downloading only the necessary partitions.
>>
>> Creating a pyarrow.fs.HadoopFileSystem works fine, but it fails with an
>> exception like:
>>
>>   IllegalArgumentException: Wrong FS: wasbs://..., expected:
>>   hdfs://localhost:port
>>
>> whenever it is given one of the configured paths that isn't on
>> fs.defaultFS.
>>
>> Is there any way of making this work? It looks like this validation is
>> happening on the Java side of the connection, so maybe there's nothing
>> that can be done in Arrow?
>>
>> The other option I checked out was to extend pyarrow.fs.FileSystem and
>> write a class built on the Azure Storage SDK, but after reading the
>> pyarrow code, that seems non-trivial, since it's being passed back to
>> C++ under the hood. I'm also seeing some type checking that seems to
>> indicate that you're not supposed to extend this API.
>>
>> That leaves the option of doing this in C++ using some SDK like
>> https://github.com/Azure/azure-storage-cpplite, which is unfortunately a
>> lot more involved for me than I was hoping for when I started tumbling
>> down this particular rabbit hole.
>>
>> --
>> Kind regards,
>> Robin Kåveland
>>
> --
> Kind regards,
> Robin Kåveland
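
The single-file workaround Robin mentions (reading individual files into
memory and passing them to Arrow) follows the documentation example
referenced in [2]: fetch the blob into a buffer and hand the bytes to
pyarrow. A minimal sketch of that pattern, assuming the azure-storage-blob
2.x SDK; the account, container and blob names are placeholders:

    # Read one parquet blob from Azure Blob Storage into an Arrow table.
    # Assumes azure-storage-blob 2.x (BlockBlobService); the account,
    # container and blob names below are placeholders.
    from io import BytesIO

    import pyarrow.parquet as pq
    from azure.storage.blob import BlockBlobService

    service = BlockBlobService(account_name="youraccount",
                               account_key="YOUR ACCESS KEY")

    # Download the blob into an in-memory buffer.
    buffer = BytesIO()
    service.get_blob_to_stream(container_name="yourcontainer",
                               blob_name="dataset/year=2020/part-0.parquet",
                               stream=buffer)
    buffer.seek(0)

    # Parse the buffered bytes as parquet.
    table = pq.read_table(buffer)
    print(table.num_rows, table.schema)

This reads one blob at a time, which is exactly the limitation discussed in
the thread: without a filesystem implementation for wasbs:// paths, the
Datasets API has no way to list containers and discover Hive partitions on
its own.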
