Lance Dacey created ARROW-9514:
----------------------------------

             Summary: The new Dataset API will not work with files on Azure Blob (pq.read_table() does work and so does Dask)
                 Key: ARROW-9514
                 URL: https://issues.apache.org/jira/browse/ARROW-9514
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.17.1
         Environment: Ubuntu 18.04
            Reporter: Lance Dacey
             Fix For: 0.17.1
I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False), and my connection to Azure Blob fails. I know the documentation says only HDFS and S3 are implemented, but I have been using Azure Blob by passing an fsspec filesystem when reading and writing Parquet files/datasets with PyArrow (with use_legacy_dataset=True). Dask also works with storage_options. I am hoping that Azure Blob will be supported, because I'd really like to try out the new row filtering and non-Hive partitioning schemes.

This is the filesystem I use with read_table() or write_to_dataset():

fs = fsspec.filesystem(protocol='abfs', account_name=base.login, account_key=base.password)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)