Lance Dacey created ARROW-9514:
----------------------------------

             Summary: The new Dataset API will not work with files on Azure Blob (pq.read_table() does work and so does Dask)
                 Key: ARROW-9514
                 URL: https://issues.apache.org/jira/browse/ARROW-9514
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.17.1
         Environment: Ubuntu 18.04
            Reporter: Lance Dacey
             Fix For: 0.17.1


I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False),
and in both cases my connection to Azure Blob fails.

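Roughly what I am running (the container path and credential values here are
placeholders):

import fsspec
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# placeholder credentials
fs = fsspec.filesystem(protocol='abfs', account_name='<account>', account_key='<key>')

# both of these fail to connect to Azure Blob on 0.17.1:
dataset = ds.dataset('container/dataset', format='parquet', filesystem=fs)
pq.ParquetDataset('container/dataset', filesystem=fs, use_legacy_dataset=False)
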
I know the documentation says only HDFS and S3 are implemented, but I have been
using Azure Blob by passing an fsspec filesystem when reading and writing
parquet files/datasets with PyArrow (with use_legacy_dataset=True). Dask also
works with Azure Blob through storage_options.
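For reference, this is the Dask pattern that works for me (the path and
credential values are placeholders; adlfs handles the 'abfs' protocol):

import dask.dataframe as dd

df = dd.read_parquet(
    'abfs://container/dataset',
    storage_options={'account_name': '<account>', 'account_key': '<key>'},
)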

I am hoping that Azure Blob will be supported because I'd really like to try
out the new row filtering and non-Hive partitioning schemes.
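This is the kind of call I would like to be able to make (the paths and field
names are made up, and fs is the fsspec filesystem shown below):

import pyarrow.dataset as ds

# directory partitioning instead of hive-style key=value paths
part = ds.partitioning(field_names=['year', 'month'])
dataset = ds.dataset('container/dataset', format='parquet',
                     filesystem=fs, partitioning=part)

# push a row filter down into the scan
table = dataset.to_table(filter=ds.field('year') == 2020)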

This is what I use for the filesystem when using read_table() or 
write_to_dataset():

import fsspec

fs = fsspec.filesystem(
    protocol='abfs',
    account_name=base.login,    # storage account name from my connection object
    account_key=base.password,  # storage account key
)
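Reading and writing with that filesystem then looks like this (the paths and
the partition column are placeholders):

import pyarrow.parquet as pq

# works today with the legacy code path
table = pq.read_table('container/dataset', filesystem=fs)

# writing a hive-partitioned dataset also works
pq.write_to_dataset(table, 'container/dataset',
                    filesystem=fs, partition_cols=['date'])
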
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
