[ https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-8964:
--------------------------------
    Summary: [Python][Parquet] improve reading of partitioned parquet datasets 
whose schema changed  (was: Pyarrow: improve reading of partitioned parquet 
datasets whose schema changed)

> [Python][Parquet] improve reading of partitioned parquet datasets whose 
> schema changed
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-8964
>                 URL: https://issues.apache.org/jira/browse/ARROW-8964
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.17.1
>         Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow 
> 0.17.1
>            Reporter: Ira Saktor
>            Priority: Major
>
> Hi there, I'm encountering the following issue when reading from HDFS:
>  
> *My situation:*
> I have a partitioned parquet dataset in HDFS whose recent partitions contain 
> parquet files with more columns than the older ones. When I try to read data 
> using pyarrow.dataset.dataset and filter on recent data, I still get only the 
> columns that are also contained in the old parquet files. I'd like to somehow 
> merge the schemas, or use the schema of the parquet files from which data 
> ends up being loaded.
> *when using:*
> `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning='hive')
> .to_table(filter=my_filter_expression).to_pandas()`
> Is there a way to handle schema changes so that the data read contains all 
> columns? Everything works fine when I copy the needed parquet files into a 
> separate folder, but that is a very inconvenient way of working. A possible 
> workaround sketch follows below.
>  
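> A minimal workaround sketch (assuming a recent pyarrow where `ds.dataset()` 
> accepts an explicit `schema` argument, fragments expose `physical_schema`, 
> and `pa.unify_schemas` is available; the path and filter below are 
> hypothetical placeholders):
>
>     import pyarrow as pa
>     import pyarrow.dataset as ds
>
>     # Hypothetical inputs standing in for the real HDFS path and filter.
>     path_to_hdfs_directory = "hdfs://namenode/path/to/dataset"
>     my_filter_expression = ds.field("event_date") >= "2020-05-01"
>
>     # Open the dataset once to discover its files/fragments.
>     dataset = ds.dataset(path_to_hdfs_directory, partitioning="hive")
>
>     # Unify the inferred dataset schema (which includes the hive
>     # partition columns) with each file's physical schema, so that
>     # columns present only in newer files are kept.
>     schemas = [dataset.schema] + [
>         fragment.physical_schema for fragment in dataset.get_fragments()
>     ]
>     unified_schema = pa.unify_schemas(schemas)
>
>     # Re-open with the unified schema; older files missing a column
>     # return nulls for it instead of dropping it from the result.
>     table = ds.dataset(
>         path_to_hdfs_directory, partitioning="hive", schema=unified_schema
>     ).to_table(filter=my_filter_expression)
>     df = table.to_pandas()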



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
