[ https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-8964:
--------------------------------
    Summary: [Python][Parquet] improve reading of partitioned parquet datasets whose schema changed  (was: Pyarrow: improve reading of partitioned parquet datasets whose schema changed)

> [Python][Parquet] improve reading of partitioned parquet datasets whose schema changed
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-8964
>                 URL: https://issues.apache.org/jira/browse/ARROW-8964
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.17.1
>        Environment: Ubuntu 18.04, latest miniconda with Python 3.7, pyarrow 0.17.1
>            Reporter: Ira Saktor
>            Priority: Major
>
> Hi there, I'm encountering the following issue when reading from HDFS:
>
> *My situation:*
> I have a partitioned Parquet dataset in HDFS whose recent partitions contain Parquet files with more columns than the older ones. When I try to read data using pyarrow.dataset.dataset and filter on recent data, I still get only the columns that are also present in the old Parquet files. I'd like to somehow merge the schemas, or use the schema of the Parquet files from which the data actually ends up being loaded.
> *When using:*
> `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning='hive').to_table(filter=my_filter_expression).to_pandas()`
> Is there a way to handle schema changes so that the data read back contains all columns?
> Everything works fine when I copy the needed Parquet files into a separate folder, but that is a very inconvenient way of working.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)