[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:
--------------------------------
    Labels: dataset dataset-parquet-read parquet  (was: dataset-parquet-read parquet)

> [Python] Support reading specific partitions from a partitioned parquet dataset
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-1956
>                 URL: https://issues.apache.org/jira/browse/ARROW-1956
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.8.0
>         Environment: Kernel: 4.14.8-300.fc27.x86_64
>                      Python: 3.6.3
>            Reporter: Suvayu Ali
>            Priority: Minor
>              Labels: dataset, dataset-parquet-read, parquet
>         Attachments: so-example.py
>
> I want to read specific partitions from a partitioned parquet dataset. This
> is very useful in the case of large datasets. I have attached a small script
> that creates a dataset and shows what is expected when reading (quoting
> salient points below).
> # There is no way to read specific partitions in Pandas.
> # In pyarrow I tried to achieve the goal by providing a list of
> files/directories to ParquetDataset, but it didn't work.
> # In PySpark it works if I simply do:
> {code:none}
> spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
> {code}
> I also couldn't find a way to easily write partitioned parquet files. In the
> end I did it by hand, creating the directory hierarchies and writing the
> individual files myself (similar to the implementation in the attached
> script). Again, in PySpark I can do
> {code:none}
> df.write.partitionBy(*list_of_partitions).parquet(output)
> {code}
> to achieve that.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
Joris Van den Bossche updated ARROW-1956:
-----------------------------------------
    Labels: dataset-parquet-read parquet  (was: parquet)
Wes McKinney updated ARROW-1956:
--------------------------------
    Fix Version/s:     (was: 0.16.0)
Wes McKinney updated ARROW-1956:
--------------------------------
    Fix Version/s: 1.0.0  (was: 0.14.0)
Wes McKinney updated ARROW-1956:
--------------------------------
    Fix Version/s: 0.14.0  (was: 0.13.0)
Wes McKinney updated ARROW-1956:
--------------------------------
    Component/s: Python  (was: Format)
Wes McKinney updated ARROW-1956:
--------------------------------
    Fix Version/s: 0.13.0  (was: 0.12.0)
Wes McKinney updated ARROW-1956:
--------------------------------
    Fix Version/s: 0.12.0  (was: 0.11.0)
Wes McKinney updated ARROW-1956:
--------------------------------
    Fix Version/s: 0.11.0  (was: 0.10.0)
Wes McKinney updated ARROW-1956:
--------------------------------
    Fix Version/s: 0.10.0  (was: 0.9.0)
Wes McKinney updated ARROW-1956:
--------------------------------
    Summary: [Python] Support reading specific partitions from a partitioned parquet dataset  (was: Support reading specific partitions from a partitioned parquet dataset)