[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2020-03-14 Wes McKinney (Jira)


 [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:

Labels: dataset dataset-parquet-read parquet  (was: dataset-parquet-read parquet)

> [Python] Support reading specific partitions from a partitioned parquet 
> dataset
> ---
>
> Key: ARROW-1956
> URL: https://issues.apache.org/jira/browse/ARROW-1956
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.8.0
> Environment: Kernel: 4.14.8-300.fc27.x86_64
> Python: 3.6.3
> Reporter: Suvayu Ali
> Priority: Minor
> Labels: dataset, dataset-parquet-read, parquet
> Attachments: so-example.py
>
>
> I want to read specific partitions from a partitioned parquet dataset.  This 
> is very useful for large datasets.  I have attached a small script that 
> creates a dataset and shows what is expected when reading (quoting the 
> salient points below).
> # There is no way to read specific partitions in pandas.
> # In pyarrow I tried to achieve the goal by providing a list of 
> files/directories to ParquetDataset, but it didn't work (a possible 
> workaround is sketched after this list).
> # In PySpark it works if I simply do:
> {code:python}
> spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
> {code}
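> A rough sketch of how this could look in pyarrow: later versions of 
> ParquetDataset accept a {{filters}} argument for exactly this purpose.  The 
> partition columns {{year}} and {{month}} below are illustrative, not taken 
> from the attached script:
> {code:python}
> import pyarrow.parquet as pq
> 
> # Read only the partitions matching the predicate; directories that do
> # not match are never opened.
> dataset = pq.ParquetDataset(
>     'datadir',
>     filters=[('year', '=', 2018), ('month', 'in', [1, 2])],
> )
> table = dataset.read()    # pyarrow.Table
> df = table.to_pandas()    # pandas.DataFrame, if needed
> {code}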
> I also couldn't find an easy way to write partitioned parquet files.  In the 
> end I did it by hand, creating the directory hierarchy and writing the 
> individual files myself (similar to the implementation in the attached 
> script).  Again, in PySpark I can simply do
> {code:python}
> df.write.partitionBy(*list_of_partitions).parquet(output)
> {code}
> to achieve that.
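> For the writing side, a minimal sketch assuming 
> {{pyarrow.parquet.write_to_dataset}} and the same illustrative column names:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> df = pd.DataFrame({'year': [2018, 2018, 2019],
>                    'month': [1, 2, 1],
>                    'value': [0.5, 1.5, 2.5]})
> table = pa.Table.from_pandas(df)
> 
> # Writes output/year=2018/month=1/<file>.parquet and so on, building the
> # directory hierarchy automatically instead of by hand.
> pq.write_to_dataset(table, root_path='output',
>                     partition_cols=['year', 'month'])
> {code}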



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2020-03-12 Joris Van den Bossche (Jira)


 [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-1956:
Labels: dataset-parquet-read parquet  (was: parquet)



[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2020-01-07 Wes McKinney (Jira)


 [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:

Fix Version/s: (was: 0.16.0)



[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2019-06-11 Wes McKinney (JIRA)


 [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:

Fix Version/s: 1.0.0  (was: 0.14.0)



[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2019-02-27 Wes McKinney (JIRA)


 [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:

Fix Version/s: 0.14.0  (was: 0.13.0)



[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2018-11-29 Wes McKinney (JIRA)


 [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:

Component/s: Python  (was: Format)



[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2018-11-14 Wes McKinney (JIRA)


 [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:

Fix Version/s: 0.13.0  (was: 0.12.0)



[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2018-09-15 Wes McKinney (JIRA)


 [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:

Fix Version/s: 0.12.0  (was: 0.11.0)



[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2018-07-06 Wes McKinney (JIRA)


 [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:

Fix Version/s: 0.11.0  (was: 0.10.0)



[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2018-02-07 Wes McKinney (JIRA)

 [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:

Fix Version/s: 0.10.0  (was: 0.9.0)



[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2018-01-23 Wes McKinney (JIRA)

 [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:

Summary: [Python] Support reading specific partitions from a partitioned parquet dataset  (was: Support reading specific partitions from a partitioned parquet dataset)
