[jira] [Comment Edited] (SPARK-19256) Hive bucketing support

2018-02-02 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345397#comment-16345397
 ] 

Fernando Pereira edited comment on SPARK-19256 at 2/2/18 8:50 AM:
--

Thanks a lot for this great contribution to Spark.

I was just wondering, would it make sense to apply this to direct outputs (e.g. 
write.parquet()), so that we could keep partitioning information - and again 
avoid reshuffling data before a merge? I believe this is mostly what 
saveAsTable() does by default in Spark, but to my mind it would improve the 
DataFrame write API and make these performance benefits more accessible.

Update:
I've just noticed that this has been considered in 
https://github.com/apache/spark/pull/13452.
[~cloud_fan] - Is there an issue to follow up on this feature? Eventually we 
could simply store a metadata JSON file together with the data files.
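
To make the metadata-file idea concrete, here is a minimal sketch in plain Python. It is not Spark's API and nothing in Spark reads such a file: the file name `_bucketing_metadata.json`, the field names, and the helper functions are all hypothetical, chosen only to show what "store the layout next to the data files" could look like.

```python
import json
from pathlib import Path

# Hypothetical sidecar file name; Spark does not define or read this file.
METADATA_FILE = "_bucketing_metadata.json"


def write_bucketing_metadata(output_dir, num_buckets, bucket_columns, sort_columns=()):
    """Store the (assumed) bucketing layout next to the data files as JSON."""
    metadata = {
        "numBuckets": num_buckets,
        "bucketColumns": list(bucket_columns),
        "sortColumns": list(sort_columns),
    }
    path = Path(output_dir) / METADATA_FILE
    path.write_text(json.dumps(metadata, indent=2))
    return path


def read_bucketing_metadata(input_dir):
    """Return the stored layout, or None if the directory has no sidecar file."""
    path = Path(input_dir) / METADATA_FILE
    if not path.exists():
        return None
    return json.loads(path.read_text())


def layouts_compatible(left, right):
    """Simplified check: both sides record the same bucket count and columns.

    A reader could use this to decide whether a reshuffle before a merge
    might be skipped; a real implementation would need more than this.
    """
    if left is None or right is None:
        return False
    return (
        left["numBuckets"] == right["numBuckets"]
        and left["bucketColumns"] == right["bucketColumns"]
    )
```

A writer would call `write_bucketing_metadata(path, 8, ["user_id"])` after the data files are written, and a reader would compare the two layouts with `layouts_compatible` before planning the merge.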


was (Author: ferdonline):
Thanks a lot for this great contribution to Spark.

I was just wondering, would it make sense to apply this to direct outputs (e.g. 
write.parquet()), so that we could keep partitioning information - and again 
avoid reshuffling data before a merge? I believe this is mostly what 
saveAsTable() does by default in Spark, but to my mind it would improve the 
DataFrame write API and make these performance benefits more accessible.

> Hive bucketing support
> --
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Tejas Patil
> Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19256) Hive bucketing support

2018-01-30 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345397#comment-16345397
 ] 

Fernando Pereira edited comment on SPARK-19256 at 1/30/18 5:16 PM:
---

Thanks a lot for this great contribution to Spark.

I was just wondering, would it make sense to apply this to direct outputs (e.g. 
write.parquet()), so that we could keep partitioning information - and again 
avoid reshuffling data before a merge? I believe this is mostly what 
saveAsTable() does by default in Spark, but to my mind it would improve the 
DataFrame write API and make these performance benefits more accessible.





