[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2018-01-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345598#comment-16345598
 ] 

Thomas Bünger commented on SPARK-12394:
---

Any news on this issue? Is it really fixed? I also can't find a corresponding 
pull request.

> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on.  Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
>  Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitioning of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that its distributed by these 
> columns
>  - When planning/executing  a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc
>  - Should work at least with all Hadoops FS data sources
>  - Should work with any data source when caching



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2017-10-06 Thread Jeetendra Nihalani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195004#comment-16195004
 ] 

Jeetendra Nihalani commented on SPARK-12394:


Is this issue resolved? If yes, can someone update this with a pull request 
which resolved the issue.

> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on.  Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
>  Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitioning of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that its distributed by these 
> columns
>  - When planning/executing  a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc
>  - Should work at least with all Hadoops FS data sources
>  - Should work with any data source when caching



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2016-08-27 Thread Darren Fu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15441782#comment-15441782
 ] 

Darren Fu commented on SPARK-12394:
---

Good news, Tejas!
I think SMB join is a required feature to make SparkSQL to compete with 
traditional MPP databases (such as Teradata and DB2) on large table join. I 
have benchmarked that Hive did well in such map-side join, and Spark should 
support that as well.
I noticed that there is a bucketBy() method implemented in Spark 2.0: 
https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L168-L171
What is used for now?

> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on.  Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
>  Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitioning of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that its distributed by these 
> columns
>  - When planning/executing  a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc
>  - Should work at least with all Hadoops FS data sources
>  - Should work with any data source when caching



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2016-08-25 Thread Alexander Tronchin-James (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438461#comment-15438461
 ] 

Alexander Tronchin-James commented on SPARK-12394:
--

Awesome news Tejas!

The filter feature is secondary AFAIK, and I'd prioritize the sorted-merge
bucketed-map (SMB) join if I had the choice. Strong preference for
supporting all of inner and left/right/full outer joins between tables with
integer multiple differences in the number of buckets, selecting the number
of executors (a rational multiple or fraction of the number of buckets),
and also selecting the number of emitted buckets. Bonus points for an
implementation that automatically applies SMB joins and avoids
re-sorts where possible. Maybe a tall order, but we know it can be done.
;-)

If we don't get it all in the first pull request we can always iterate.
Thanks for pushing on this!






> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on.  Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
>  Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitioning of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that its distributed by these 
> columns
>  - When planning/executing  a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc
>  - Should work at least with all Hadoops FS data sources
>  - Should work with any data source when caching



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2016-08-25 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438423#comment-15438423
 ] 

Tejas Patil commented on SPARK-12394:
-

[~alex.n.ja...@gmail.com] [~kotim...@gmail.com] : The doc lists two things for 
future in the end. I am working on the last one : 
https://issues.apache.org/jira/browse/SPARK-15453. I am not sure if the `Filter 
on sorted data` one is already being worked on but I can work on that as well 
(just created a jira for that : 
https://issues.apache.org/jira/browse/SPARK-17254)

> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on.  Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
>  Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitioning of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that its distributed by these 
> columns
>  - When planning/executing  a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc
>  - Should work at least with all Hadoops FS data sources
>  - Should work with any data source when caching



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2016-08-25 Thread Darren Fu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438317#comment-15438317
 ] 

Darren Fu commented on SPARK-12394:
---

Any luck to see this feature implemented in v2.0?

> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on.  Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
>  Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitioning of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that its distributed by these 
> columns
>  - When planning/executing  a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc
>  - Should work at least with all Hadoops FS data sources
>  - Should work with any data source when caching



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2016-08-22 Thread Alexander Tronchin-James (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431322#comment-15431322
 ] 

Alexander Tronchin-James commented on SPARK-12394:
--

Where can we read about/contribute to efforts for implementing the filter on 
sorted data and optimized sort merge join strategies mentioned in the attached 
BucketedTables.pdf? Looking forward to these features!

> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on.  Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
>  Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitioning of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that its distributed by these 
> columns
>  - When planning/executing  a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc
>  - Should work at least with all Hadoops FS data sources
>  - Should work with any data source when caching



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org