[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-07-21 Thread Lefty Leverenz (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lefty Leverenz updated HIVE-10673:
--
Labels: TODOC1.3  (was: )

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
  Labels: TODOC1.3
 Fix For: 1.3.0, 2.0.0

 Attachments: HIVE-10673.1.patch, HIVE-10673.10.patch, 
 HIVE-10673.11.patch, HIVE-10673.12, HIVE-10673.2.patch, HIVE-10673.3.patch, 
 HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, 
 HIVE-10673.7.patch, HIVE-10673.8.patch, HIVE-10673.9.patch


 Analysis of shuffle join queries by [~mmokhtar]/[~gopalv] found that about 
 2/3 of the CPU time was spent on sorting/merging.
 Although MR cannot do this, other execution engines (such as Tez) can run a 
 reduce-side join over unsorted inputs, which eliminates the sorting and may 
 be faster than a shuffle join. To join unsorted inputs, the reducer can use 
 the hash join algorithm. This requires the small tables in the join to fit 
 in the reducer's hash table.
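
 A minimal sketch of the idea described above, written in Python for
 illustration only. It is not the code from the HIVE-10673 patches, and the
 function names (partition, reducer_hash_join, partitioned_hash_join) are
 hypothetical. It assumes rows are tuples with the join key in column 0 and
 that each reducer's share of the small table fits in memory.

```python
from collections import defaultdict


def partition(rows, key_index, num_reducers):
    """Route each row to a reducer by hashing its join key (no sorting)."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[hash(row[key_index]) % num_reducers].append(row)
    return buckets


def reducer_hash_join(small_rows, big_rows, small_key=0, big_key=0):
    """Inner join inside one reducer: build on the small side, probe with the big side."""
    # Build phase: the small side must fit in this in-memory hash table.
    table = defaultdict(list)
    for row in small_rows:
        table[row[small_key]].append(row)
    # Probe phase: stream the big side in arrival order, unsorted.
    joined = []
    for row in big_rows:
        for match in table.get(row[big_key], ()):
            joined.append(row + match)
    return joined


def partitioned_hash_join(small, big, num_reducers=4):
    """Partition both inputs on the join key, then hash-join each partition."""
    small_parts = partition(small, 0, num_reducers)
    big_parts = partition(big, 0, num_reducers)
    out = []
    for small_part, big_part in zip(small_parts, big_parts):
        out.extend(reducer_hash_join(small_part, big_part))
    return out
```

 Because both inputs are partitioned with the same hash function, rows with
 equal keys land in the same reducer, so no reducer needs sorted input.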



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-07-21 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Release Note: This adds the configuration parameter 
hive.optimize.dynamic.partition.hashjoin, which enables selection of the 
dynamically partitioned hash join with the Tez execution engine.

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Fix For: 1.3.0, 2.0.0

 Attachments: HIVE-10673.1.patch, HIVE-10673.10.patch, 
 HIVE-10673.11.patch, HIVE-10673.12, HIVE-10673.2.patch, HIVE-10673.3.patch, 
 HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, 
 HIVE-10673.7.patch, HIVE-10673.8.patch, HIVE-10673.9.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-07-20 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Attachment: HIVE-10673.12

Patch v12: rebase with trunk and add a comment per Vikram's feedback.

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.10.patch, 
 HIVE-10673.11.patch, HIVE-10673.12, HIVE-10673.2.patch, HIVE-10673.3.patch, 
 HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, 
 HIVE-10673.7.patch, HIVE-10673.8.patch, HIVE-10673.9.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-07-13 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Attachment: HIVE-10673.11.patch

The golden files added in this patch needed to be regenerated after HIVE-11152. 
Attaching patch v11.
TestCliDriver.testCliDriver_index_auto_mult_tables_compact and 
TestJdbcWithLocalClusterSpark.testTempTable do not fail when I run them locally 
with the patch.

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.10.patch, 
 HIVE-10673.11.patch, HIVE-10673.2.patch, HIVE-10673.3.patch, 
 HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, 
 HIVE-10673.7.patch, HIVE-10673.8.patch, HIVE-10673.9.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-07-10 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Attachment: HIVE-10673.10.patch

Patch v10:
- Rebase with trunk; it looks like some methods in GenTezUtils were converted to 
static.
- When selecting distributed hash join, the join operator should get 
OpTraits/stats set.
- For the issue regarding the flattened expressions in the vectorized 
rowObjectInspector, change the workaround to un-flatten the object inspector 
during JoinUtil.getObjectInspectorsFromEvaluators(). This is still a bit of a 
workaround, but it only requires a change in 1 place, rather than the 2 changes 
needed in the previous solution (modifying the column names during 
vectorized MapJoinOperator, as well as when generating the vectorized 
rowObjectInspector in VectorizedBatchUtil).
- In the reducer, only the big table's input source should be vectorized.

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.10.patch, 
 HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, 
 HIVE-10673.5.patch, HIVE-10673.6.patch, HIVE-10673.7.patch, 
 HIVE-10673.8.patch, HIVE-10673.9.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-07-06 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Attachment: HIVE-10673.9.patch

Precommit tests never ran - re-uploading patch

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, 
 HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, 
 HIVE-10673.6.patch, HIVE-10673.7.patch, HIVE-10673.8.patch, HIVE-10673.9.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-07-02 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Attachment: HIVE-10673.8.patch

Fixing failure in tez_smb_1.q: the big table position in 
CommonMergeJoinOperator and in the ReduceWork were different; they need to be 
consistent for the merge join to work properly.

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, 
 HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, 
 HIVE-10673.6.patch, HIVE-10673.7.patch, HIVE-10673.8.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-07-01 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Attachment: HIVE-10673.7.patch

Update unit tests

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, 
 HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, 
 HIVE-10673.6.patch, HIVE-10673.7.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-06-30 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Description: 
Analysis of shuffle join queries by [~mmokhtar]/[~gopalv] found that about 2/3 
of the CPU time was spent on sorting/merging.
Although MR cannot do this, other execution engines (such as Tez) can run a 
reduce-side join over unsorted inputs, which eliminates the sorting and may be 
faster than a shuffle join. To join unsorted inputs, the reducer can use the 
hash join algorithm. This requires the small tables in the join to fit in the 
reducer's hash table.

  was:Reduce-side hash join (using MapJoinOperator), where the Tez inputs to 
the reducer are unsorted.


 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, 
 HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-06-30 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Attachment: HIVE-10673.6.patch

patch v6 - review feedback from Vikram

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, 
 HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-06-29 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Issue Type: New Feature  (was: Bug)

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, 
 HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch


 Reduce-side hash join (using MapJoinOperator), where the Tez inputs to the 
 reducer are unsorted.





[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-06-22 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Attachment: HIVE-10673.5.patch

Patch v5 - rebasing with trunk

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: Bug
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, 
 HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-06-05 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Attachment: HIVE-10673.4.patch

Patch v4: proper rebase of v2 (I hope).

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: Bug
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, 
 HIVE-10673.3.patch, HIVE-10673.4.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-05-14 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Attachment: HIVE-10673.2.patch

Patch v2 - addressing RB feedback from [~apivovarov]

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: Bug
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch







[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez

2015-05-11 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-10673:
--
Attachment: HIVE-10673.1.patch

Initial patch

 Dynamically partitioned hash join for Tez
 -

 Key: HIVE-10673
 URL: https://issues.apache.org/jira/browse/HIVE-10673
 Project: Hive
  Issue Type: Bug
  Components: Query Planning, Query Processor
Reporter: Jason Dere
Assignee: Jason Dere
 Attachments: HIVE-10673.1.patch




