[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lefty Leverenz updated HIVE-10673:
----------------------------------
    Labels: TODOC1.3  (was: )

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: New Feature
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Labels: TODOC1.3
    Fix For: 1.3.0, 2.0.0
    Attachments: HIVE-10673.1.patch, HIVE-10673.10.patch, HIVE-10673.11.patch, HIVE-10673.12, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, HIVE-10673.7.patch, HIVE-10673.8.patch, HIVE-10673.9.patch

Some analysis of shuffle join queries by [~mmokhtar]/[~gopalv] found about 2/3 of the CPU was spent during sorting/merging.

While this does not work for MR, for other execution engines (such as Tez), it is possible to create a reduce-side join that uses unsorted inputs in order to eliminate the sorting, which may be faster than a shuffle join. To join on unsorted inputs, we can use the hash join algorithm to perform the join in the reducer. This will require the small tables in the join to fit in the reducer/hash table for this to work.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
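The idea in the description above (hash-partition both inputs to reducers without sorting them, then hash-join inside each reducer) can be sketched roughly as follows. This is an illustrative toy in Python, not Hive's implementation: the function names, the dict-based rows, and the `num_reducers` parameter are all assumptions made for the example.

```python
from collections import defaultdict


def partition(rows, key, num_reducers):
    """Route each row to a reducer by hashing its join key (no sorting involved)."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[hash(row[key]) % num_reducers].append(row)
    return buckets


def hash_join_partition(small_rows, big_rows, key):
    """Join one reducer's share: build a hash table on the small side, probe with the big side."""
    table = defaultdict(list)
    for row in small_rows:  # build phase: the small side must fit in memory
        table[row[key]].append(row)
    for big in big_rows:  # probe phase: big side is streamed in arrival order
        for small in table.get(big[key], ()):
            yield {**small, **big}


def partitioned_hash_join(small_rows, big_rows, key, num_reducers=4):
    """Inner join of two unsorted inputs, one hash join per simulated reducer."""
    small_parts = partition(small_rows, key, num_reducers)
    big_parts = partition(big_rows, key, num_reducers)
    joined = []
    for small_part, big_part in zip(small_parts, big_parts):
        joined.extend(hash_join_partition(small_part, big_part, key))
    return joined


small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
big = [{"id": 2, "amt": 10}, {"id": 1, "amt": 5}, {"id": 2, "amt": 7}, {"id": 3, "amt": 9}]
result = partitioned_hash_join(small, big, "id", num_reducers=3)
```

Because rows with the same key always hash to the same reducer, each reducer only needs its own slice of the small table in memory, which is the constraint the description mentions.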
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Release Note: This adds configuration parameter hive.optimize.dynamic.partition.hashjoin, which enables selection of the dynamically partitioned hash join with the Tez execution engine.

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: New Feature
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Fix For: 1.3.0, 2.0.0
    Attachments: HIVE-10673.1.patch, HIVE-10673.10.patch, HIVE-10673.11.patch, HIVE-10673.12, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, HIVE-10673.7.patch, HIVE-10673.8.patch, HIVE-10673.9.patch
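As a rough illustration of how the parameter named in the release note might be switched on for a session, the sketch below renders HiveQL `SET` statements from a dictionary. The helper `render_session_settings` is hypothetical, and the example assumes the parameter takes a boolean value, which the release note does not state; `hive.execution.engine` is the standard Hive property for choosing Tez.

```python
def render_session_settings(settings):
    """Render a mapping of Hive property names to values as HiveQL SET statements."""
    return [f"SET {name}={value};" for name, value in settings.items()]


# Assumed values: the release note names the parameter but not its type or default.
SETTINGS = {
    "hive.execution.engine": "tez",
    "hive.optimize.dynamic.partition.hashjoin": "true",
}

statements = render_session_settings(SETTINGS)
for statement in statements:
    print(statement)
```

The statements would be run at the start of a Hive session, before the join query itself.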
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Attachment: HIVE-10673.12

patch v12: rebase with trunk, adding comment per Vikram's feedback.

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: New Feature
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.10.patch, HIVE-10673.11.patch, HIVE-10673.12, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, HIVE-10673.7.patch, HIVE-10673.8.patch, HIVE-10673.9.patch
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Attachment: HIVE-10673.11.patch

The golden files added in this patch needed to be regenerated after HIVE-11152. Attaching patch v11.

TestCliDriver.testCliDriver_index_auto_mult_tables_compact and TestJdbcWithLocalClusterSpark.testTempTable do not fail when I run them locally with the patch.

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: New Feature
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.10.patch, HIVE-10673.11.patch, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, HIVE-10673.7.patch, HIVE-10673.8.patch, HIVE-10673.9.patch
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Attachment: HIVE-10673.10.patch

Patch v10:
- Rebase with trunk, looks like some methods in GenTezUtils were converted to static
- When selecting distributed hash join, the join operator should get OpTraits/stats set
- For the issue regarding the flattened expressions in the vectorized rowObjectInspector, change the workaround to un-flatten the object inspector during JoinUtil.getObjectInspectorsFromEvaluators(). This is still a bit of a workaround, but only requires a change in 1 place, rather than the 2 changes needed in the previous solution (having to modify the column names during vectorized MapJoinOperator, as well as when generating the vectorized rowObjectInspector in VectorizedBatchUtil)
- In the reducer, only the big table's input source should be vectorized

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: New Feature
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.10.patch, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, HIVE-10673.7.patch, HIVE-10673.8.patch, HIVE-10673.9.patch
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Attachment: HIVE-10673.9.patch

Precommit tests never ran - re-uploading patch

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: New Feature
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, HIVE-10673.7.patch, HIVE-10673.8.patch, HIVE-10673.9.patch
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Attachment: HIVE-10673.8.patch

Fixing failure in tez_smb_1.q - the big table position in CommonMergeJoinOperator and the ReduceWork were different, they need to be consistent for the merge join to work properly.

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: New Feature
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, HIVE-10673.7.patch, HIVE-10673.8.patch
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Attachment: HIVE-10673.7.patch

Update unit tests

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: New Feature
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch, HIVE-10673.7.patch
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Description:
Some analysis of shuffle join queries by [~mmokhtar]/[~gopalv] found about 2/3 of the CPU was spent during sorting/merging.

While this does not work for MR, for other execution engines (such as Tez), it is possible to create a reduce-side join that uses unsorted inputs in order to eliminate the sorting, which may be faster than a shuffle join. To join on unsorted inputs, we can use the hash join algorithm to perform the join in the reducer. This will require the small tables in the join to fit in the reducer/hash table for this to work.

    was:Reduce-side hash join (using MapJoinOperator), where the Tez inputs to the reducer are unsorted.

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: New Feature
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Attachment: HIVE-10673.6.patch

patch v6 - review feedback from Vikram

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: New Feature
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch, HIVE-10673.6.patch
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Issue Type: New Feature  (was: Bug)

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: New Feature
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch

Reduce-side hash join (using MapJoinOperator), where the Tez inputs to the reducer are unsorted.
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Attachment: HIVE-10673.5.patch

Patch v5 - rebasing with trunk

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: Bug
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch, HIVE-10673.5.patch
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Attachment: HIVE-10673.4.patch

Patch v4: proper rebase of v2 (I hope).

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: Bug
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch, HIVE-10673.3.patch, HIVE-10673.4.patch
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Attachment: HIVE-10673.2.patch

Patch v2 - addressing RB feedback from [~apivovarov]

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: Bug
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch, HIVE-10673.2.patch
[jira] [Updated] (HIVE-10673) Dynamically partitioned hash join for Tez
[ https://issues.apache.org/jira/browse/HIVE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-10673:
------------------------------
    Attachment: HIVE-10673.1.patch

Initial patch

Dynamically partitioned hash join for Tez
-----------------------------------------
    Key: HIVE-10673
    URL: https://issues.apache.org/jira/browse/HIVE-10673
    Project: Hive
    Issue Type: Bug
    Components: Query Planning, Query Processor
    Reporter: Jason Dere
    Assignee: Jason Dere
    Attachments: HIVE-10673.1.patch