[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=468139&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-468139 ]

ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 08/Aug/20 01:22
Start Date: 08/Aug/20 01:22
Worklog Time Spent: 10m

Work Description: jcamachor merged pull request #1147:
URL: https://github.com/apache/hive/pull/1147

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 468139)
Time Spent: 16h 50m (was: 16h 40m)

> Support Anti Join in Hive
> -------------------------
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
> Issue Type: Bug
> Reporter: mahesh kumar behera
> Assignee: mahesh kumar behera
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
> Time Spent: 16h 50m
> Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query calling for an anti join is instead converted to a left outer join, and a null filter on the right-side join key is added to get the desired result. This causes:
> # Extra computation — the left outer join projects redundant columns from the right side, and additional filtering is then needed to remove the redundant rows. An anti join avoids both, since it projects only the required columns and rows from the left-side table.
> # Extra shuffle — with an anti join, duplicate records can be eliminated at the child node before being moved to the join node. This can cut a significant amount of data movement when the number of distinct join keys is much smaller than the total row count.
> # Extra memory usage — for a map-based anti join, a hash set is sufficient, since only the key is needed to check whether a record matches the join condition. For a left join, both the key and the non-key columns are needed, so a hash table is required.
> For a query like
> {code:java}
> select wr_order_number FROM web_returns LEFT JOIN web_sales ON
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in the web_sales table in a typical 10 TB TPC-DS setup is just 10% of the total records. So when this query is converted to an anti join, only 600 million rows are moved to the join node instead of 7 billion.
> In the current patch, just one conversion is done: the pattern project->filter->left-join is converted to project->anti-join. This takes care of subqueries with a "not exists" clause. Queries with "not exists" are first converted to filter + left-join, and then to an anti join. Queries with "not in" are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution are supported for anti join.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
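The memory and computation points in the description can be illustrated with a small sketch. This is plain Python for illustration only, not Hive code; the row layout and key function are invented:

```python
# Contrast of the two strategies from the issue description:
# an anti join keeps only a hash SET of right-side keys, while the
# left-outer-join + IS NULL rewrite builds a hash TABLE of full right rows
# and produces intermediate rows that must be filtered away afterwards.

def anti_join(left_rows, right_rows, key):
    """Return left rows whose key has no match on the right.
    A hash set of right-side keys is all the state required."""
    right_keys = {key(r) for r in right_rows}
    return [l for l in left_rows if key(l) not in right_keys]

def left_join_then_null_filter(left_rows, right_rows, key):
    """The rewrite Hive used before anti join support: left outer join,
    then keep rows where the right side is NULL. Needs full right-side
    rows in the hash table and emits redundant matched rows."""
    right_table = {}
    for r in right_rows:
        right_table.setdefault(key(r), []).append(r)
    joined = []
    for l in left_rows:
        for m in right_table.get(key(l), [None]):  # None plays the role of SQL NULL padding
            joined.append((l, m))
    return [l for (l, m) in joined if m is None]
```

Both functions return the same rows, but only the second materializes right-side payload columns and intermediate matched rows, which is the redundancy the anti join conversion removes.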
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=468138&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-468138 ]

ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 08/Aug/20 01:19
Start Date: 08/Aug/20 01:19
Worklog Time Spent: 10m

Work Description: jcamachor commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r467343664

## File path: ql/src/test/results/clientpositive/perf/tez/cbo_query16_anti_join.q.out

@@ -0,0 +1,99 @@
+PREHOOK: query: explain cbo
+select
+   count(distinct cs_order_number) as `order count`
+  ,sum(cs_ext_ship_cost) as `total shipping cost`
+  ,sum(cs_net_profit) as `total net profit`
+from
+   catalog_sales cs1
+  ,date_dim
+  ,customer_address
+  ,call_center
+where
+    d_date between '2001-4-01' and
+           (cast('2001-4-01' as date) + 60 days)
+and cs1.cs_ship_date_sk = d_date_sk
+and cs1.cs_ship_addr_sk = ca_address_sk
+and ca_state = 'NY'
+and cs1.cs_call_center_sk = cc_call_center_sk
+and cc_county in ('Ziebach County','Levy County','Huron County','Franklin Parish',
+                  'Daviess County'
+)
+and exists (select *
+            from catalog_sales cs2
+            where cs1.cs_order_number = cs2.cs_order_number
+              and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
+and not exists(select *
+               from catalog_returns cr1
+               where cs1.cs_order_number = cr1.cr_order_number)
+order by count(distinct cs_order_number)
+limit 100
+PREHOOK: type: QUERY
+PREHOOK: Input: default@call_center
+PREHOOK: Input: default@catalog_returns
+PREHOOK: Input: default@catalog_sales
+PREHOOK: Input: default@customer_address
+PREHOOK: Input: default@date_dim
+PREHOOK: Output: hdfs://### HDFS PATH ###
+POSTHOOK: query: explain cbo
+select
+   count(distinct cs_order_number) as `order count`
+  ,sum(cs_ext_ship_cost) as `total shipping cost`
+  ,sum(cs_net_profit) as `total net profit`
+from
+   catalog_sales cs1
+  ,date_dim
+  ,customer_address
+  ,call_center
+where
+    d_date between '2001-4-01' and
+           (cast('2001-4-01' as date) + 60 days)
+and cs1.cs_ship_date_sk = d_date_sk
+and cs1.cs_ship_addr_sk = ca_address_sk
+and ca_state = 'NY'
+and cs1.cs_call_center_sk = cc_call_center_sk
+and cc_county in ('Ziebach County','Levy County','Huron County','Franklin Parish',
+                  'Daviess County'
+)
+and exists (select *
+            from catalog_sales cs2
+            where cs1.cs_order_number = cs2.cs_order_number
+              and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
+and not exists(select *
+               from catalog_returns cr1
+               where cs1.cs_order_number = cr1.cr_order_number)
+order by count(distinct cs_order_number)
+limit 100
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@call_center
+POSTHOOK: Input: default@catalog_returns
+POSTHOOK: Input: default@catalog_sales
+POSTHOOK: Input: default@customer_address
+POSTHOOK: Input: default@date_dim
+POSTHOOK: Output: hdfs://### HDFS PATH ###
+CBO PLAN:
+HiveAggregate(group=[{}], agg#0=[count(DISTINCT $4)], agg#1=[sum($5)], agg#2=[sum($6)])
+  HiveJoin(condition=[=($4, $14)], joinType=[anti], algorithm=[none], cost=[not available])

Review comment: Do we have a JIRA to explore this optimization?

Issue Time Tracking
---

Worklog Id: (was: 468138)
Time Spent: 16h 40m (was: 16.5h)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467800&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467800 ]

ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 07/Aug/20 10:35
Start Date: 07/Aug/20 10:35
Worklog Time Spent: 10m

Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466819149

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAntiSemiJoinRule.java

@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAntiJoin;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveAntiSemiJoinRule extends RelOptRule {
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveAntiSemiJoinRule.class);
+  public static final HiveAntiSemiJoinRule INSTANCE = new HiveAntiSemiJoinRule();
+
+  //    HiveProject(fld=[$0])
+  //      HiveFilter(condition=[IS NULL($1)])
+  //        HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available])
+  //
+  // TO
+  //
+  //    HiveProject(fld_tbl=[$0])
+  //      HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveAntiSemiJoinRule() {
+    super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+        "HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+    final Project project = call.rel(0);
+    final Filter filter = call.rel(1);
+    final Join join = call.rel(2);
+    perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) {
+    LOG.debug("Start Matching HiveAntiJoinRule");
+
+    //TODO : Need to support this scenario.
+    if (join.getCondition().isAlwaysTrue()) {
+      return;
+    }
+
+    //We support conversion from left outer join only.
+    if (join.getJoinType() != JoinRelType.LEFT) {
+      return;
+    }
+
+    assert (filter != null);
+
+    // If null filter is not present from right side then we can not convert to anti join.
+    List<RexNode> aboveFilters = RelOptUtil.conjunctions(filter.getCondition());
+    Stream<RexNode> nullFilters = aboveFilters.stream().filter(filterNode -> filterNode.getKind() == SqlKind.IS_NULL);
+    boolean hasNullFilter = HiveCalciteUtil.hasAnyExpressionFromRightSide(join, nullFilters.collect(Collectors.toList()));
+    if (!hasNullFilter) {
+      return;
+    }
+
+    // If any projection is there from right side, then we can not convert to anti join.
+    boolean hasProjection = HiveCalciteUtil.hasAnyExpressionFromRightSide(join, project.getProjects());
+    if (hasProjection) {
+      return;
+    }
+
+    LOG.debug("Matched HiveAntiJoinRule");
+
+    // Build anti join with same left, right child and condition as original left outer join.
+    Join anti = HiveAntiJoin.getAntiJoin(join.getLeft().getCluster(), join.getLeft().getTraitSet(),
+        join.getLeft(), join.getRight(), join.getCondition());
+    RelNode newProject = project.copy(project.getTraitSet(), anti, project.getProjects(), project.getRowType());
+    call.transformTo(newProject);

Review comment
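The structural checks in `perform` above can be paraphrased with a small sketch. This is plain Python over an invented toy plan encoding, not Calcite; column indices below the left-side width come from the left input, the rest from the right, mirroring how a join's output row type is laid out:

```python
# Toy version of the HiveAntiSemiJoinRule.perform checks: a left outer join
# may be rewritten to an anti join only if (a) the filter on top keeps an
# IS NULL test on a right-side column and (b) the project on top uses no
# right-side columns. The plan encoding is invented for illustration.

def can_convert_to_anti_join(join_type, filter_conjuncts, project_columns, n_left_cols):
    if join_type != "left":                       # conversion only from left outer join
        return False
    has_null_on_right = any(
        kind == "IS_NULL" and col >= n_left_cols  # columns >= n_left_cols come from the right input
        for kind, col in filter_conjuncts
    )
    if not has_null_on_right:                     # no IS NULL filter on the right side
        return False
    if any(col >= n_left_cols for col in project_columns):
        return False                              # a right-side column is projected, result would differ
    return True
```

For example, with one left column, a filter `IS NULL($1)` and a project of `$0` the conversion applies, while projecting `$1` (a right-side column) blocks it.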
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467717&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467717 ]

ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 07/Aug/20 05:22
Start Date: 07/Aug/20 05:22
Worklog Time Spent: 10m

Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466826953

## File path: ql/src/test/results/clientpositive/perf/tez/constraints/cbo_query94_anti_join.q.out

@@ -0,0 +1,94 @@
+PREHOOK: query: explain cbo
+select
+   count(distinct ws_order_number) as `order count`
+  ,sum(ws_ext_ship_cost) as `total shipping cost`
+  ,sum(ws_net_profit) as `total net profit`
+from
+   web_sales ws1
+  ,date_dim
+  ,customer_address
+  ,web_site
+where
+    d_date between '1999-5-01' and
+           (cast('1999-5-01' as date) + 60 days)
+and ws1.ws_ship_date_sk = d_date_sk
+and ws1.ws_ship_addr_sk = ca_address_sk
+and ca_state = 'TX'
+and ws1.ws_web_site_sk = web_site_sk
+and web_company_name = 'pri'
+and exists (select *
+            from web_sales ws2
+            where ws1.ws_order_number = ws2.ws_order_number
+              and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
+and not exists(select *
+               from web_returns wr1
+               where ws1.ws_order_number = wr1.wr_order_number)
+order by count(distinct ws_order_number)
+limit 100
+PREHOOK: type: QUERY
+PREHOOK: Input: default@customer_address
+PREHOOK: Input: default@date_dim
+PREHOOK: Input: default@web_returns
+PREHOOK: Input: default@web_sales
+PREHOOK: Input: default@web_site
+PREHOOK: Output: hdfs://### HDFS PATH ###
+POSTHOOK: query: explain cbo
+select
+   count(distinct ws_order_number) as `order count`
+  ,sum(ws_ext_ship_cost) as `total shipping cost`
+  ,sum(ws_net_profit) as `total net profit`
+from
+   web_sales ws1
+  ,date_dim
+  ,customer_address
+  ,web_site
+where
+    d_date between '1999-5-01' and
+           (cast('1999-5-01' as date) + 60 days)
+and ws1.ws_ship_date_sk = d_date_sk
+and ws1.ws_ship_addr_sk = ca_address_sk
+and ca_state = 'TX'
+and ws1.ws_web_site_sk = web_site_sk
+and web_company_name = 'pri'
+and exists (select *
+            from web_sales ws2
+            where ws1.ws_order_number = ws2.ws_order_number
+              and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
+and not exists(select *
+               from web_returns wr1
+               where ws1.ws_order_number = wr1.wr_order_number)
+order by count(distinct ws_order_number)
+limit 100
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@customer_address
+POSTHOOK: Input: default@date_dim
+POSTHOOK: Input: default@web_returns
+POSTHOOK: Input: default@web_sales
+POSTHOOK: Input: default@web_site
+POSTHOOK: Output: hdfs://### HDFS PATH ###
+CBO PLAN:
+HiveAggregate(group=[{}], agg#0=[count(DISTINCT $4)], agg#1=[sum($5)], agg#2=[sum($6)])
+  HiveJoin(condition=[=($4, $14)], joinType=[anti], algorithm=[none], cost=[not available])

Review comment: done

Issue Time Tracking
---

Worklog Id: (was: 467717)
Time Spent: 16h 20m (was: 16h 10m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467716&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467716 ]

ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 07/Aug/20 05:21
Start Date: 07/Aug/20 05:21
Worklog Time Spent: 10m

Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466826675

## File path: ql/src/test/results/clientpositive/perf/tez/cbo_query16_anti_join.q.out

@@ -0,0 +1,99 @@
+PREHOOK: query: explain cbo

Review comment: done

Issue Time Tracking
---

Worklog Id: (was: 467716)
Time Spent: 16h 10m (was: 16h)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467713&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467713 ]

ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 07/Aug/20 05:19
Start Date: 07/Aug/20 05:19
Worklog Time Spent: 10m

Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466826103

## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java

@@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal
         "Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. \n" +
         "If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the\n" +
         "specified size, the join is directly converted to a mapjoin (there is no conditional task)."),
-
+    HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false,

Review comment: done

Issue Time Tracking
---

Worklog Id: (was: 467713)
Time Spent: 16h (was: 15h 50m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467711&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467711 ]

ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 07/Aug/20 05:17
Start Date: 07/Aug/20 05:17
Worklog Time Spent: 10m

Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466825617

## File path: ql/src/test/results/clientpositive/llap/subquery_notexists_having.q.out

@@ -31,7 +31,8 @@ STAGE PLANS:
     Tez
 #### A masked pattern was here ####
       Edges:
-        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
+        Reducer 2 <- Map 1 (SIMPLE_EDGE)

Review comment: Yes, the join is getting converted to an SMB join, so no reducer is required. In the anti join case it is not getting converted. That is because the left outer path adds an extra group-by, which makes the RS nodes on the left and right side equal, the pre-condition for converting to an SMB join.

Issue Time Tracking
---

Worklog Id: (was: 467711)
Time Spent: 15h 50m (was: 15h 40m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467710&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467710 ]

ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 07/Aug/20 05:10
Start Date: 07/Aug/20 05:10
Worklog Time Spent: 10m

Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466823815

## File path: ql/src/test/results/clientpositive/llap/antijoin.q.out

@@ -0,0 +1,1007 @@
+PREHOOK: query: create table t1_n55 as select cast(key as int) key, value from src where key <= 10
+PREHOOK: type: CREATETABLE_AS_SELECT
+PREHOOK: Input: default@src
+PREHOOK: Output: database:default
+PREHOOK: Output: default@t1_n55
+POSTHOOK: query: create table t1_n55 as select cast(key as int) key, value from src where key <= 10
+POSTHOOK: type: CREATETABLE_AS_SELECT
+POSTHOOK: Input: default@src
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@t1_n55
+POSTHOOK: Lineage: t1_n55.key EXPRESSION [(src)src.FieldSchema(name:key, type:string, comment:default), ]
+POSTHOOK: Lineage: t1_n55.value SIMPLE [(src)src.FieldSchema(name:value, type:string, comment:default), ]
+PREHOOK: query: select * from t1_n55 sort by key
+PREHOOK: type: QUERY
+PREHOOK: Input: default@t1_n55
+#### A masked pattern was here ####
+POSTHOOK: query: select * from t1_n55 sort by key
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@t1_n55
+#### A masked pattern was here ####
+0	val_0

Review comment: now i have made anti join conversion to true by default

Issue Time Tracking
---

Worklog Id: (was: 467710)
Time Spent: 15h 40m (was: 15.5h)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467709&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467709 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 07/Aug/20 05:04 Start Date: 07/Aug/20 05:04 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r466822516 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java ## @@ -2129,6 +2133,16 @@ private RelNode applyPreJoinOrderingTransforms(RelNode basePlan, RelMetadataProv HiveRemoveSqCountCheck.INSTANCE); } + // 10. Convert left outer join + null filter on right side table column to anti join. Add this + // rule after all the optimization for which calcite support for anti join is missing. + // Needs to be done before ProjectRemoveRule as it expect a project over filter. + // This is done before join re-ordering as join re-ordering is converting the left outer Review comment: As discussed, i have created a Jira https://issues.apache.org/jira/browse/HIVE-24013 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 467709) Time Spent: 15.5h (was: 15h 20m) > Support Anti Join in Hive > -- > > Key: HIVE-23716 > URL: https://issues.apache.org/jira/browse/HIVE-23716 > Project: Hive > Issue Type: Bug >Reporter: mahesh kumar behera >Assignee: mahesh kumar behera >Priority: Major > Labels: pull-request-available > Attachments: HIVE-23716.01.patch > > Time Spent: 15.5h > Remaining Estimate: 0h > > Currently hive does not support Anti join. The query for anti join is > converted to left outer join and null filter on right side join key is added > to get the desired result. 
This is causing: > # Extra computation — The left outer join projects the redundant columns > from the right side, and filtering is then done to remove the redundant > rows. This can be avoided with anti join, which projects > only the required columns and rows from the left side table. > # Extra shuffle — With anti join, duplicate records can be dropped at the > child node before being moved to the join node. This can reduce a significant > amount of data movement when the number of distinct rows (join keys) is much > smaller than the total row count. > # Extra memory usage — For a map-based anti join, a hash set is > sufficient, as just the key is required to check whether a record matches the > join condition. For a left join, both the key and the non-key columns > are needed, so a hash table is required. > For a query like > {code:java} > select wr_order_number FROM web_returns LEFT JOIN web_sales ON > wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code} > the number of distinct ws_order_number values in the web_sales table in a typical 10TB > TPC-DS setup is just 10% of the total records. So when this query is converted to > anti join, only 600 million rows are moved to the join node instead of 7 billion. > In the current patch, just one conversion is done: the pattern > project->filter->left-join is converted to project->anti-join. This takes > care of subqueries with a “not exists” clause. Queries with “not exists” > are first converted to filter + left-join, and then converted to anti > join. Queries with “not in” are not handled in the current patch. > On the execution side, both merge join and map join with vectorized execution > are supported for anti join. -- This message was sent by Atlassian Jira (v8.3.4#803005)
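The memory and shuffle points above can be illustrated outside Hive: an anti join only needs a hash set of right-side join keys, and duplicate build-side keys collapse in that set. This is a minimal, self-contained sketch (the class and method names are invented for illustration; this is not Hive's vectorized operator):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AntiJoinSketch {
    // A left row is emitted iff its key has no match on the right side.
    // Only the right-side keys are needed, so a Set suffices (no payload columns).
    static List<Long> antiJoin(List<Long> leftKeys, List<Long> rightKeys) {
        Set<Long> build = new HashSet<>(rightKeys);   // duplicate keys collapse here
        List<Long> out = new ArrayList<>();
        for (Long k : leftKeys) {
            if (!build.contains(k)) {
                out.add(k);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // wr_order_number-style probe side against a ws_order_number-style build side.
        List<Long> left = List.of(1L, 2L, 3L, 4L);
        List<Long> right = List.of(2L, 2L, 4L);       // duplicates add nothing to the set
        System.out.println(antiJoin(left, right));    // [1, 3]
    }
}
```

The collapse of duplicate build keys is the same effect that lets Hive move only the distinct join keys (600 million instead of 7 billion rows in the TPC-DS example) to the join node.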
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467708&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467708 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 07/Aug/20 04:52 Start Date: 07/Aug/20 04:52 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r466819572 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java ## @@ -92,7 +104,7 @@ public void onMatch(RelOptRuleCall call) { Set rightPushedPredicates = Sets.newHashSet(registry.getPushedPredicates(join, 1)); boolean genPredOnLeft = join.getJoinType() == JoinRelType.RIGHT || join.getJoinType() == JoinRelType.INNER || join.isSemiJoin(); -boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || join.getJoinType() == JoinRelType.INNER || join.isSemiJoin(); +boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || join.getJoinType() == JoinRelType.INNER || join.isSemiJoin()|| join.getJoinType() == JoinRelType.ANTI; Review comment: Yes, that is taken care of. // For anti join, we should proceed to emit records if the right side is empty or not matching. if (type == JoinDesc.ANTI_JOIN && !producedRow) {
Issue Time Tracking --- Worklog Id: (was: 467708) Time Spent: 15h 20m (was: 15h 10m)
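The rule change quoted in this thread also generates an IS NOT NULL predicate on the right input for anti joins. A small plain-Java sketch of why that is safe under SQL equality semantics (invented names, not Hive code): a null build-side key can never equal anything, so filtering such rows out cannot change the anti join result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NotNullPushdownSketch {
    // Anti join under SQL equality: a left row is emitted unless its key is
    // non-null and present on the build side (NULL = x is never true in SQL).
    static List<Long> antiJoin(List<Long> left, List<Long> rightBuild) {
        Set<Long> build = new HashSet<>(rightBuild);
        List<Long> out = new ArrayList<>();
        for (Long k : left) {
            boolean matched = (k != null) && build.contains(k);
            if (!matched) {
                out.add(k);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> left = Arrays.asList(1L, null, 3L);
        List<Long> rightRaw = Arrays.asList(1L, null, 5L);

        // Simulate the IS NOT NULL predicate pushed to the right input:
        List<Long> rightFiltered = new ArrayList<>();
        for (Long k : rightRaw) {
            if (k != null) {
                rightFiltered.add(k);
            }
        }

        // Null build-side keys never match, so both runs agree.
        System.out.println(antiJoin(left, rightRaw));      // [null, 3]
        System.out.println(antiJoin(left, rightFiltered)); // [null, 3]
    }
}
```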
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467706&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467706 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 07/Aug/20 04:50 Start Date: 07/Aug/20 04:50 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r466819149 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAntiSemiJoinRule.java ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.ql.optimizer.calcite.rules; + +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptUtil; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.core.Filter; +import org.apache.calcite.rel.core.Join; +import org.apache.calcite.rel.core.JoinRelType; +import org.apache.calcite.rel.core.Project; +import org.apache.calcite.rex.RexNode; +import org.apache.calcite.sql.SqlKind; +import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil; +import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAntiJoin; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +/** + * Planner rule that converts a join plus filter to anti join. + */ +public class HiveAntiSemiJoinRule extends RelOptRule { + protected static final Logger LOG = LoggerFactory.getLogger(HiveAntiSemiJoinRule.class); + public static final HiveAntiSemiJoinRule INSTANCE = new HiveAntiSemiJoinRule(); + + //HiveProject(fld=[$0]) + // HiveFilter(condition=[IS NULL($1)]) + //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available]) + // + // TO + // + //HiveProject(fld_tbl=[$0]) + // HiveAntiJoin(condition=[=($0, $1)], joinType=[anti]) + // + public HiveAntiSemiJoinRule() { +super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))), +"HiveJoinWithFilterToAntiJoinRule:filter"); + } + + // is null filter over a left join. + public void onMatch(final RelOptRuleCall call) { +final Project project = call.rel(0); +final Filter filter = call.rel(1); +final Join join = call.rel(2); +perform(call, project, filter, join); + } + + protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) { +LOG.debug("Start Matching HiveAntiJoinRule"); + +//TODO : Need to support this scenario.
+if (join.getCondition().isAlwaysTrue()) { + return; +} + +// We support conversion from left outer join only. +if (join.getJoinType() != JoinRelType.LEFT) { + return; +} + +assert (filter != null); + +// If a null filter on the right side is not present then we cannot convert to anti join. +List aboveFilters = RelOptUtil.conjunctions(filter.getCondition()); +Stream nullFilters = aboveFilters.stream().filter(filterNode -> filterNode.getKind() == SqlKind.IS_NULL); +boolean hasNullFilter = HiveCalciteUtil.hasAnyExpressionFromRightSide(join, nullFilters.collect(Collectors.toList())); +if (!hasNullFilter) { + return; +} + +// If any projection from the right side is present, then we cannot convert to anti join. +boolean hasProjection = HiveCalciteUtil.hasAnyExpressionFromRightSide(join, project.getProjects()); +if (hasProjection) { + return; +} + +LOG.debug("Matched HiveAntiJoinRule"); + +// Build an anti join with the same left and right children and condition as the original left outer join. +Join anti = HiveAntiJoin.getAntiJoin(join.getLeft().getCluster(), join.getLeft().getTraitSet(), +join.getLeft(), join.getRight(), join.getCondition()); +RelNode newProject = project.copy(project.getTraitSet(), anti, project.getProjects(), project.getRowType()); +call.transformTo(newProject); Review comment
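The rewrite this rule performs is only legal because the two plans return the same rows. A toy demonstration of that equivalence (hypothetical names, integer keys standing in for join columns, a Set standing in for the build side):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RewriteEquivalenceSketch {
    // The original plan: LEFT OUTER JOIN, then keep rows whose right-side key is null.
    // (Simplified: multiple matches would duplicate a left row in a real join, but
    // all matched rows are discarded by the IS NULL filter anyway.)
    static List<Integer> leftJoinThenIsNullFilter(List<Integer> left, List<Integer> right) {
        Set<Integer> rightKeys = new HashSet<>(right);
        List<Integer> out = new ArrayList<>();
        for (Integer k : left) {
            Integer rightMatch = rightKeys.contains(k) ? k : null; // null-padded on no match
            if (rightMatch == null) {                              // IS NULL filter
                out.add(k);
            }
        }
        return out;
    }

    // The target plan: anti join, projecting only left-side columns.
    static List<Integer> antiJoin(List<Integer> left, List<Integer> right) {
        Set<Integer> rightKeys = new HashSet<>(right);
        List<Integer> out = new ArrayList<>();
        for (Integer k : left) {
            if (!rightKeys.contains(k)) {
                out.add(k);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> left = List.of(1, 2, 3, 4);
        List<Integer> right = List.of(2, 4, 4);
        // Both plans produce the same rows, which is what licenses the rewrite.
        System.out.println(leftJoinThenIsNullFilter(left, right).equals(antiJoin(left, right))); // true
    }
}
```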
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467704&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467704 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 07/Aug/20 04:48 Start Date: 07/Aug/20 04:48 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r466818492 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinMultiKeyOperator.java ## @@ -0,0 +1,400 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.ql.exec.vector.mapjoin; + +import org.apache.hadoop.hive.ql.CompilationOpContext; +import org.apache.hadoop.hive.ql.exec.JoinUtil; +import org.apache.hadoop.hive.ql.exec.vector.VectorSerializeRow; +import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinBytesHashSet; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.ql.plan.OperatorDesc; +import org.apache.hadoop.hive.ql.plan.VectorDesc; +import org.apache.hadoop.hive.serde2.ByteStream.Output; +import org.apache.hadoop.hive.serde2.binarysortable.fast.BinarySortableSerializeWrite; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Arrays; + +// Multi-Key hash table import. +// Multi-Key specific imports. + +// TODO: Duplicate code needs to be merged with semi join. +/* + * Specialized class for doing a vectorized map join that is an anti join on Multi-Key + * using a hash set. + */ +public class VectorMapJoinAntiJoinMultiKeyOperator extends VectorMapJoinAntiJoinGenerateResultOperator { + + private static final long serialVersionUID = 1L; + + // + + private static final String CLASS_NAME = VectorMapJoinAntiJoinMultiKeyOperator.class.getName(); + private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME); + + protected String getLoggingPrefix() { +return super.getLoggingPrefix(CLASS_NAME); + } + + // + + // (none) + + // The above members are initialized by the constructor and must not be + // transient. + //--- + + // The hash set for this specialized class. + private transient VectorMapJoinBytesHashSet hashSet; + + //--- + // Multi-Key specific members. + // + + // Object that can take a set of columns from a row in a vectorized row batch and serialize it.
+ // Known to not have any nulls. + private transient VectorSerializeRow keyVectorSerializeWrite; + + // The BinarySortable serialization of the current key. + private transient Output currentKeyOutput; + + // The BinarySortable serialization of the saved key for a possible series of equal keys. + private transient Output saveKeyOutput; + + //--- + // Pass-thru constructors. + // + + /** Kryo ctor. */ + protected VectorMapJoinAntiJoinMultiKeyOperator() { +super(); + } + + public VectorMapJoinAntiJoinMultiKeyOperator(CompilationOpContext ctx) { +super(ctx); + } + + public VectorMapJoinAntiJoinMultiKeyOperator(CompilationOpContext ctx, OperatorDesc conf, + VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException { +super(ctx, conf, vContext, vectorDesc); + } + + //--- + // Process Multi-Key Anti Join on a vectorized row batch. + // + + @Override + protected void commonSetup() thro
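The multi-key operator above serializes several key columns into one binary key before probing the bytes hash set. A simplified stand-in for that idea (Hive uses BinarySortableSerializeWrite and byte arrays; this hypothetical sketch uses a delimited String so that hashing and equality work out of the box):

```java
import java.util.HashSet;
import java.util.Set;

public class MultiKeySketch {
    // Serialize a multi-column key into a single hashable value so the build
    // side can be probed with one lookup. The delimiter keeps ("ab", "c")
    // from colliding with ("a", "bc").
    static String serializeKey(Object... cols) {
        StringBuilder sb = new StringBuilder();
        for (Object c : cols) {
            sb.append(c).append('\0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> buildSide = new HashSet<>();
        buildSide.add(serializeKey("ab", "c"));

        System.out.println(buildSide.contains(serializeKey("ab", "c"))); // true
        System.out.println(buildSide.contains(serializeKey("a", "bc"))); // false
    }
}
```

A String is used here purely for brevity; raw `byte[]` keys would need a wrapper (or a specialized hash set, as in Hive) because array `equals` is identity-based in Java.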
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467703&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467703 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 07/Aug/20 04:47 Start Date: 07/Aug/20 04:47 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r466818358 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ## @@ -0,0 +1,315 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.ql.exec.vector.mapjoin; + +import org.apache.hadoop.hive.ql.CompilationOpContext; +import org.apache.hadoop.hive.ql.exec.JoinUtil; +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector; +import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.ql.plan.OperatorDesc; +import org.apache.hadoop.hive.ql.plan.VectorDesc; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Arrays; + +// TODO: Duplicate code needs to be merged with semi join. +// Single-Column Long hash table import. +// Single-Column Long specific imports. + +/* + * Specialized class for doing a vectorized map join that is an anti join on a Single-Column Long + * using a hash set. + */ +public class VectorMapJoinAntiJoinLongOperator extends VectorMapJoinAntiJoinGenerateResultOperator { + + private static final long serialVersionUID = 1L; + private static final String CLASS_NAME = VectorMapJoinAntiJoinLongOperator.class.getName(); + private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME); + protected String getLoggingPrefix() { +return super.getLoggingPrefix(CLASS_NAME); + } + + // The above members are initialized by the constructor and must not be + // transient. + + // The hash set for this specialized class. + private transient VectorMapJoinLongHashSet hashSet; + + // Single-Column Long specific members. + // For integers, we have optional min/max filtering. + private transient boolean useMinMax; + private transient long min; + private transient long max; + + // The column number for this one column join specialization. + private transient int singleJoinColumn; + + // Pass-thru constructors.
+ /** Kryo ctor. */ + protected VectorMapJoinAntiJoinLongOperator() { +super(); + } + + public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) { +super(ctx); + } + + public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, OperatorDesc conf, + VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException { +super(ctx, conf, vContext, vectorDesc); + } + + // Process Single-Column Long Anti Join on a vectorized row batch. + @Override + protected void commonSetup() throws HiveException { +super.commonSetup(); + +// Initialize Single-Column Long members for this specialized class. +singleJoinColumn = bigTableKeyColumnMap[0]; + } + + @Override + public void hashTableSetup() throws HiveException { +super.hashTableSetup(); + +// Get our Single-Column Long hash set information for this specialized class. +hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable; +useMinMax = hashSet.useMinMax(); +if (useMinMax) { + min = hashSet.min(); + max = hashSet.max(); +} + } + + @Override + public void processBatch(VectorizedRowBatch batch) throws HiveException { + +try { + // (Currently none) + // antiPerBatchSetup(batch); + + // For anti joins, we may apply the filter(s) now. + for(VectorExpression ve : bigTableFilterExpressions) { +ve.evaluate(batch); +
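The min/max members above allow a cheap range check before probing the hash set. For an anti join the direction is the opposite of a semi join: a key outside [min, max] is certainly absent from the build side, so the row is emitted without a hash lookup at all. A hedged sketch of that short-circuit (invented names, not Hive's vectorized code path):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MinMaxAntiSketch {
    static List<Long> antiJoinWithMinMax(List<Long> probe, Set<Long> build, long min, long max) {
        List<Long> out = new ArrayList<>();
        for (long k : probe) {
            // Outside [min, max] the key cannot be in the build set, so for an
            // anti join the row is emitted without touching the hash set.
            if (k < min || k > max || !build.contains(k)) {
                out.add(k);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Long> build = new HashSet<>(List.of(10L, 20L));
        // 5 fails the range check (emitted), 10 matches (dropped), 25 fails the range check (emitted).
        System.out.println(antiJoinWithMinMax(List.of(5L, 10L, 25L), build, 10L, 20L)); // [5, 25]
    }
}
```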
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467702&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467702 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 07/Aug/20 04:46 Start Date: 07/Aug/20 04:46 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r466818194 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ##
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=466570&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466570 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 05/Aug/20 03:13 Start Date: 05/Aug/20 03:13 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r465446495 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java ## @@ -92,7 +104,7 @@ public void onMatch(RelOptRuleCall call) { Set rightPushedPredicates = Sets.newHashSet(registry.getPushedPredicates(join, 1)); boolean genPredOnLeft = join.getJoinType() == JoinRelType.RIGHT || join.getJoinType() == JoinRelType.INNER || join.isSemiJoin(); -boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || join.getJoinType() == JoinRelType.INNER || join.isSemiJoin(); +boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || join.getJoinType() == JoinRelType.INNER || join.isSemiJoin()|| join.getJoinType() == JoinRelType.ANTI; Review comment: I was referring to empty input (no rows) rather than null. Issue Time Tracking --- Worklog Id: (was: 466570) Time Spent: 14.5h (was: 14h 20m)
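The empty-input case raised in this exchange is worth pinning down: with an empty right side nothing can match, so an anti join emits every left row. A tiny sketch (illustrative names, not Hive code):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EmptyBuildSketch {
    static List<Long> antiJoin(List<Long> probe, Set<Long> build) {
        List<Long> out = new ArrayList<>();
        for (Long k : probe) {
            if (!build.contains(k)) {
                out.add(k);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Empty build side: no key can match, so every probe row survives.
        System.out.println(antiJoin(List.of(1L, 2L, 3L), new HashSet<>())); // [1, 2, 3]
    }
}
```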
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=466569&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466569 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 05/Aug/20 03:12 Start Date: 05/Aug/20 03:12 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r465446298 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAntiSemiJoinRule.java ##
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=466425&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466425 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 04/Aug/20 20:02 Start Date: 04/Aug/20 20:02 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r465297563 ## File path: ql/src/test/queries/clientpositive/subquery_in_having.q ## @@ -140,6 +140,22 @@ CREATE TABLE src_null_n4 (key STRING COMMENT 'default', value STRING COMMENT 'de LOAD DATA LOCAL INPATH "../../data/files/kv1.txt" INTO TABLE src_null_n4; INSERT INTO src_null_n4 values('5444', null); +explain +select key, value, count(*) Review comment: By default anti join conversion is set to true. I have added few test cases with anti join set to false. ## File path: ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java ## @@ -203,6 +203,7 @@ private boolean filterExists(ReduceSinkOperator target, ExprNodeDesc replaced) { vector.add(right, left); break; case JoinDesc.LEFT_OUTER_JOIN: +case JoinDesc.ANTI_JOIN: //TODO : need to test Review comment: removed the comment. 
## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinConstraintsRule.java ## @@ -183,6 +189,7 @@ public void onMatch(RelOptRuleCall call) { switch (joinType) { case SEMI: case INNER: +case ANTI: Review comment: done ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java ## @@ -92,7 +104,7 @@ public void onMatch(RelOptRuleCall call) { Set rightPushedPredicates = Sets.newHashSet(registry.getPushedPredicates(join, 1)); boolean genPredOnLeft = join.getJoinType() == JoinRelType.RIGHT || join.getJoinType() == JoinRelType.INNER || join.isSemiJoin(); -boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || join.getJoinType() == JoinRelType.INNER || join.isSemiJoin(); +boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || join.getJoinType() == JoinRelType.INNER || join.isSemiJoin()|| join.getJoinType() == JoinRelType.ANTI; Review comment: Yes ..if right side is null then it emits all the right side records ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAntiSemiJoinRule.java ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=465943&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465943 ]

ASF GitHub Bot logged work on HIVE-23716:

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:03
Start Date: 03/Aug/20 23:03
Worklog Time Spent: 10m

Work Description: jcamachor commented on pull request #1147:
URL: https://github.com/apache/hive/pull/1147#issuecomment-668281996

@maheshk114, thanks for addressing the first batch of comments. The PR looks better. I have done a second pass and left some additional comments that should be addressed before merging. Please also merge master into your branch, since there seem to be some conflicts.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 465943)
Time Spent: 14h (was: 13h 50m)

> Support Anti Join in Hive
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
> Issue Type: Bug
> Reporter: mahesh kumar behera
> Assignee: mahesh kumar behera
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
> Time Spent: 14h
> Remaining Estimate: 0h
>
> Currently Hive does not support anti join. The query for anti join is
> converted to a left outer join, and a null filter on the right-side join key is
> added to get the desired result. This causes:
> # Extra computation — The left outer join projects the redundant columns
> from the right side. Along with that, filtering is done to remove the redundant
> rows. This can be avoided in case of anti join, as anti join will project
> only the required columns and rows from the left-side table.
> # Extra shuffle — In case of anti join, the duplicate records moved to the join
> node can be avoided at the child node. This can reduce a significant amount
> of data movement if the number of distinct rows (join keys) is significant.
> # Extra Memory Usage - In case of map-based anti join, a hash set is
> sufficient, as just the key is required to check whether a record matches the
> join condition. In case of left join, we need the key and the non-key columns
> as well, and thus a hash table is required.
> For a query like
> {code:java}
> select wr_order_number FROM web_returns LEFT JOIN web_sales ON
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number values in the web_sales table in a typical 10TB
> TPCDS setup is just 10% of the total records. So when we convert this query to
> anti join, instead of 7 billion rows, only 600 million rows are moved to the join
> node.
> In the current patch, just one conversion is done. The pattern
> project->filter->left-join is converted to project->anti-join. This takes
> care of subqueries with a “not exists” clause. Queries with “not exists”
> are converted first to filter + left-join and then to anti join.
> Queries with “not in” are not handled in the current patch.
> From the execution side, both merge join and map join with vectorized execution
> are supported for anti join.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=465942&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465942 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 03/Aug/20 23:02 Start Date: 03/Aug/20 23:02 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r464673502 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptUtil.java ## @@ -747,6 +747,8 @@ public static RewritablePKFKJoinInfo isRewritablePKFKJoin(Join join, final RelNode nonFkInput = leftInputPotentialFK ? join.getRight() : join.getLeft(); final RewritablePKFKJoinInfo nonRewritable = RewritablePKFKJoinInfo.of(false, null); +// TODO : Need to handle Anti join. Review comment: Thanks for creating HIVE-23906. Can we simply return `nonRewritable` if it is an anti-join for the time being, rather than proceeding? This certainly requires a bit of extra thinking and specific tests to make sure it is working as expected (for which we already have HIVE-23906). ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinConstraintsRule.java ## @@ -183,6 +189,7 @@ public void onMatch(RelOptRuleCall call) { switch (joinType) { case SEMI: case INNER: +case ANTI: Review comment: This should be removed to avoid confusion, since we bail out above. ## File path: ql/src/test/queries/clientpositive/subquery_in_having.q ## @@ -140,6 +140,22 @@ CREATE TABLE src_null_n4 (key STRING COMMENT 'default', value STRING COMMENT 'de LOAD DATA LOCAL INPATH "../../data/files/kv1.txt" INTO TABLE src_null_n4; INSERT INTO src_null_n4 values('5444', null); +explain +select key, value, count(*) Review comment: Should we execute this query with conversion=true? 
## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java ## @@ -1233,4 +1233,21 @@ public FixNullabilityShuttle(RexBuilder rexBuilder, } } + // Checks if any of the expression given as list expressions are from right side of the join. Review comment: nit. Change comment to javadoc ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAntiSemiJoinRule.java ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463446&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463446 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 27/Jul/20 01:41 Start Date: 27/Jul/20 01:41 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460606016 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java ## @@ -0,0 +1,218 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.ql.exec.vector.mapjoin; + +import org.apache.hadoop.hive.ql.CompilationOpContext; +import org.apache.hadoop.hive.ql.exec.JoinUtil; +import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.ql.plan.OperatorDesc; +import org.apache.hadoop.hive.ql.plan.VectorDesc; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; + +// TODO : This class is duplicate of semi join. Need to do a refactoring to merge it with semi join. +/** + * This class has methods for generating vectorized join results for Anti joins. + * The big difference between inner joins and anti joins is existence testing. + * Inner joins use a hash map to lookup the 1 or more small table values. + * Anti joins are a specialized join for outputting big table rows whose key exists Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 463446)
Time Spent: 13h 40m (was: 13.5h)
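The existence-testing idea from the VectorMapJoinAntiJoinGenerateResultOperator javadoc above can be sketched in a few lines. This is a simplified, row-at-a-time illustration under made-up names (AntiJoinSketch, antiJoin), not Hive's vectorized code: only the small-table keys are kept, in a hash set, because an anti join forwards exactly the big-table rows whose key is absent.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of hash-set-based anti join: only the small-table KEYS
// are stored (a Set), unlike a left outer join, which needs a key -> values map.
public class AntiJoinSketch {
    // Emit each big-table key that has NO match in the small-table key set.
    public static List<Long> antiJoin(List<Long> bigTableKeys, Set<Long> smallTableKeys) {
        List<Long> result = new ArrayList<>();
        for (Long key : bigTableKeys) {
            if (!smallTableKeys.contains(key)) { // existence test only
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Long> small = new HashSet<>(List.of(2L, 4L));
        // 1 and 3 have no match in the small table, so they survive the anti join.
        System.out.println(antiJoin(List.of(1L, 2L, 3L, 4L), small)); // prints [1, 3]
    }
}
```

A left outer join with the same inputs would instead need a key-to-values hash map, since it must also project the right-side columns; this is the memory saving the issue description refers to.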
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463444&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463444 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 27/Jul/20 01:40 Start Date: 27/Jul/20 01:40 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460605836 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java ## @@ -0,0 +1,218 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+
+// TODO : This class is duplicate of semi join. Need to do a refactoring to merge it with semi join.
+/**
+ * This class has methods for generating vectorized join results for Anti joins.
+ * The big difference between inner joins and anti joins is existence testing.
+ * Inner joins use a hash map to lookup the 1 or more small table values.
+ * Anti joins are a specialized join for outputting big table rows whose key exists
+ * in the small table.
+ *
+ * No small table values are needed for anti since they would be empty. So,
+ * we use a hash set as the hash table. Hash sets just report whether a key exists. This
+ * is a big performance optimization.
+ */
+public abstract class VectorMapJoinAntiJoinGenerateResultOperator
+    extends VectorMapJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final Logger LOG = LoggerFactory.getLogger(VectorMapJoinAntiJoinGenerateResultOperator.class.getName());
+
+  // Anti join specific members.
+
+  // An array of hash set results so we can do lookups on the whole batch before output result
+  // generation.
+  protected transient VectorMapJoinHashSetResult hashSetResults[];
+
+  // Pre-allocated member for storing the (physical) batch index of matching row (single- or
+  // multi-small-table-valued) indexes during a process call.
+  protected transient int[] allMatchs;
+
+  // Pre-allocated member for storing the (physical) batch index of rows that need to be spilled.
+  protected transient int[] spills;
+
+  // Pre-allocated member for storing index into the hashSetResults for each spilled row.
+  protected transient int[] spillHashMapResultIndices;
+
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinGenerateResultOperator() {
+    super();
+  }
+
+  public VectorMapJoinAntiJoinGenerateResultOperator(CompilationOpContext ctx) {
+    super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinGenerateResultOperator(CompilationOpContext ctx, OperatorDesc conf,
+      VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException {
+    super(ctx, conf, vContext, vectorDesc);
+  }
+
+  /*
+   * Setup our anti join specific members.
+   */
+  protected void commonSetup() throws HiveException {
+    super.commonSetup();
+
+    // Anti join specific.
+    VectorMapJoinHashSet baseHashSet = (VectorMapJoinHashSet) vectorMapJoinHashTable;
+
+    hashSetResults = new VectorMapJoinHashSetResult[VectorizedRowBatch.DEFAULT_SIZE];
+    for (int i = 0; i < hashS
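The pre-allocated arrays above reflect a two-phase batch pattern: probe the hash set for every row in the batch first, then generate the output. A simplified scalar sketch under made-up names (BatchAntiProbe, probeBatch), ignoring spilling and hash-table result objects:

```java
import java.util.Arrays;
import java.util.Set;

// Sketch of batched existence testing: probe the hash set for the whole
// batch first, then emit results, mirroring the pre-allocated index arrays
// in the operator above (all names here are illustrative).
public class BatchAntiProbe {
    // Returns the batch indices of rows that did NOT match (the anti join output).
    public static int[] probeBatch(long[] batchKeys, Set<Long> smallTableKeys) {
        int[] nonMatchIdx = new int[batchKeys.length]; // pre-allocated, like allMatchs
        int count = 0;
        // Phase 1: lookups over the whole batch, recording surviving indices.
        for (int i = 0; i < batchKeys.length; i++) {
            if (!smallTableKeys.contains(batchKeys[i])) {
                nonMatchIdx[count++] = i;
            }
        }
        // Phase 2: only the first `count` entries are valid output indices.
        return Arrays.copyOf(nonMatchIdx, count);
    }
}
```

Separating lookup from output generation keeps the inner probe loop tight, which is the point of the pre-allocated members in the real vectorized operator.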
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463443&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463443 ]

ASF GitHub Bot logged work on HIVE-23716:

Author: ASF GitHub Bot
Created on: 27/Jul/20 01:40
Start Date: 27/Jul/20 01:40
Worklog Time Spent: 10m

Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460605730

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
## @@ -523,11 +533,19 @@ private boolean createForwardJoinObject(boolean[] skip) throws HiveException {
       forward = true;
     }
   }
+  return forward;
+ }
+
+ // returns whether a record was forwarded
+ private boolean createForwardJoinObject(boolean[] skip, boolean antiJoin) throws HiveException {
+  boolean forward = fillFwdCache(skip);
   if (forward) {
     if (needsPostEvaluation) {
       forward = !JoinUtil.isFiltered(forwardCache, residualJoinFilters, residualJoinFiltersOIs);
     }
-  if (forward) {
+
+  // For anti join, check all right side and if nothing is matched then only forward.

Review comment: done

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 463443)
Time Spent: 13h 20m (was: 13h 10m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463442&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463442 ]

ASF GitHub Bot logged work on HIVE-23716:

Author: ASF GitHub Bot
Created on: 27/Jul/20 01:39
Start Date: 27/Jul/20 01:39
Worklog Time Spent: 10m

Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460605498

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
## @@ -638,6 +657,12 @@ private void genObject(int aliasNum, boolean allLeftFirst, boolean allLeftNull)
       // skipping the rest of the rows in the rhs table of the semijoin
       done = !needsPostEvaluation;
     }
+  } else if (type == JoinDesc.ANTI_JOIN) {
+    if (innerJoin(skip, left, right)) {
+      // if anti join found a match then the condition is not matched for anti join, so we can skip rest of the

Review comment: done

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 463442)
Time Spent: 13h 10m (was: 13h)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463351&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463351 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 26/Jul/20 12:38 Start Date: 26/Jul/20 12:38 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460522562

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ## @@ -0,0 +1,315 @@

{code:java}
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.hive.ql.exec.vector.mapjoin;

import org.apache.hadoop.hive.ql.CompilationOpContext;
import org.apache.hadoop.hive.ql.exec.JoinUtil;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.plan.OperatorDesc;
import org.apache.hadoop.hive.ql.plan.VectorDesc;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Arrays;

// TODO: Duplicate code; needs to be merged with semi join.

/*
 * Specialized class for doing a vectorized map join that is an anti join on a
 * single-column long key using a hash set.
 */
public class VectorMapJoinAntiJoinLongOperator extends VectorMapJoinAntiJoinGenerateResultOperator {

  private static final long serialVersionUID = 1L;
  private static final String CLASS_NAME = VectorMapJoinAntiJoinLongOperator.class.getName();
  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);

  protected String getLoggingPrefix() {
    return super.getLoggingPrefix(CLASS_NAME);
  }

  // The above members are initialized by the constructor and must not be
  // transient.

  // The hash set for this specialized class.
  private transient VectorMapJoinLongHashSet hashSet;

  // Single-column long specific members.
  // For integers, we have optional min/max filtering.
  private transient boolean useMinMax;
  private transient long min;
  private transient long max;

  // The column number for this one-column join specialization.
  private transient int singleJoinColumn;

  // Pass-through constructors.

  /** Kryo ctor. */
  protected VectorMapJoinAntiJoinLongOperator() {
    super();
  }

  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
    super(ctx);
  }

  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, OperatorDesc conf,
      VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException {
    super(ctx, conf, vContext, vectorDesc);
  }

  // Process a single-column long anti join on a vectorized row batch.
  @Override
  protected void commonSetup() throws HiveException {
    super.commonSetup();

    // Initialize single-column long members for this specialized class.
    singleJoinColumn = bigTableKeyColumnMap[0];
  }

  @Override
  public void hashTableSetup() throws HiveException {
    super.hashTableSetup();

    // Get our single-column long hash set information for this specialized class.
    hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
    useMinMax = hashSet.useMinMax();
    if (useMinMax) {
      min = hashSet.min();
      max = hashSet.max();
    }
  }

  @Override
  public void processBatch(VectorizedRowBatch batch) throws HiveException {

    try {
      // (Currently none)
      // antiPerBatchSetup(batch);

      // For anti joins, we may apply the filter(s) now.
      for (VectorExpression ve : bigTableFilterExpressions) {
        ve.evaluate(batch);
{code}
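The min/max values read in `hashTableSetup` above enable a short-circuit: an integer key outside the build side's `[min, max]` range cannot be present in the hash set, so for an anti join the row can be emitted without a hash probe. A simplified, hypothetical model of that check (class and method names are illustrative, not the actual Hive API):

```java
import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for the long-key hash-set probe with min/max
// short-circuiting. The buildKeys field plays the role of Hive's
// VectorMapJoinLongHashSet; everything here is an illustrative sketch.
public class LongAntiJoinProbe {
    private final Set<Long> buildKeys;
    private final long min;
    private final long max;

    public LongAntiJoinProbe(Set<Long> buildKeys) {
        this.buildKeys = buildKeys;
        // Compute the key range once at build time, as hashTableSetup does.
        long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
        for (long k : buildKeys) {
            lo = Math.min(lo, k);
            hi = Math.max(hi, k);
        }
        this.min = lo;
        this.max = hi;
    }

    // An anti join emits the row only when the key is absent on the build side.
    public boolean emitRow(long key) {
        if (key < min || key > max) {
            return true;              // definitely absent: skip the hash probe
        }
        return !buildKeys.contains(key);
    }
}
```

Note that the out-of-range branch flips compared to a semi join: out-of-range keys are guaranteed matches for the anti join's "no match" condition, not guaranteed rejections.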
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463350&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463350 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 26/Jul/20 12:37 Start Date: 26/Jul/20 12:37 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460522454 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ## @@ -0,0 +1,315 @@
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463348&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463348 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 26/Jul/20 12:36 Start Date: 26/Jul/20 12:36 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460522261

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinStringOperator.java ## @@ -0,0 +1,371 @@

{code:java}
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.hive.ql.exec.vector.mapjoin;

import org.apache.hadoop.hive.ql.CompilationOpContext;
import org.apache.hadoop.hive.ql.exec.JoinUtil;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.exec.vector.expressions.StringExpr;
import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinBytesHashSet;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.plan.OperatorDesc;
import org.apache.hadoop.hive.ql.plan.VectorDesc;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Arrays;

// TODO: Duplicate code; needs to be merged with semi join.

/*
 * Specialized class for doing a vectorized map join that is an anti join on a
 * single-column string key using a hash set.
 */
public class VectorMapJoinAntiJoinStringOperator extends VectorMapJoinAntiJoinGenerateResultOperator {

  private static final long serialVersionUID = 1L;

  private static final String CLASS_NAME = VectorMapJoinAntiJoinStringOperator.class.getName();
  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);

  protected String getLoggingPrefix() {
    return super.getLoggingPrefix(CLASS_NAME);
  }

  // (none)

  // The above members are initialized by the constructor and must not be
  // transient.
  //---------------------------------------------------------------------------

  // The hash set for this specialized class.
  private transient VectorMapJoinBytesHashSet hashSet;

  //---------------------------------------------------------------------------
  // Single-column string specific members.

  // The column number for this one-column join specialization.
  private transient int singleJoinColumn;

  //---------------------------------------------------------------------------
  // Pass-through constructors.

  /** Kryo ctor. */
  protected VectorMapJoinAntiJoinStringOperator() {
    super();
  }

  public VectorMapJoinAntiJoinStringOperator(CompilationOpContext ctx) {
    super(ctx);
  }

  public VectorMapJoinAntiJoinStringOperator(CompilationOpContext ctx, OperatorDesc conf,
      VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException {
    super(ctx, conf, vContext, vectorDesc);
  }

  //---------------------------------------------------------------------------
  // Process a single-column string anti join on a vectorized row batch.

  @Override
  protected void commonSetup() throws HiveException {
    super.commonSetup();

    /*
     * Initialize single-column string members for this specialized class.
     */
    singleJoinColumn = bigTableKeyColumnMap[0];
  }

  @Override
  public void hashTableSetup() throws HiveException {
    super.hashTableSetup();

    /*
     * Get our single-column string hash set information for
{code}
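Unlike the long variant, the string operator has no min/max short-circuit; every key is probed in a bytes hash set. A minimal stand-in for that probe, with illustrative names (ByteBuffer is used here only to get value-based equality over key bytes; Hive's VectorMapJoinBytesHashSet operates on serialized bytes directly):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Simplified model of a bytes hash-set anti-join probe: the build side
// stores key bytes, and a left row is emitted only when its key is absent.
public class StringAntiJoinProbe {
    // ByteBuffer gives content-based equals/hashCode over the wrapped bytes.
    private final Set<ByteBuffer> buildKeys = new HashSet<>();

    public void addBuildKey(String key) {
        buildKeys.add(ByteBuffer.wrap(key.getBytes(StandardCharsets.UTF_8)));
    }

    // Anti-join condition: emit when there is NO match on the build side.
    public boolean emitRow(String key) {
        return !buildKeys.contains(
            ByteBuffer.wrap(key.getBytes(StandardCharsets.UTF_8)));
    }
}
```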
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463349&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463349 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 26/Jul/20 12:36 Start Date: 26/Jul/20 12:36 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460522312 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ## @@ -0,0 +1,315 @@
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463346&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463346 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 26/Jul/20 12:27 Start Date: 26/Jul/20 12:27 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460521384

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java ## @@ -509,11 +513,17 @@ protected void addToAliasFilterTags(byte alias, List object, boolean isN } }

{code:java}
+  private void createForwardJoinObjectForAntiJoin(boolean[] skip) throws HiveException {
+    boolean forward = fillFwdCache(skip);
{code}

Review comment: done

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking --- Worklog Id: (was: 463346) Time Spent: 12h 20m (was: 12h 10m)

> Support Anti Join in Hive
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
> Issue Type: Bug
> Reporter: mahesh kumar behera
> Assignee: mahesh kumar behera
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
> Time Spent: 12h 20m
> Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that needs anti-join semantics is instead converted to a left outer join with a null filter on the right-side join key. This causes:
> # Extra computation - the left outer join projects redundant columns from the right side, and additional filtering is needed to remove the redundant rows. An anti join avoids both, since it projects only the required columns and rows from the left-side table.
> # Extra shuffle - with an anti join, duplicate records can be dropped at the child node instead of being moved to the join node. This can significantly reduce data movement when the number of distinct join keys is small relative to the total row count.
> # Extra memory usage - for a map-based anti join, a hash set is sufficient, because only the key is needed to check whether a record matches the join condition. A left join also needs the non-key columns, so a full hash table is required.
>
> For a query like
> {code:java}
> select wr_order_number FROM web_returns LEFT JOIN web_sales ON
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in the web_sales table of a typical 10TB TPC-DS setup is just 10% of the total records. When this query is converted to an anti join, only 600 million rows are moved to the join node instead of 7 billion.
>
> In the current patch, just one conversion is done: the pattern project->filter->left-join is converted to project->anti-join. This takes care of subqueries with a "not exists" clause, which are first converted to filter + left join and then to anti join. Queries with "not in" are not handled in the current patch.
>
> On the execution side, both merge join and map join with vectorized execution are supported for anti join.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
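The rewrite the issue describes can be illustrated with a small, self-contained sketch (illustrative names; this models only the semantics, not Hive's operators). Both paths return the left-side keys with no match on the right, but the left-outer-join path must build a key-to-rows table and then filter, while the anti-join path needs only a set of distinct right-side keys:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AntiJoinRewrite {

    // Left outer join + "right key IS NULL" filter: the build side keeps
    // full right-side rows (modeled here as a key -> rows map).
    static List<Long> viaLeftOuterPlusFilter(List<Long> leftKeys, List<Long> rightKeys) {
        Map<Long, List<Long>> buildTable = new HashMap<>();
        for (long k : rightKeys) {
            buildTable.computeIfAbsent(k, x -> new ArrayList<>()).add(k);
        }
        List<Long> out = new ArrayList<>();
        for (long k : leftKeys) {
            // Non-matching left rows come out with a NULL right key ...
            if (buildTable.get(k) == null) {
                out.add(k);           // ... and survive the IS NULL filter.
            }
        }
        return out;
    }

    // Direct anti join: a hash set of distinct right-side keys is enough.
    static List<Long> viaAntiJoin(List<Long> leftKeys, List<Long> rightKeys) {
        Set<Long> buildKeys = new HashSet<>(rightKeys);
        List<Long> out = new ArrayList<>();
        for (long k : leftKeys) {
            if (!buildKeys.contains(k)) {
                out.add(k);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> left = Arrays.asList(1L, 2L, 2L, 3L);
        List<Long> right = Arrays.asList(2L, 4L);
        // Same result either way; the anti join stores only distinct keys.
        System.out.println(viaAntiJoin(left, right));
    }
}
```

This is the memory point from the description in miniature: `viaAntiJoin` stores one entry per distinct right-side key, while `viaLeftOuterPlusFilter` stores every right-side row.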
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463345&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463345 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 26/Jul/20 12:26 Start Date: 26/Jul/20 12:26 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460521246 ## File path: parser/src/java/org/apache/hadoop/hive/ql/parse/FromClauseParser.g ## @@ -145,6 +145,7 @@ joinToken | KW_RIGHT (KW_OUTER)? KW_JOIN -> TOK_RIGHTOUTERJOIN | KW_FULL (KW_OUTER)? KW_JOIN -> TOK_FULLOUTERJOIN | KW_LEFT KW_SEMI KW_JOIN -> TOK_LEFTSEMIJOIN +| KW_ANTI KW_JOIN -> TOK_ANTIJOIN Review comment: done Issue Time Tracking --- Worklog Id: (was: 463345) Time Spent: 12h 10m (was: 12h) > Support Anti Join in Hive > Key: HIVE-23716 > Time Spent: 12h 10m > Remaining Estimate: 0h
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463344&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463344 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 26/Jul/20 12:24 Start Date: 26/Jul/20 12:24 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460521056 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java ## @@ -153,6 +153,8 @@ transient boolean hasLeftSemiJoin = false; + transient boolean hasAntiJoin = false; Review comment: done Issue Time Tracking --- Worklog Id: (was: 463344) Time Spent: 12h (was: 11h 50m) > Support Anti Join in Hive > Key: HIVE-23716 > Time Spent: 12h > Remaining Estimate: 0h
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463343&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463343 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 26/Jul/20 12:23 Start Date: 26/Jul/20 12:23 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460520930 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelFactories.java ## @@ -188,6 +193,20 @@ public RelNode createSemiJoin(RelNode left, RelNode right, } } + /** + * Implementation of {@link AntiJoinFactory} that returns + * {@link org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAntiJoin} + * . + */ + private static class HiveAntiJoinFactoryImpl implements SemiJoinFactory { Review comment: HiveAntiJoinFactoryImpl is removed Issue Time Tracking --- Worklog Id: (was: 463343) Time Spent: 11h 50m (was: 11h 40m) > Support Anti Join in Hive > Key: HIVE-23716 > Time Spent: 11h 50m > Remaining Estimate: 0h
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463342&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463342 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 26/Jul/20 12:23 Start Date: 26/Jul/20 12:23 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460520903 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptMaterializationValidator.java ## @@ -253,6 +256,14 @@ private RelNode visit(HiveSemiJoin semiJoin) { return visitChildren(semiJoin); } + // Note: Not currently part of the HiveRelNode interface + private RelNode visit(HiveAntiJoin antiJoin) { Review comment: Not sure ..copy pasted from semi join. Issue Time Tracking --- Worklog Id: (was: 463342) Time Spent: 11h 40m (was: 11.5h) > Support Anti Join in Hive > Key: HIVE-23716 > Time Spent: 11h 40m > Remaining Estimate: 0h
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463341&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463341 ]

ASF GitHub Bot logged work on HIVE-23716:
Author: ASF GitHub Bot
Created on: 26/Jul/20 12:20
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460520647

File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveSubQRemoveRelBuilder.java

@@ -1112,7 +1112,7 @@ public RexNode field(RexNode e, String name) {
   }

   public HiveSubQRemoveRelBuilder join(JoinRelType joinType, RexNode condition,
-      Set variablesSet, boolean createSemiJoin) {
+      Set variablesSet, JoinRelType semiJoinType) {

Review comment: done

Worklog Id: 463341; Time Spent: 11.5h (was: 11h 20m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463340&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463340 ]

ASF GitHub Bot logged work on HIVE-23716:
Author: ASF GitHub Bot
Created on: 26/Jul/20 12:10
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460519523

File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java

@@ -56,6 +57,9 @@
   public static final HiveJoinAddNotNullRule INSTANCE_SEMIJOIN =
       new HiveJoinAddNotNullRule(HiveSemiJoin.class, HiveRelFactories.HIVE_FILTER_FACTORY);

+  public static final HiveJoinAddNotNullRule INSTANCE_ANTIJOIN =
+      new HiveJoinAddNotNullRule(HiveAntiJoin.class, HiveRelFactories.HIVE_FILTER_FACTORY);

Review comment: done

Worklog Id: 463340; Time Spent: 11h 20m (was: 11h 10m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463339&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463339 ]

ASF GitHub Bot logged work on HIVE-23716:
Author: ASF GitHub Bot
Created on: 26/Jul/20 12:05
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460519111

File path: ql/src/java/org/apache/hadoop/hive/ql/plan/VectorMapJoinDesc.java

@@ -89,7 +89,8 @@ public PrimitiveTypeInfo getPrimitiveTypeInfo() {
     INNER_BIG_ONLY,
     LEFT_SEMI,
     OUTER,
-    FULL_OUTER
+    FULL_OUTER,
+    ANTI

Review comment: LEFT_ANTI

Worklog Id: 463339; Time Spent: 11h 10m (was: 11h)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463338&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463338 ]

ASF GitHub Bot logged work on HIVE-23716:
Author: ASF GitHub Bot
Created on: 26/Jul/20 12:04
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518974

File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java (new file)

@@ -0,0 +1,149 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.hive.ql.optimizer.calcite.rules;

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.Filter;
import org.apache.calcite.rel.core.Join;
import org.apache.calcite.rel.core.JoinRelType;
import org.apache.calcite.rel.core.Project;
import org.apache.calcite.rel.type.RelDataTypeField;
import org.apache.calcite.rex.RexInputRef;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.sql.SqlKind;
import org.apache.calcite.util.ImmutableBitSet;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.List;

/**
 * Planner rule that converts a join plus filter to anti join.
 */
public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {

Review comment: done

Worklog Id: 463338; Time Spent: 11h (was: 10h 50m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463337&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463337 ]

ASF GitHub Bot logged work on HIVE-23716:
Author: ASF GitHub Bot
Created on: 26/Jul/20 12:03
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518799

File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java (new file)

@@ -0,0 +1,149 @@
/* Apache license header, as in the previous excerpt */
package org.apache.hadoop.hive.ql.optimizer.calcite.rules;

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.Filter;
import org.apache.calcite.rel.core.Join;
import org.apache.calcite.rel.core.JoinRelType;
import org.apache.calcite.rel.core.Project;
import org.apache.calcite.rel.type.RelDataTypeField;
import org.apache.calcite.rex.RexInputRef;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.sql.SqlKind;
import org.apache.calcite.util.ImmutableBitSet;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.List;

/**
 * Planner rule that converts a join plus filter to anti join.
 */
public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {
  protected static final Logger LOG = LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class);
  public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new HiveJoinWithFilterToAntiJoinRule();

  // HiveProject(fld=[$0])
  //   HiveFilter(condition=[IS NULL($1)])
  //     HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available])
  //
  // TO
  //
  // HiveProject(fld_tbl=[$0])
  //   HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
  //
  public HiveJoinWithFilterToAntiJoinRule() {
    super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
        "HiveJoinWithFilterToAntiJoinRule:filter");
  }

  // is null filter over a left join.
  public void onMatch(final RelOptRuleCall call) {
    final Project project = call.rel(0);
    final Filter filter = call.rel(1);
    final Join join = call.rel(2);
    perform(call, project, filter, join);
  }

  protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) {
    LOG.debug("Matched HiveAntiJoinRule");

    if (join.getCondition().isAlwaysTrue()) {
      return;
    }

    // We support conversion from left outer join only.
    if (join.getJoinType() != JoinRelType.LEFT) {
      return;
    }

    assert (filter != null);

    List<RexNode> aboveFilters = RelOptUtil.conjunctions(filter.getCondition());
    boolean hasIsNull = false;

    // Get all filter conditions and check if any of them is of "is null" kind.
    for (RexNode filterNode : aboveFilters) {
      if (filterNode.getKind() == SqlKind.IS_NULL &&
          isFilterFromRightSide(join, filterNode, join.getJoinType())) {
        hasIsNull = true;
        break;
      }
    }

    // Is null should be on a key from the right side of the join.
    if (!hasIsNull) {
      return;
    }

    // Build anti join with the same left and right children and condition as the original left outer join.
    Join anti = join.copy(join.getTraitSet(), join.getCondition(),
        join.getLeft(), join.getRight(), JoinRelType.ANTI, false);

    // TODO : Do we really need it
    call.getPlanner().onCopy(join, anti);

    RelNode newProject = getNewProjectNode(project, anti);
    if (newProject != null) {
      call.getPlanner().onCopy(project, newProject);

Review comment: done

Worklog Id: 463337; Time Spent: 10h 50m (was: 10h 40m)
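The structural test the rule performs can be sketched outside of Calcite as a small, dependency-free check. The class and method below are hypothetical models for illustration, not Hive code: a filter conjunct qualifies when it is IS NULL over a column whose field index falls in the right input's range (indexes at or above the left input's field count), and only a LEFT outer join is eligible.

```java
import java.util.Arrays;
import java.util.List;

public class AntiJoinMatchSketch {

    // Minimal stand-in for a RexNode conjunct: its SqlKind and the field it references.
    static final class Conjunct {
        final String kind;
        final int fieldIndex;
        Conjunct(String kind, int fieldIndex) {
            this.kind = kind;
            this.fieldIndex = fieldIndex;
        }
    }

    // Does a filter over a join qualify for anti-join conversion?
    static boolean qualifiesForAntiJoin(String joinType, int leftFieldCount, List<Conjunct> conjuncts) {
        if (!"LEFT".equals(joinType)) {
            return false; // the rule only rewrites left outer joins
        }
        for (Conjunct c : conjuncts) {
            // Field indexes >= leftFieldCount refer to the join's right input.
            if ("IS_NULL".equals(c.kind) && c.fieldIndex >= leftFieldCount) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Mirrors HiveFilter(IS NULL($1)) over HiveJoin(joinType=[left]) with one left field.
        System.out.println(qualifiesForAntiJoin("LEFT", 1,
                Arrays.asList(new Conjunct("IS_NULL", 1)))); // true
        // An inner join never qualifies.
        System.out.println(qualifiesForAntiJoin("INNER", 1,
                Arrays.asList(new Conjunct("IS_NULL", 1)))); // false
    }
}
```

The real rule additionally verifies the column via `isFilterFromRightSide` against the join's row type and rebuilds the project on top of the new `HiveAntiJoin`.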
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463335&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463335 ]

ASF GitHub Bot logged work on HIVE-23716:
Author: ASF GitHub Bot
Created on: 26/Jul/20 11:59
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518411

File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdRowCount.java

@@ -118,6 +119,15 @@ public Double getRowCount(HiveJoin join, RelMetadataQuery mq) {
   }

   public Double getRowCount(HiveSemiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  public Double getRowCount(HiveAntiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  private Double getRowCountInt(Join rel, RelMetadataQuery mq) {

Review comment: super.getRowCount(rel, mq) does not support Anti join. I think we need to handle it. https://issues.apache.org/jira/browse/HIVE-23933

Worklog Id: 463335; Time Spent: 10.5h (was: 10h 20m)
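The review thread above concerns estimating output rows for the new anti-join node. One plausible estimate, shown here as an illustrative sketch only (this is an assumption, not the formula from HiveRelMdRowCount or HIVE-23933), follows from the semantics: an anti join emits exactly the left rows that a semi join with the same condition would drop.

```java
// Illustrative anti-join cardinality estimate, assuming the semi-join
// selectivity (the fraction of left rows with at least one right-side match)
// is already known. NOT Hive's actual implementation.
public class AntiJoinRowCountSketch {

    // Anti join keeps the left rows that find no match on the right.
    static double antiJoinRows(double leftRows, double semiJoinSelectivity) {
        // Clamp to at least one row, a common guard in cardinality estimators.
        return Math.max(1.0, leftRows * (1.0 - semiJoinSelectivity));
    }

    public static void main(String[] args) {
        // If half the left rows match, the anti join emits the other half.
        System.out.println(antiJoinRows(1000.0, 0.5)); // 500.0
    }
}
```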
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463336&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463336 ]

ASF GitHub Bot logged work on HIVE-23716:
Author: ASF GitHub Bot
Created on: 26/Jul/20 11:59
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518454

File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java

@@ -79,6 +80,11 @@ public Double getDistinctRowCount(HiveSemiJoin rel, RelMetadataQuery mq, Immutab
     return super.getDistinctRowCount(rel, mq, groupKey, predicate);
   }

+  public Double getDistinctRowCount(HiveAntiJoin rel, RelMetadataQuery mq, ImmutableBitSet groupKey,
+      RexNode predicate) {
+    return super.getDistinctRowCount(rel, mq, groupKey, predicate);

Review comment: https://issues.apache.org/jira/browse/HIVE-23933

Worklog Id: 463336; Time Spent: 10h 40m (was: 10.5h)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463334&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463334 ]

ASF GitHub Bot logged work on HIVE-23716:
Author: ASF GitHub Bot
Created on: 26/Jul/20 11:34
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460515695

File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdRowCount.java

@@ -118,6 +119,15 @@ public Double getRowCount(HiveJoin join, RelMetadataQuery mq) {
   }

   public Double getRowCount(HiveSemiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  public Double getRowCount(HiveAntiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  private Double getRowCountInt(Join rel, RelMetadataQuery mq) {

Review comment: Yes done.

Worklog Id: 463334; Time Spent: 10h 20m (was: 10h 10m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=46&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-46 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 26/Jul/20 11:29 Start Date: 26/Jul/20 11:29 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460515257 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRemoveGBYSemiJoinRule.java ## @@ -41,17 +41,19 @@ public HiveRemoveGBYSemiJoinRule() { super( -operand(HiveSemiJoin.class, +operand(Join.class, some( operand(RelNode.class, any()), operand(Aggregate.class, any()))), HiveRelFactories.HIVE_BUILDER, "HiveRemoveGBYSemiJoinRule"); } @Override public void onMatch(RelOptRuleCall call) { -final HiveSemiJoin semijoin = call.rel(0); +final Join join = call.rel(0); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 46) Time Spent: 10h 10m (was: 10h)
> Support Anti Join in Hive
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
> Issue Type: Bug
> Reporter: mahesh kumar behera
> Assignee: mahesh kumar behera
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
> Time Spent: 10h 10m
> Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query requiring an anti join is
> converted to a left outer join, and a null filter on the right-side join key is
> added to get the desired result. This causes:
> # Extra computation — the left outer join projects redundant columns from the
> right side, and additional filtering is needed to remove the redundant rows.
> This can be avoided with an anti join, which projects only the required
> columns and rows from the left-side table.
> # Extra shuffle — with an anti join, duplicate records sent to the join node
> can be eliminated at the child node. This can avoid a significant amount of
> data movement when the number of distinct join keys is small relative to the
> total row count.
> # Extra memory usage — for a map-based anti join, a hash set is sufficient,
> as only the key is needed to check whether a record matches the join
> condition. For a left join, both the key and the non-key columns are needed,
> so a full hash table is required.
> For a query like
> {code:java}
> select wr_order_number FROM web_returns LEFT JOIN web_sales ON
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in the web_sales table of a
> typical 10TB TPC-DS setup is just 10% of the total records. So when this query
> is converted to an anti join, only 600 million rows are moved to the join node
> instead of 7 billion.
> In the current patch, just one conversion is done: the pattern
> project->filter->left-join is converted to project->anti-join. This takes care
> of subqueries with a "not exists" clause, which are converted first to
> filter + left-join and then to an anti join. Queries with "not in" are not
> handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution
> are supported for anti join.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
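The memory argument in the description above — a hash set of the right side's join keys suffices for a map-side anti join — can be illustrated with a minimal sketch in plain Java. This is not Hive code; the class, method, and data are hypothetical stand-ins for the `web_returns`/`web_sales` keys in the example query.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch (not Hive's implementation): an anti join only needs a
// HashSet of the right side's join keys. Duplicate keys collapse in the set
// and no non-key columns are retained, unlike the left-outer-join + IS NULL
// plan, which builds a full hash table of key and non-key columns.
public class AntiJoinSketch {
    // Returns the left-side keys that have no match on the right side.
    static List<Integer> antiJoin(List<Integer> leftKeys, List<Integer> rightKeys) {
        // Deduplication here mirrors the shuffle reduction described above
        // (e.g. 7 billion sales rows collapsing to 600 million distinct keys).
        Set<Integer> rightKeySet = new HashSet<>(rightKeys);
        List<Integer> result = new ArrayList<>();
        for (Integer key : leftKeys) {
            if (!rightKeySet.contains(key)) {
                result.add(key);
            }
        }
        return result;
    }
}
```

For example, `antiJoin(List.of(1, 2, 3, 4, 5), List.of(2, 2, 4))` keeps only the unmatched left keys 1, 3, and 5, with the duplicate right key 2 stored once.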
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462920&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462920 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 11:42 Start Date: 24/Jul/20 11:42 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460002261 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSelectivity.java ## @@ -142,7 +146,7 @@ private Double computeInnerJoinSelectivity(Join j, RelMetadataQuery mq, RexNode ndvEstimate = exponentialBackoff(peLst, colStatMap); } - if (j.isSemiJoin()) { + if (j.isSemiJoin() || (j instanceof HiveJoin && j.getJoinType().equals(JoinRelType.ANTI))) { Review comment: done Issue Time Tracking --- Worklog Id: (was: 462920) Time Spent: 10h (was: 9h 50m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462919&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462919 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 11:41 Start Date: 24/Jul/20 11:41 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460001925 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java ## @@ -2606,6 +2607,17 @@ private long computeFinalRowCount(List rowCountParents, long interimRowCou // max # of rows = rows from left side result = Math.min(rowCountParents.get(joinCond.getLeft()), result); break; +case JoinDesc.ANTI_JOIN: + long leftRowCount = rowCountParents.get(joinCond.getLeft()); + if (leftRowCount < result) { +// Ideally the inner join count should be less than the left row count, but if it is not calculated +// properly then we can assume the whole left table will be selected. +result = leftRowCount; Review comment: This case arises when the stats are not proper. So, to be on the safer side, I assume that all rows from the left side will be projected; that is the maximum value. If it were set to 0, it might trigger some rewrite that assumes the join result is empty.
Issue Time Tracking --- Worklog Id: (was: 462919) Time Spent: 9h 50m (was: 9h 40m)
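The clamping logic discussed in the StatsRulesProcFactory review comment above can be sketched as a small standalone Java method. The quoted patch only shows the fallback branch; the subtraction branch below is our assumption about the surrounding logic, and the class and method names are hypothetical.

```java
// Hedged sketch of anti-join row-count estimation: the output is roughly the
// left-side rows minus the rows that found a match (the interim inner-join
// estimate). If the stats are inconsistent and the inner-join estimate exceeds
// the left row count, fall back to the full left count -- the safe maximum.
public class AntiJoinStatsSketch {
    static long antiJoinRowCount(long leftRowCount, long innerJoinRowCount) {
        if (leftRowCount < innerJoinRowCount) {
            // Stats are not proper: assume the whole left table is selected
            // rather than producing a negative or misleadingly small estimate.
            return leftRowCount;
        }
        // Left rows that found no match survive the anti join.
        return leftRowCount - innerJoinRowCount;
    }
}
```

With consistent stats, `antiJoinRowCount(100, 30)` yields 70; with a bad inner estimate such as `antiJoinRowCount(100, 500)`, the estimate clamps to 100 instead of going negative.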
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462911&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462911 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 11:33 Start Date: 24/Jul/20 11:33 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459998986 ## File path: ql/src/test/results/clientpositive/perf/tez/cbo_query16_anti_join.q.out ## @@ -0,0 +1,99 @@ +PREHOOK: query: explain cbo +select + count(distinct cs_order_number) as `order count` + ,sum(cs_ext_ship_cost) as `total shipping cost` + ,sum(cs_net_profit) as `total net profit` +from + catalog_sales cs1 + ,date_dim + ,customer_address + ,call_center +where +d_date between '2001-4-01' and + (cast('2001-4-01' as date) + 60 days) +and cs1.cs_ship_date_sk = d_date_sk +and cs1.cs_ship_addr_sk = ca_address_sk +and ca_state = 'NY' +and cs1.cs_call_center_sk = cc_call_center_sk +and cc_county in ('Ziebach County','Levy County','Huron County','Franklin Parish', + 'Daviess County' +) +and exists (select * +from catalog_sales cs2 +where cs1.cs_order_number = cs2.cs_order_number + and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk) +and not exists(select * + from catalog_returns cr1 + where cs1.cs_order_number = cr1.cr_order_number) +order by count(distinct cs_order_number) +limit 100 +PREHOOK: type: QUERY +PREHOOK: Input: default@call_center +PREHOOK: Input: default@catalog_returns +PREHOOK: Input: default@catalog_sales +PREHOOK: Input: default@customer_address +PREHOOK: Input: default@date_dim +PREHOOK: Output: hdfs://### HDFS PATH ### +POSTHOOK: query: explain cbo +select + count(distinct cs_order_number) as `order count` + ,sum(cs_ext_ship_cost) as `total shipping cost` + ,sum(cs_net_profit) as `total net profit` +from + catalog_sales cs1 + ,date_dim + ,customer_address + ,call_center +where +d_date between '2001-4-01' and + 
(cast('2001-4-01' as date) + 60 days) +and cs1.cs_ship_date_sk = d_date_sk +and cs1.cs_ship_addr_sk = ca_address_sk +and ca_state = 'NY' +and cs1.cs_call_center_sk = cc_call_center_sk +and cc_county in ('Ziebach County','Levy County','Huron County','Franklin Parish', + 'Daviess County' +) +and exists (select * +from catalog_sales cs2 +where cs1.cs_order_number = cs2.cs_order_number + and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk) +and not exists(select * + from catalog_returns cr1 + where cs1.cs_order_number = cr1.cr_order_number) +order by count(distinct cs_order_number) +limit 100 +POSTHOOK: type: QUERY +POSTHOOK: Input: default@call_center +POSTHOOK: Input: default@catalog_returns +POSTHOOK: Input: default@catalog_sales +POSTHOOK: Input: default@customer_address +POSTHOOK: Input: default@date_dim +POSTHOOK: Output: hdfs://### HDFS PATH ### +CBO PLAN: +HiveAggregate(group=[{}], agg#0=[count(DISTINCT $4)], agg#1=[sum($5)], agg#2=[sum($6)]) + HiveJoin(condition=[=($4, $14)], joinType=[anti], algorithm=[none], cost=[not available]) +HiveSemiJoin(condition=[AND(<>($3, $13), =($4, $14))], joinType=[semi]) Review comment: done ..creating the HiveAntiJoin operator directly Issue Time Tracking --- Worklog Id: (was: 462911) Time Spent: 9.5h (was: 9h 20m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462918&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462918 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 11:39 Start Date: 24/Jul/20 11:39 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r460001236 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java ## @@ -1901,6 +1905,11 @@ public RelNode apply(RelOptCluster cluster, RelOptSchema relOptSchema, SchemaPlu calcitePreCboPlan = applyPreJoinOrderingTransforms(calciteGenPlan, mdProvider.getMetadataProvider(), executorProvider); + if (conf.getBoolVar(ConfVars.HIVE_CONVERT_ANTI_JOIN)) { Review comment: done Issue Time Tracking --- Worklog Id: (was: 462918) Time Spent: 9h 40m (was: 9.5h)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462910&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462910 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 11:33 Start Date: 24/Jul/20 11:33 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459998845 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java ## @@ -0,0 +1,145 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.ql.optimizer.calcite.rules; + +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptUtil; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.core.Filter; +import org.apache.calcite.rel.core.Join; +import org.apache.calcite.rel.core.JoinRelType; +import org.apache.calcite.rel.core.Project; +import org.apache.calcite.rel.type.RelDataTypeField; +import org.apache.calcite.rex.RexInputRef; +import org.apache.calcite.rex.RexNode; +import org.apache.calcite.sql.SqlKind; +import org.apache.calcite.util.ImmutableBitSet; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +/** + * Planner rule that converts a join plus filter to anti join. + */ +public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule { + protected static final Logger LOG = LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class); + public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new HiveJoinWithFilterToAntiJoinRule(); + + //HiveProject(fld=[$0]) + // HiveFilter(condition=[IS NULL($1)]) + //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available]) + // + // TO + // + //HiveProject(fld_tbl=[$0]) + // HiveAntiJoin(condition=[=($0, $1)], joinType=[anti]) + // + public HiveJoinWithFilterToAntiJoinRule() { +super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))), +"HiveJoinWithFilterToAntiJoinRule:filter"); + } + + // is null filter over a left join. 
+ public void onMatch(final RelOptRuleCall call) { +final Project project = call.rel(0); +final Filter filter = call.rel(1); +final Join join = call.rel(2); +perform(call, project, filter, join); + } + + protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) { +LOG.debug("Matched HiveAntiJoinRule"); + +assert (filter != null); + +//We support conversion from left outer join only. +if (join.getJoinType() != JoinRelType.LEFT) { + return; +} + +List aboveFilters = RelOptUtil.conjunctions(filter.getCondition()); +boolean hasIsNull = false; + +// Get all filter condition and check if any of them is a "is null" kind. +for (RexNode filterNode : aboveFilters) { + if (filterNode.getKind() == SqlKind.IS_NULL && + isFilterFromRightSide(join, filterNode, join.getJoinType())) { +hasIsNull = true; +break; + } +} + +// Is null should be on a key from right side of the join. +if (!hasIsNull) { + return; +} + +// Build anti join with same left, right child and condition as original left outer join. +Join anti = join.copy(join.getTraitSet(), join.getCondition(), Review comment: done Issue Time Tracking --- Worklog Id: (was: 462910) Time Spent: 9h 20m (was: 9h 10m)
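The rewrite this rule performs — left outer join plus an IS NULL filter on a right-side key becoming an anti join — can be checked for semantic equivalence with a plain-Java sketch. This is an illustration with hypothetical names and data, not Hive or Calcite code, and it assumes the right-side join keys are non-null (the case the rule targets).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Shows why the rewrite is semantics-preserving: a left outer join followed by
// an IS NULL filter on the right-side key keeps exactly the left rows with no
// match -- which is the definition of an anti join.
public class AntiJoinRewriteSketch {
    // Plan before the rule: left outer join, then a filter like HiveFilter(IS NULL($1)).
    static List<Integer> outerJoinThenIsNullFilter(List<Integer> leftKeys, Set<Integer> rightKeys) {
        List<Integer> kept = new ArrayList<>();
        for (Integer key : leftKeys) {
            // Unmatched left rows get a null-padded right side in an outer join.
            Integer joinedRightKey = rightKeys.contains(key) ? key : null;
            if (joinedRightKey == null) { // the IS NULL filter
                kept.add(key);
            }
        }
        return kept;
    }

    // Plan after the rule: a direct anti join.
    static List<Integer> antiJoin(List<Integer> leftKeys, Set<Integer> rightKeys) {
        List<Integer> kept = new ArrayList<>();
        for (Integer key : leftKeys) {
            if (!rightKeys.contains(key)) {
                kept.add(key);
            }
        }
        return kept;
    }
}
```

Both methods return the same rows for any input, which is what lets the rule replace the join + filter pair without changing results; the anti-join form simply skips materializing and then discarding the matched rows.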
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462907&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462907 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 11:12 Start Date: 24/Jul/20 11:12 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459990495 ## File path: ql/src/test/results/clientpositive/perf/tez/cbo_query16_anti_join.q.out ## @@ -0,0 +1,99 @@ +PREHOOK: query: explain cbo +select + count(distinct cs_order_number) as `order count` + ,sum(cs_ext_ship_cost) as `total shipping cost` + ,sum(cs_net_profit) as `total net profit` +from + catalog_sales cs1 + ,date_dim + ,customer_address + ,call_center +where +d_date between '2001-4-01' and + (cast('2001-4-01' as date) + 60 days) +and cs1.cs_ship_date_sk = d_date_sk +and cs1.cs_ship_addr_sk = ca_address_sk +and ca_state = 'NY' +and cs1.cs_call_center_sk = cc_call_center_sk +and cc_county in ('Ziebach County','Levy County','Huron County','Franklin Parish', + 'Daviess County' +) +and exists (select * +from catalog_sales cs2 +where cs1.cs_order_number = cs2.cs_order_number + and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk) +and not exists(select * + from catalog_returns cr1 + where cs1.cs_order_number = cr1.cr_order_number) +order by count(distinct cs_order_number) +limit 100 +PREHOOK: type: QUERY +PREHOOK: Input: default@call_center +PREHOOK: Input: default@catalog_returns +PREHOOK: Input: default@catalog_sales +PREHOOK: Input: default@customer_address +PREHOOK: Input: default@date_dim +PREHOOK: Output: hdfs://### HDFS PATH ### +POSTHOOK: query: explain cbo +select + count(distinct cs_order_number) as `order count` + ,sum(cs_ext_ship_cost) as `total shipping cost` + ,sum(cs_net_profit) as `total net profit` +from + catalog_sales cs1 + ,date_dim + ,customer_address + ,call_center +where +d_date between '2001-4-01' and + 
(cast('2001-4-01' as date) + 60 days) +and cs1.cs_ship_date_sk = d_date_sk +and cs1.cs_ship_addr_sk = ca_address_sk +and ca_state = 'NY' +and cs1.cs_call_center_sk = cc_call_center_sk +and cc_county in ('Ziebach County','Levy County','Huron County','Franklin Parish', + 'Daviess County' +) +and exists (select * +from catalog_sales cs2 +where cs1.cs_order_number = cs2.cs_order_number + and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk) +and not exists(select * + from catalog_returns cr1 + where cs1.cs_order_number = cr1.cr_order_number) +order by count(distinct cs_order_number) +limit 100 +POSTHOOK: type: QUERY +POSTHOOK: Input: default@call_center +POSTHOOK: Input: default@catalog_returns +POSTHOOK: Input: default@catalog_sales +POSTHOOK: Input: default@customer_address +POSTHOOK: Input: default@date_dim +POSTHOOK: Output: hdfs://### HDFS PATH ### +CBO PLAN: +HiveAggregate(group=[{}], agg#0=[count(DISTINCT $4)], agg#1=[sum($5)], agg#2=[sum($6)]) + HiveJoin(condition=[=($4, $14)], joinType=[anti], algorithm=[none], cost=[not available]) Review comment: I think it's not a problem. The field indexes are for different inputs, so even though the number is the same, the condition is different. Even without the anti join, the condition is the same: HiveFilter(condition=[IS NULL($13)]) HiveJoin(condition=[=($4, $14)], joinType=[left], algorithm=[none], cost=[not available]) HiveSemiJoin(condition=[AND(<>($3, $13), =($4, $14))], joinType=[semi])
Issue Time Tracking --- Worklog Id: (was: 462907) Time Spent: 9h 10m (was: 9h)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462906&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462906 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 11:06 Start Date: 24/Jul/20 11:06 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459988214 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java ## @@ -79,6 +80,11 @@ public Double getDistinctRowCount(HiveSemiJoin rel, RelMetadataQuery mq, Immutab return super.getDistinctRowCount(rel, mq, groupKey, predicate); } + public Double getDistinctRowCount(HiveAntiJoin rel, RelMetadataQuery mq, ImmutableBitSet groupKey, +RexNode predicate) { +return super.getDistinctRowCount(rel, mq, groupKey, predicate); Review comment: Calcite 1.21 does not support distinct-row-count calculation for anti join: if (join.isSemiJoin()) { return getSemiJoinDistinctRowCount(join, mq, groupKey, predicate); } else { Builder leftMask = ImmutableBitSet.builder(); I think these rules will not get triggered for anti join as of now, because I am not converting the not-exists directly to an anti join. As of now, all these rules are applied on the left outer join, and then we convert the left outer join to an anti join.
Issue Time Tracking --- Worklog Id: (was: 462906) Time Spent: 9h (was: 8h 50m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462904&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462904 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 11:01 Start Date: 24/Jul/20 11:01 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459986329 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSubQueryRemoveRule.java ## @@ -414,6 +416,13 @@ private RexNode rewriteInExists(RexSubQuery e, Set variablesSet, // null keys we do not need to generate count(*), count(c) if (e.getKind() == SqlKind.EXISTS) { logic = RelOptUtil.Logic.TRUE_FALSE; +if (conf.getBoolVar(HiveConf.ConfVars.HIVE_CONVERT_ANTI_JOIN)) { + //TODO : As of now anti join is first converted to left outer join Review comment: Even now the conversion is not done: the code is present, but the actual conversion does not happen and the logic is still TRUE_FALSE. For the code to be effective, the logic should be changed to FALSE. I have not done it yet, as it was causing some plan changes that I could not judge to be expected or not. Anyway, I have created a JIRA to track this: https://issues.apache.org/jira/browse/HIVE-23928
Issue Time Tracking --- Worklog Id: (was: 462904) Time Spent: 8h 50m (was: 8h 40m)

> Support Anti Join in Hive
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
> Issue Type: Bug
> Reporter: mahesh kumar behera
> Assignee: mahesh kumar behera
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
> Time Spent: 8h 50m
> Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that needs an anti join is converted to a left outer join, and a null filter on the right-side join key is added to get the desired result. This causes:
> # Extra computation — the left outer join projects redundant columns from the right side, and additional filtering is done to remove the redundant rows. An anti join avoids this, since it projects only the required columns and rows from the left-side table.
> # Extra shuffle — with an anti join, duplicate records can be eliminated at the child node before being moved to the join node. This can save a significant amount of data movement when the number of distinct join keys is much smaller than the row count.
> # Extra memory usage — for a map-based anti join, a hash set is sufficient, as only the key is needed to check whether a record matches the join condition. A left join needs the key and the non-key columns as well, so a hash table is required.
> For a query like
> {code:java}
> select wr_order_number FROM web_returns LEFT JOIN web_sales ON
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in web_sales in a typical 10TB TPCDS setup is just 10% of the total records. So when this query is converted to an anti join, only 600 million rows are moved to the join node instead of 7 billion.
> In the current patch, just one conversion is done: the pattern project->filter->left-join is converted to project->anti-join. This takes care of subqueries with a "not exists" clause; such queries are first converted to filter + left join, and then to anti join. Queries with "not in" are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution are supported for anti join.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
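The memory and computation arguments above can be made concrete with a small, self-contained Java sketch (toy data invented here; this is not Hive's actual hash-join implementation): an anti join needs only a HashSet of distinct right-side keys, while the left-outer-join-plus-IS-NULL plan builds a hash table that also carries right-side payload columns and produces matched rows just to filter them out.

```java
import java.util.*;

public class AntiJoinSketch {
    public static void main(String[] args) {
        // Hypothetical data: left rows are wr_order_number values,
        // right rows are ws_order_number values.
        List<Integer> left = Arrays.asList(1, 2, 3, 4, 5);
        List<Integer> rightKeys = Arrays.asList(2, 4, 2, 4); // duplicates collapse away

        // Anti join: a hash set of the distinct right keys is enough.
        Set<Integer> rightSet = new HashSet<>(rightKeys);
        List<Integer> antiResult = new ArrayList<>();
        for (Integer k : left) {
            if (!rightSet.contains(k)) {
                antiResult.add(k); // keep left rows with no match on the right
            }
        }

        // Left outer join + IS NULL filter: the hash table must also carry
        // right-side payload columns, and matched rows are produced only to
        // be discarded by the filter afterwards.
        Map<Integer, String> rightTable = new HashMap<>();
        for (Integer k : rightKeys) {
            rightTable.put(k, "payload-" + k); // non-key column materialized
        }
        List<Integer> outerThenFilter = new ArrayList<>();
        for (Integer k : left) {
            String match = rightTable.get(k); // null when unmatched
            if (match == null) {
                outerThenFilter.add(k);
            }
        }

        System.out.println(antiResult);      // [1, 3, 5]
        System.out.println(outerThenFilter); // [1, 3, 5]
    }
}
```

Both plans return the same rows; the anti join simply avoids materializing the payload and the intermediate matched rows.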
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462903&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462903 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 10:55 Start Date: 24/Jul/20 10:55 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459984171

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java

@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class);
+  public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new HiveJoinWithFilterToAntiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //    HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveJoinWithFilterToAntiJoinRule() {
+    super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+        "HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+    final Project project = call.rel(0);
+    final Filter filter = call.rel(1);
+    final Join join = call.rel(2);
+    perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) {
+    LOG.debug("Matched HiveAntiJoinRule");
+
+    if (join.getCondition().isAlwaysTrue()) {
+      return;
+    }
+
+    // We support conversion from left outer join only.
+    if (join.getJoinType() != JoinRelType.LEFT) {
+      return;
+    }
+
+    assert (filter != null);
+
+    List aboveFilters = RelOptUtil.conjunctions(filter.getCondition());
+    boolean hasIsNull = false;
+
+    // Get all filter conditions and check if any of them is of "is null" kind.
+    for (RexNode filterNode : aboveFilters) {
+      if (filterNode.getKind() == SqlKind.IS_NULL &&
+          isFilterFromRightSide(join, filterNode, join.getJoinType())) {
+        hasIsNull = true;
+        break;
+      }
+    }
+
+    // "Is null" should be on a key from the right side of the join.
+    if (!hasIsNull) {
+      return;
+    }
+
+    // Build anti join with the same left and right children and condition as the original left outer join.
+    Join anti = join.copy(join.getTraitSet(), join.getCondition(),
+        join.getLeft(), join.getRight(), JoinRelType.ANTI, false);
+
+    //TODO : Do we really need it?
+    call.getPlanner().onCopy(join, anti);
+
+    RelNode newProject = getNewProjectNode(project, anti);
+    if (newProject != null) {
+      call.getPlanner().onCopy(project, newProject);
+      call.transformTo(newProject);
+    }
+  }
+
+  protected RelNode getNewProjectNode(Project oldProject, Join newJoin) {

Review comment: I didn't find any such utility method, so added this.
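The rewrite performed by HiveJoinWithFilterToAntiJoinRule can be sanity-checked on toy data, independent of Calcite (a hedged sketch with made-up rows, not the rule's own code): project over an IS NULL filter over a left outer join keeps exactly the rows that project over an anti join keeps.

```java
import java.util.*;

public class AntiJoinRewriteCheck {
    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(10, 20, 30);
        List<Integer> right = Arrays.asList(20, 20, 40);

        // Plan 1: left outer join on l = r, then IS NULL filter on the right
        // key, then project the left column. An unmatched left row produces
        // exactly one null-extended row, which survives the filter.
        List<Integer> plan1 = new ArrayList<>();
        for (Integer l : left) {
            boolean matched = false;
            for (Integer r : right) {
                if (l.equals(r)) { // join condition l = r
                    matched = true;
                }
            }
            if (!matched) {
                plan1.add(l); // right key is NULL for this row
            }
        }

        // Plan 2: anti join with the same condition, projecting the left column.
        Set<Integer> rightSet = new HashSet<>(right);
        List<Integer> plan2 = new ArrayList<>();
        for (Integer l : left) {
            if (!rightSet.contains(l)) {
                plan2.add(l);
            }
        }

        System.out.println(plan1); // [10, 30]
        System.out.println(plan2); // [10, 30]
    }
}
```

The two plans agree row for row, which is what makes the project->filter->left-join to project->anti-join rewrite safe.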
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462816&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462816 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 03:57 Start Date: 24/Jul/20 03:57 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459841087

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinProjectTransposeRule.java

@@ -133,6 +135,10 @@ private HiveJoinProjectTransposeRuleBase( public void onMatch(RelOptRuleCall call) {
 //TODO: this can be removed once CALCITE-3824 is released
+ Join joinRel = call.rel(0);
+ if (joinRel.getJoinType() == JoinRelType.ANTI) {

Review comment: This was causing an issue with the HAVING clause: https://issues.apache.org/jira/browse/HIVE-23921

Issue Time Tracking --- Worklog Id: (was: 462816) Time Spent: 8.5h (was: 8h 20m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462815&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462815 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 03:52 Start Date: 24/Jul/20 03:52 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459840139

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinConstraintsRule.java

@@ -100,7 +100,8 @@ public void onMatch(RelOptRuleCall call) {
 // These boolean values represent corresponding left, right input which is potential FK
 boolean leftInputPotentialFK = topRefs.intersects(leftBits);
 boolean rightInputPotentialFK = topRefs.intersects(rightBits);
-if (leftInputPotentialFK && rightInputPotentialFK && (joinType == JoinRelType.INNER || joinType == JoinRelType.SEMI)) {
+if (leftInputPotentialFK && rightInputPotentialFK &&
+    (joinType == JoinRelType.INNER || joinType == JoinRelType.SEMI || joinType == JoinRelType.ANTI)) {

Review comment: https://issues.apache.org/jira/browse/HIVE-23920

Issue Time Tracking --- Worklog Id: (was: 462815) Time Spent: 8h 20m (was: 8h 10m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462811&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462811 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 02:41 Start Date: 24/Jul/20 02:41 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459827002

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java

@@ -74,7 +78,14 @@ public HiveJoinAddNotNullRule(Class clazz, @Override public void onMatch(RelOptRuleCall call) {
 Join join = call.rel(0);
-if (join.getJoinType() == JoinRelType.FULL || join.getCondition().isAlwaysTrue()) {
+
+// For the anti join case, add the not-null check on the right side even if the condition is
+// always true. This is done because during execution an anti join expects the right side to
+// be empty, and if we don't put a null check on the right, then for a null-only right-side
+// table with an always-true condition, execution will produce 0 records.
+// e.g. select * from left_tbl where (select 1 from all_null_right limit 1) is null
+if (join.getJoinType() == JoinRelType.FULL ||
+    (join.getJoinType() != JoinRelType.ANTI && join.getCondition().isAlwaysTrue())) {

Review comment: Yes, the comment is not right. What it should say is that we will add a not-null condition for an anti join even if the join condition is always true.
Issue Time Tracking --- Worklog Id: (was: 462811) Time Spent: 8h 10m (was: 8h)
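One property behind the not-null discussion above can be shown directly: under an equality join condition, a right-side row with a NULL key can never match any left row, so filtering NULL keys out of the right side does not change the anti-join result while shrinking the build input. A minimal sketch with assumed toy data (not Hive's execution code):

```java
import java.util.*;

public class AntiJoinNotNullCheck {
    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 3);
        // Right keys include NULLs; under SQL equality a NULL key matches nothing.
        List<Integer> right = Arrays.asList(2, null, null);

        // Anti join keeping the NULL right keys (they simply never match).
        List<Integer> withNulls = antiJoin(left, right);

        // Anti join after a not-null filter on the right key.
        List<Integer> rightFiltered = new ArrayList<>();
        for (Integer r : right) {
            if (r != null) {
                rightFiltered.add(r);
            }
        }
        List<Integer> withoutNulls = antiJoin(left, rightFiltered);

        System.out.println(withNulls);    // [1, 3]
        System.out.println(withoutNulls); // [1, 3]
    }

    // Keep each left row that has no equal, non-null right key.
    static List<Integer> antiJoin(List<Integer> left, List<Integer> right) {
        List<Integer> out = new ArrayList<>();
        for (Integer l : left) {
            boolean matched = false;
            for (Integer r : right) {
                if (r != null && r.equals(l)) { // SQL: NULL = x is never true
                    matched = true;
                }
            }
            if (!matched) {
                out.add(l);
            }
        }
        return out;
    }
}
```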
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462810&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462810 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 02:39 Start Date: 24/Jul/20 02:39 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459826636

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveAntiJoin.java

@@ -0,0 +1,95 @@
+/* [Apache License 2.0 header, identical to the one quoted above] */
+package org.apache.hadoop.hive.ql.optimizer.calcite.reloperators;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Sets;
+import org.apache.calcite.plan.RelOptCluster;
+import org.apache.calcite.plan.RelTraitSet;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexNode;
+import org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelOptUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRulesRegistry;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class HiveAntiJoin extends Join implements HiveRelNode {

Review comment: https://issues.apache.org/jira/browse/HIVE-23919

Issue Time Tracking --- Worklog Id: (was: 462810) Time Spent: 8h (was: 7h 50m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462805&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462805 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 24/Jul/20 02:35 Start Date: 24/Jul/20 02:35 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459825977

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveAntiJoin.java

@@ -0,0 +1,95 @@
+public class HiveAntiJoin extends Join implements HiveRelNode {
+
+  private final RexNode joinFilter;

Review comment: The joinFilter holds the residual filter, which is used during post-processing. These are the join conditions that are not part of the join key. I think the condition in Join holds the full condition.

Issue Time Tracking --- Worklog Id: (was: 462805) Time Spent: 7h 50m (was: 7h 40m)
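The residual-filter idea discussed above can be sketched with toy data (hypothetical rows and a made-up `qty > 10` residual predicate; not the HiveAntiJoin code itself): a left row is eliminated only when some right row satisfies both the equi-join key and the residual, non-key condition.

```java
import java.util.*;

public class AntiJoinResidualFilter {
    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 3);
        // Right rows as {key, qty} pairs; qty is a non-key column.
        int[][] right = { {1, 5}, {2, 50} };

        List<Integer> result = new ArrayList<>();
        for (Integer l : left) {
            boolean matched = false;
            for (int[] r : right) {
                // Equi-join key plus residual (non-equi) condition:
                // l = r.key AND r.qty > 10
                if (l == r[0] && r[1] > 10) {
                    matched = true;
                }
            }
            if (!matched) {
                result.add(l); // key 1 survives: its only match fails the residual
            }
        }
        System.out.println(result); // [1, 3]
    }
}
```

Key 1 has a matching right key, but the match fails the residual predicate, so the row still belongs in the anti-join output; this is why the residual conditions must be tracked separately from the equi-join keys.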
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462452&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462452 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 09:03 Start Date: 23/Jul/20 09:03 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459310372

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java

@@ -74,7 +78,14 @@ public HiveJoinAddNotNullRule(Class clazz, @Override public void onMatch(RelOptRuleCall call) {
 Join join = call.rel(0);
-if (join.getJoinType() == JoinRelType.FULL || join.getCondition().isAlwaysTrue()) {
+
+// For anti join case add the not null on right side if the condition is

Review comment: Thanks! Makes sense now.

Issue Time Tracking --- Worklog Id: (was: 462452) Time Spent: 7h 40m (was: 7.5h)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462450&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462450 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 09:00 Start Date: 23/Jul/20 09:00 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459308986

## File path: ql/src/java/org/apache/hadoop/hive/ql/ppd/SyntheticJoinPredicate.java

@@ -339,6 +339,12 @@ String getFuncText(String funcText, final int srcPos) {
 vector.add(right, left);
 break;
 case JoinDesc.LEFT_OUTER_JOIN:
+case JoinDesc.ANTI_JOIN:
+//TODO : In case of anti join, a bloom filter can be created on the left side also ("IN (keylist right table)").
+// But the filter should be "NOT IN (keylist right table)", as we want to select the records from
+// the left side which are not present in the right side. This may cause wrong results, as a
+// bloom filter may have false positives, so simply adding NOT is not correct;
+// special handling is required for "NOT IN".

Review comment: Thanks Mahesh! Had this in the back of my head for a while -- this will be useful for a bunch of cases including anti-joins.

Issue Time Tracking --- Worklog Id: (was: 462450) Time Spent: 7.5h (was: 7h 20m)
> For a query like > {code:java} > select wr_order_number FROM web_returns LEFT JOIN web_sales ON wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code} > The number of distinct ws_order_number values in the web_sales table in a typical 10TB TPCDS setup is just 10% of the total records. So when we convert this query to an anti join, instead of 7 billion rows, only 600 million rows are moved to the join node. > In the current patch, just one conversion is done: the pattern project->filter->left-join is converted to project->anti-join. This takes care of subqueries with a “not exists” clause; such queries are converted first to filter + left-join and then to an anti join. Queries with “not in” are not handled in the current patch. > On the execution side, both merge join and map join with vectorized execution are supported for anti join.
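To make the conversion concrete, here is a minimal sketch (illustrative only, not Hive's implementation) of the anti-join semantics described above: only left-side columns are projected, and a hash set of the distinct right-side keys is enough to decide which left rows survive.

```java
import java.util.*;
import java.util.stream.*;

public class AntiJoinSketch {
    // Hypothetical data: leftKeys stand in for wr_order_number values,
    // rightKeys stand in for ws_order_number values.
    public static List<Long> antiJoin(List<Long> leftKeys, List<Long> rightKeys) {
        // A hash *set* of distinct right-side keys is all an anti join needs:
        // no right-side columns are ever projected, and duplicates collapse.
        Set<Long> rightSet = new HashSet<>(rightKeys);
        return leftKeys.stream()
                .filter(k -> !rightSet.contains(k))   // keep only unmatched rows
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> left = Arrays.asList(1L, 2L, 3L, 4L);
        List<Long> right = Arrays.asList(2L, 2L, 4L);  // duplicates collapse in the set
        System.out.println(antiJoin(left, right));     // [1, 3]
    }
}
```

Contrast with the left-outer-join rewrite: that plan would materialize a hash *table* of right-side rows, project right-side columns for every left row, and then filter on `ws_order_number IS NULL`.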
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462448&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462448 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 08:58 Start Date: 23/Jul/20 08:58 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459307921 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ## @@ -0,0 +1,315 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.ql.exec.vector.mapjoin; + +import org.apache.hadoop.hive.ql.CompilationOpContext; +import org.apache.hadoop.hive.ql.exec.JoinUtil; +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector; +import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.ql.plan.OperatorDesc; +import org.apache.hadoop.hive.ql.plan.VectorDesc; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Arrays; + +// TODO : Duplicate codes need to merge with semi join. +// Single-Column Long hash table import. +// Single-Column Long specific imports. + +/* + * Specialized class for doing a vectorized map join that is an anti join on a Single-Column Long + * using a hash set. + */ +public class VectorMapJoinAntiJoinLongOperator extends VectorMapJoinAntiJoinGenerateResultOperator { + + private static final long serialVersionUID = 1L; + private static final String CLASS_NAME = VectorMapJoinAntiJoinLongOperator.class.getName(); + private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME); + protected String getLoggingPrefix() { +return super.getLoggingPrefix(CLASS_NAME); + } + + // The above members are initialized by the constructor and must not be + // transient. + + // The hash map for this specialized class. + private transient VectorMapJoinLongHashSet hashSet; + + // Single-Column Long specific members. + // For integers, we have optional min/max filtering. + private transient boolean useMinMax; + private transient long min; + private transient long max; + + // The column number for this one column join specialization. + private transient int singleJoinColumn; + + // Pass-thru constructors. 
+ /** Kryo ctor. */ + protected VectorMapJoinAntiJoinLongOperator() { +super(); + } + + public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) { +super(ctx); + } + + public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, OperatorDesc conf, + VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException { +super(ctx, conf, vContext, vectorDesc); + } + + // Process Single-Column Long Anti Join on a vectorized row batch. + @Override + protected void commonSetup() throws HiveException { +super.commonSetup(); + +// Initialize Single-Column Long members for this specialized class. +singleJoinColumn = bigTableKeyColumnMap[0]; + } + + @Override + public void hashTableSetup() throws HiveException { +super.hashTableSetup(); + +// Get our Single-Column Long hash set information for this specialized class. +hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable; +useMinMax = hashSet.useMinMax(); +if (useMinMax) { + min = hashSet.min(); + max = hashSet.max(); +} + } + + @Override + public void processBatch(VectorizedRowBatch batch) throws HiveException { + +try { + // (Currently none) + // antiPerBatchSetup(batch); + + // For anti joins, we may apply the filter(s) now. + for(VectorExpression ve : bigTableFilterExpressions) { +ve.evaluate(batch); + } +
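The `hashTableSetup`/`processBatch` flow quoted above enables optional min/max screening for long keys. A minimal sketch of the idea (a hypothetical simplification, not Hive's vectorized code): a key outside [min, max] cannot be in the hash set, so for an anti join that row qualifies immediately, with no hash probe.

```java
import java.util.*;

public class MinMaxAntiProbe {
    // Illustrative stand-in for the useMinMax/min/max screening of
    // VectorMapJoinLongHashSet; names and shapes are invented for the sketch.
    public static List<Long> antiJoinWithMinMax(long[] leftKeys, Set<Long> rightSet) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long k : rightSet) { min = Math.min(min, k); max = Math.max(max, k); }
        List<Long> out = new ArrayList<>();
        for (long k : leftKeys) {
            // Outside [min, max] the key cannot be in the set: for an anti join
            // the row qualifies immediately, no hash probe needed.
            if (k < min || k > max || !rightSet.contains(k)) {
                out.add(k);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Long> right = new HashSet<>(Arrays.asList(5L, 7L));
        System.out.println(antiJoinWithMinMax(new long[]{1L, 5L, 6L, 9L}, right)); // [1, 6, 9]
    }
}
```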
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462446&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462446 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 08:57 Start Date: 23/Jul/20 08:57 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459307547 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java ## @@ -523,11 +533,19 @@ private boolean createForwardJoinObject(boolean[] skip) throws HiveException { forward = true; } } +return forward; + } + + // returns whether a record was forwarded + private boolean createForwardJoinObject(boolean[] skip, boolean antiJoin) throws HiveException { +boolean forward = fillFwdCache(skip); if (forward) { if (needsPostEvaluation) { forward = !JoinUtil.isFiltered(forwardCache, residualJoinFilters, residualJoinFiltersOIs); } - if (forward) { + + // For anti join, check all right side and if nothing is matched then only forward. Review comment: Ok, makes sense now -- so maybe we should just mention that for anti-join we don't forward at this point. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 462446) Time Spent: 7h 10m (was: 7h) > Support Anti Join in Hive > -- > > Key: HIVE-23716 > URL: https://issues.apache.org/jira/browse/HIVE-23716 > Project: Hive > Issue Type: Bug > Reporter: mahesh kumar behera > Assignee: mahesh kumar behera > Priority: Major > Labels: pull-request-available > Attachments: HIVE-23716.01.patch > > Time Spent: 7h 10m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
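The forwarding rule discussed in this review (for an anti join, a left row is forwarded only after all right-side candidates have been checked and none matched) can be sketched as follows; this is an illustrative simplification, not CommonJoinOperator's actual code:

```java
import java.util.*;
import java.util.function.BiPredicate;

public class AntiJoinForwarding {
    // Sketch: unlike inner/semi joins, an anti join must not forward a left row
    // at match time; it forwards only once the whole right side has been scanned
    // without a match.
    public static <L, R> List<L> antiJoin(List<L> left, List<R> right,
                                          BiPredicate<L, R> matches) {
        List<L> out = new ArrayList<>();
        for (L l : left) {
            boolean matched = false;
            for (R r : right) {
                if (matches.test(l, r)) { matched = true; break; }
            }
            if (!matched) {   // forward happens only here, after all candidates
                out.add(l);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 3);
        List<Integer> right = Arrays.asList(2);
        System.out.println(antiJoin(left, right, (a, b) -> a.equals(b))); // [1, 3]
    }
}
```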
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462417&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462417 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 07:09 Start Date: 23/Jul/20 07:09 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459253392 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptUtil.java ## @@ -747,7 +747,7 @@ public static RewritablePKFKJoinInfo isRewritablePKFKJoin(Join join, final RelNode nonFkInput = leftInputPotentialFK ? join.getRight() : join.getLeft(); final RewritablePKFKJoinInfo nonRewritable = RewritablePKFKJoinInfo.of(false, null); -if (joinType != JoinRelType.INNER && !join.isSemiJoin()) { +if (joinType != JoinRelType.INNER && !join.isSemiJoin() && joinType != JoinRelType.ANTI) { Review comment: https://issues.apache.org/jira/browse/HIVE-23906 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 462417) Time Spent: 7h (was: 6h 50m) > Support Anti Join in Hive > -- > > Key: HIVE-23716 > URL: https://issues.apache.org/jira/browse/HIVE-23716 > Project: Hive > Issue Type: Bug > Reporter: mahesh kumar behera > Assignee: mahesh kumar behera > Priority: Major > Labels: pull-request-available > Attachments: HIVE-23716.01.patch > > Time Spent: 7h > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462394&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462394 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 05:25 Start Date: 23/Jul/20 05:25 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459220734 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java ## @@ -0,0 +1,218 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.ql.exec.vector.mapjoin; + +import org.apache.hadoop.hive.ql.CompilationOpContext; +import org.apache.hadoop.hive.ql.exec.JoinUtil; +import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.ql.plan.OperatorDesc; +import org.apache.hadoop.hive.ql.plan.VectorDesc; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; + +// TODO : This class is duplicate of semi join. Need to do a refactoring to merge it with semi join. Review comment: https://issues.apache.org/jira/browse/HIVE-23905 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 462394) Time Spent: 6h 50m (was: 6h 40m) > Support Anti Join in Hive > -- > > Key: HIVE-23716 > URL: https://issues.apache.org/jira/browse/HIVE-23716 > Project: Hive > Issue Type: Bug > Reporter: mahesh kumar behera > Assignee: mahesh kumar behera > Priority: Major > Labels: pull-request-available > Attachments: HIVE-23716.01.patch > > Time Spent: 6h 50m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462392&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462392 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 05:22 Start Date: 23/Jul/20 05:22 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459219966 ## File path: ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/TestMapJoinOperator.java ## @@ -1792,6 +1794,8 @@ private void executeTest(MapJoinTestDescription testDesc, MapJoinTestData testData case FULL_OUTER: executeTestFullOuter(testDesc, testData, title); break; +case ANTI: //TODO Review comment: https://issues.apache.org/jira/browse/HIVE-23904 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 462392) Time Spent: 6h 40m (was: 6.5h) > Support Anti Join in Hive > -- > > Key: HIVE-23716 > URL: https://issues.apache.org/jira/browse/HIVE-23716 > Project: Hive > Issue Type: Bug > Reporter: mahesh kumar behera > Assignee: mahesh kumar behera > Priority: Major > Labels: pull-request-available > Attachments: HIVE-23716.01.patch > > Time Spent: 6h 40m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462387&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462387 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 05:08 Start Date: 23/Jul/20 05:08 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459216878 ## File path: ql/src/java/org/apache/hadoop/hive/ql/ppd/SyntheticJoinPredicate.java ## @@ -339,6 +339,12 @@ String getFuncText(String funcText, final int srcPos) { vector.add(right, left); break; case JoinDesc.LEFT_OUTER_JOIN: +case JoinDesc.ANTI_JOIN: +//TODO : In case of anti join, bloom filter can be created on left side also ("IN (keylist right table)"). +// But the filter should be "not-in" ("NOT IN (keylist right table)") as we want to select the records from +// left side which are not present in the right side. But it may cause wrong result as +// bloom filter may have false positive and thus simply adding not is not correct, +// special handling is required for "NOT IN". Review comment: Created a Jira: https://issues.apache.org/jira/browse/HIVE-23903 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 462387) Time Spent: 6.5h (was: 6h 20m) > Support Anti Join in Hive > -- > > Key: HIVE-23716 > URL: https://issues.apache.org/jira/browse/HIVE-23716 > Project: Hive > Issue Type: Bug > Reporter: mahesh kumar behera > Assignee: mahesh kumar behera > Priority: Major > Labels: pull-request-available > Attachments: HIVE-23716.01.patch > > Time Spent: 6.5h > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
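The false-positive hazard the TODO quoted above warns about can be demonstrated with a toy Bloom filter (the 8-bit filter and the two hash functions below are invented purely for illustration): a Bloom membership test may report "maybe present" for a key that was never inserted, so negating it to implement NOT IN silently drops valid anti-join rows.

```java
public class BloomNotInHazard {
    // Toy Bloom filter: 8 bits, two trivially simple hash functions.
    static boolean[] bits = new boolean[8];
    public static int h1(long k) { return (int) (k % 8); }
    public static int h2(long k) { return (int) ((k * 3) % 8); }
    public static void add(long k) { bits[h1(k)] = true; bits[h2(k)] = true; }
    public static boolean mightContain(long k) { return bits[h1(k)] && bits[h2(k)]; }

    public static void main(String[] args) {
        add(2L);  // the right side contains only key 2 (sets bits 2 and 6)
        // Key 10 was never added, but hashes to the same bits: a false positive.
        System.out.println(mightContain(10L));   // true
        // A semi join using "IN (bloom)" only does wasted work on a false
        // positive, but an anti join using "NOT IN (bloom)" would wrongly
        // DROP left row 10 here -- hence the special handling the TODO asks for.
        System.out.println(!mightContain(10L));  // false -> row 10 filtered out
    }
}
```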
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462384&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462384 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 05:01 Start Date: 23/Jul/20 05:01 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459215352 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java ## @@ -74,7 +78,14 @@ public HiveJoinAddNotNullRule(Class clazz, @Override public void onMatch(RelOptRuleCall call) { Join join = call.rel(0); -if (join.getJoinType() == JoinRelType.FULL || join.getCondition().isAlwaysTrue()) { + +// For anti join case add the not null on right side if the condition is Review comment: When we have a join condition that actually gets evaluated, it returns false on comparison with a null on the right side. But with an always-true join condition no comparison is performed -- every right-side row counts as a match -- so for anti join the left-side records would never be emitted. To avoid this we put a null check on the right-side table: for an all-null entry, no records are projected from the right side, and thus all records from the left side are emitted. So the comment is not very accurate; the point is that even when the condition is always true, we add a null check on the right side for anti join. I will update it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 462384) Time Spent: 6h 20m (was: 6h 10m) > Support Anti Join in Hive > -- > > Key: HIVE-23716 > URL: https://issues.apache.org/jira/browse/HIVE-23716 > Project: Hive > Issue Type: Bug > Reporter: mahesh kumar behera > Assignee: mahesh kumar behera > Priority: Major > Labels: pull-request-available > Attachments: HIVE-23716.01.patch > > Time Spent: 6h 20m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
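The behavior described in the HiveJoinAddNotNullRule review above can be sketched as follows (a hypothetical simplification of the rule's effect, not the rule itself): with an always-true join condition, it is the screening-out of all-null right-side rows that lets the anti join emit left-side rows at all.

```java
import java.util.*;

public class AntiJoinNotNullSketch {
    // Sketch: with an always-true join condition, an anti join emits a left row
    // only if the right side contributes no (non-null) row; the IS NOT NULL
    // filter the rule adds is what removes all-null right rows from contention.
    public static List<Integer> antiJoinAlwaysTrue(List<Integer> left,
                                                   List<Integer> rightKeys) {
        // The null screen added on the right side:
        boolean rightHasRow = rightKeys.stream().anyMatch(Objects::nonNull);
        // Always-true condition: any surviving right row matches every left row.
        return rightHasRow ? Collections.emptyList() : new ArrayList<>(left);
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(10, 20);
        // Right side holds only NULL keys: without the null screen every left
        // row would be wrongly suppressed; with it, all left rows are emitted.
        System.out.println(antiJoinAlwaysTrue(left, Arrays.asList((Integer) null)));    // [10, 20]
        System.out.println(antiJoinAlwaysTrue(left, Arrays.asList((Integer) null, 5))); // []
    }
}
```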
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462380&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462380 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 04:47 Start Date: 23/Jul/20 04:47 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459212369 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ##
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462376&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462376 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 04:26 Start Date: 23/Jul/20 04:26 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459208067 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ## @@ -0,0 +1,315 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.ql.exec.vector.mapjoin; + +import org.apache.hadoop.hive.ql.CompilationOpContext; +import org.apache.hadoop.hive.ql.exec.JoinUtil; +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector; +import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.ql.plan.OperatorDesc; +import org.apache.hadoop.hive.ql.plan.VectorDesc; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Arrays; + +// TODO : Duplicate codes need to merge with semi join. +// Single-Column Long hash table import. +// Single-Column Long specific imports. + +/* + * Specialized class for doing a vectorized map join that is an anti join on a Single-Column Long + * using a hash set. + */ +public class VectorMapJoinAntiJoinLongOperator extends VectorMapJoinAntiJoinGenerateResultOperator { + + private static final long serialVersionUID = 1L; + private static final String CLASS_NAME = VectorMapJoinAntiJoinLongOperator.class.getName(); + private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME); + protected String getLoggingPrefix() { +return super.getLoggingPrefix(CLASS_NAME); + } + + // The above members are initialized by the constructor and must not be + // transient. + + // The hash map for this specialized class. + private transient VectorMapJoinLongHashSet hashSet; + + // Single-Column Long specific members. + // For integers, we have optional min/max filtering. + private transient boolean useMinMax; + private transient long min; + private transient long max; + + // The column number for this one column join specialization. + private transient int singleJoinColumn; + + // Pass-thru constructors. 
+ /** Kryo ctor. */ + protected VectorMapJoinAntiJoinLongOperator() { +super(); + } + + public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) { +super(ctx); + } + + public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, OperatorDesc conf, + VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException { +super(ctx, conf, vContext, vectorDesc); + } + + // Process Single-Column Long Anti Join on a vectorized row batch. + @Override + protected void commonSetup() throws HiveException { +super.commonSetup(); + +// Initialize Single-Column Long members for this specialized class. +singleJoinColumn = bigTableKeyColumnMap[0]; + } + + @Override + public void hashTableSetup() throws HiveException { +super.hashTableSetup(); + +// Get our Single-Column Long hash set information for this specialized class. +hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable; +useMinMax = hashSet.useMinMax(); +if (useMinMax) { + min = hashSet.min(); + max = hashSet.max(); +} + } + + @Override + public void processBatch(VectorizedRowBatch batch) throws HiveException { + +try { + // (Currently none) Review comment: Pre-batch processing is done only for joins which emit the right-table records. For semi join and anti join, it is not required. ---
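The `useMinMax`/`min`/`max` fields in the operator above enable a cheap pre-check before probing the long hash set. A minimal standalone sketch (hypothetical names, not Hive's actual implementation) of how such a min/max short-circuit interacts with anti join semantics:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the min/max short-circuit hinted at in the operator above
// (illustrative names, not Hive's code): long keys outside [min, max] cannot
// be present in the build-side hash set, so an anti join can qualify such
// rows immediately, without probing the set at all.
public class MinMaxAntiProbe {
    // Returns true when the big-table row qualifies for the anti join output,
    // i.e. its key has no match on the build (right) side.
    static boolean qualifiesForAntiJoin(long key, Set<Long> rightKeys,
                                        boolean useMinMax, long min, long max) {
        if (useMinMax && (key < min || key > max)) {
            return true; // outside the key range: cannot match, emit immediately
        }
        return !rightKeys.contains(key);
    }
}
```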
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462362&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462362 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 23/Jul/20 03:39 Start Date: 23/Jul/20 03:39 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r459198718 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java ## @@ -523,11 +533,19 @@ private boolean createForwardJoinObject(boolean[] skip) throws HiveException { forward = true; } } +return forward; + } + + // returns whether a record was forwarded + private boolean createForwardJoinObject(boolean[] skip, boolean antiJoin) throws HiveException { +boolean forward = fillFwdCache(skip); if (forward) { if (needsPostEvaluation) { forward = !JoinUtil.isFiltered(forwardCache, residualJoinFilters, residualJoinFiltersOIs); } - if (forward) { + + // For anti join, check all right side and if nothing is matched then only forward. Review comment: For anti join we don't emit the record here; it is emitted only after all the records are checked and none of them matches the condition. Here, if forward is false we don't forward, and since the check is a conjunction ("&&"), when anti join == true we don't forward even if forward is true. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 462362) Time Spent: 5h 50m (was: 5h 40m) > Support Anti Join in Hive > -- > > Key: HIVE-23716 > URL: https://issues.apache.org/jira/browse/HIVE-23716 > Project: Hive > Issue Type: Bug >Reporter: mahesh kumar behera >Assignee: mahesh kumar behera >Priority: Major > Labels: pull-request-available > Attachments: HIVE-23716.01.patch > > Time Spent: 5h 50m > Remaining Estimate: 0h > > Currently Hive does not support anti join. A query requiring an anti join is > converted to a left outer join, with a null filter on the right-side join key > added to get the desired result. This causes: > # Extra computation — The left outer join projects redundant columns from the > right side, and filtering is then done to remove the redundant rows. This can > be avoided with anti join, which projects only the required columns and rows > from the left-side table. > # Extra shuffle — In case of anti join, duplicate records can be dropped at > the child node before being moved to the join node. This can avoid a > significant amount of data movement when the number of distinct join keys is > much smaller than the total row count. > # Extra memory usage — In case of map-based anti join, a hash set is > sufficient, as just the key is required to check whether a record matches the > join condition. In case of left join, we need the key and the non-key columns > as well, and thus a hash table is required. > For a query like > {code:java} > select wr_order_number FROM web_returns LEFT JOIN web_sales ON > wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code} > the number of distinct ws_order_number values in the web_sales table in a > typical 10TB TPCDS setup is just 10% of the total records. So when we convert > this query to anti join, only 600 million rows are moved to the join node > instead of 7 billion. > In the current patch, just one conversion is done. The pattern > project->filter->left-join is converted to project->anti-join. This takes care > of subqueries with a "not exists" clause: such queries are converted first to > filter + left-join and then to anti join. Queries with "not in" are not > handled in the current patch. > On the execution side, both merge join and map join with vectorized execution > are supported for anti join. -- This message was sent by Atlassian Jira (v8.3.4#803005)
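The issue description's points can be illustrated with a small standalone sketch (plain Java, not Hive code): an anti join needs only a key set on the build side, and it emits exactly the same rows as the left-outer-join-plus-IS-NULL rewrite it replaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch (not Hive code) of why a key-only hash set suffices for an
// anti join, and why it is equivalent to the old left-outer-join + IS NULL
// rewrite described in the issue text.
public class AntiJoinSketch {
    // Anti join: keep left keys that have no match on the right side.
    static List<Long> antiJoin(List<Long> leftKeys, List<Long> rightKeys) {
        Set<Long> rightSet = new HashSet<>(rightKeys); // key only, no payload columns
        List<Long> out = new ArrayList<>();
        for (Long k : leftKeys) {
            if (!rightSet.contains(k)) {
                out.add(k);
            }
        }
        return out;
    }

    // The rewrite Hive used before: LEFT OUTER JOIN, then filter the rows
    // whose projected right-side key came out NULL (i.e. unmatched).
    static List<Long> leftJoinThenIsNullFilter(List<Long> leftKeys, List<Long> rightKeys) {
        Set<Long> rightSet = new HashSet<>(rightKeys);
        List<Long> out = new ArrayList<>();
        for (Long k : leftKeys) {
            Long rightKey = rightSet.contains(k) ? k : null; // projected right column
            if (rightKey == null) {
                out.add(k);
            }
        }
        return out;
    }
}
```

The first variant carries no right-side payload columns at all, which is the "hash set instead of hash table" saving the description mentions.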
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461841&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461841 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 22/Jul/20 03:25 Start Date: 22/Jul/20 03:25 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458511220 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java ## @@ -0,0 +1,145 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.ql.optimizer.calcite.rules; + +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptUtil; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.core.Filter; +import org.apache.calcite.rel.core.Join; +import org.apache.calcite.rel.core.JoinRelType; +import org.apache.calcite.rel.core.Project; +import org.apache.calcite.rel.type.RelDataTypeField; +import org.apache.calcite.rex.RexInputRef; +import org.apache.calcite.rex.RexNode; +import org.apache.calcite.sql.SqlKind; +import org.apache.calcite.util.ImmutableBitSet; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +/** + * Planner rule that converts a join plus filter to anti join. + */ +public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule { + protected static final Logger LOG = LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class); + public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new HiveJoinWithFilterToAntiJoinRule(); + + //HiveProject(fld=[$0]) + // HiveFilter(condition=[IS NULL($1)]) +//HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available]) + // + // TO + // + //HiveProject(fld_tbl=[$0]) + // HiveAntiJoin(condition=[=($0, $1)], joinType=[anti]) + // + public HiveJoinWithFilterToAntiJoinRule() { +super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))), +"HiveJoinWithFilterToAntiJoinRule:filter"); + } + + // is null filter over a left join. 
+ public void onMatch(final RelOptRuleCall call) { +final Project project = call.rel(0); +final Filter filter = call.rel(1); +final Join join = call.rel(2); +perform(call, project, filter, join); + } + + protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) { +LOG.debug("Matched HiveAntiJoinRule"); + +assert (filter != null); + +//We support conversion from left outer join only. +if (join.getJoinType() != JoinRelType.LEFT) { + return; +} + +List aboveFilters = RelOptUtil.conjunctions(filter.getCondition()); +boolean hasIsNull = false; + +// Get all filter condition and check if any of them is a "is null" kind. +for (RexNode filterNode : aboveFilters) { + if (filterNode.getKind() == SqlKind.IS_NULL && + isFilterFromRightSide(join, filterNode, join.getJoinType())) { +hasIsNull = true; +break; + } +} + +// Is null should be on a key from right side of the join. +if (!hasIsNull) { + return; +} + +// Build anti join with same left, right child and condition as original left outer join. +Join anti = join.copy(join.getTraitSet(), join.getCondition(), Review comment: Probably it is here where we do not create the antijoin operator explicitly, and why we end up with normal joins in the Calcite plan. Since we are creating SemiJoin and AntiJoin as different operators, I think we should follow that pattern here and create an antijoin explicitly. Nevertheless, we could possibly get rid of HiveAntiJoin and HiveSemiJoin altogether, as I mentioned in another comment, but that can be part of another JIRA. This is an automated message from the Apache Git Service.
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461842&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461842 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 22/Jul/20 03:25 Start Date: 22/Jul/20 03:25 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458511220 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java ## @@ -0,0 +1,145 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.ql.optimizer.calcite.rules; + +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptUtil; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.core.Filter; +import org.apache.calcite.rel.core.Join; +import org.apache.calcite.rel.core.JoinRelType; +import org.apache.calcite.rel.core.Project; +import org.apache.calcite.rel.type.RelDataTypeField; +import org.apache.calcite.rex.RexInputRef; +import org.apache.calcite.rex.RexNode; +import org.apache.calcite.sql.SqlKind; +import org.apache.calcite.util.ImmutableBitSet; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +/** + * Planner rule that converts a join plus filter to anti join. + */ +public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule { + protected static final Logger LOG = LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class); + public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new HiveJoinWithFilterToAntiJoinRule(); + + //HiveProject(fld=[$0]) + // HiveFilter(condition=[IS NULL($1)]) +//HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available]) + // + // TO + // + //HiveProject(fld_tbl=[$0]) + // HiveAntiJoin(condition=[=($0, $1)], joinType=[anti]) + // + public HiveJoinWithFilterToAntiJoinRule() { +super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))), +"HiveJoinWithFilterToAntiJoinRule:filter"); + } + + // is null filter over a left join. 
+ public void onMatch(final RelOptRuleCall call) { +final Project project = call.rel(0); +final Filter filter = call.rel(1); +final Join join = call.rel(2); +perform(call, project, filter, join); + } + + protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) { +LOG.debug("Matched HiveAntiJoinRule"); + +assert (filter != null); + +//We support conversion from left outer join only. +if (join.getJoinType() != JoinRelType.LEFT) { + return; +} + +List aboveFilters = RelOptUtil.conjunctions(filter.getCondition()); +boolean hasIsNull = false; + +// Get all filter condition and check if any of them is a "is null" kind. +for (RexNode filterNode : aboveFilters) { + if (filterNode.getKind() == SqlKind.IS_NULL && + isFilterFromRightSide(join, filterNode, join.getJoinType())) { +hasIsNull = true; +break; + } +} + +// Is null should be on a key from right side of the join. +if (!hasIsNull) { + return; +} + +// Build anti join with same left, right child and condition as original left outer join. +Join anti = join.copy(join.getTraitSet(), join.getCondition(), Review comment: Probably it is here where we do not create the antijoin operator explicitly, and why we end up with normal joins in the Calcite plan. Since we are creating SemiJoin and AntiJoin as different operators, I think we should follow that pattern here and create an antijoin explicitly, or use the builder (you can look at `HiveSemiJoinRule`). Nevertheless, we could possibly get rid of HiveAntiJoin and HiveSemiJoin altogether, as I mentioned in another comment, but that can be part of another JIRA. --
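The `isFilterFromRightSide` check the rule relies on uses how Calcite numbers a join's output fields: the left input's fields come first, then the right input's. A minimal sketch of that index arithmetic (hypothetical helper for illustration, not the rule's actual code):

```java
// In a Calcite join's row type, field indices [0, leftFieldCount) belong to
// the left input and [leftFieldCount, totalFieldCount) to the right input.
// An IS NULL filter references the right side iff every input ref it uses
// falls in the second range. (Hypothetical helper for illustration only.)
public class JoinFieldSide {
    static boolean isFromRightSide(int fieldIndex, int leftFieldCount, int totalFieldCount) {
        return fieldIndex >= leftFieldCount && fieldIndex < totalFieldCount;
    }
}
```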
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461840&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461840 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 22/Jul/20 03:19 Start Date: 22/Jul/20 03:19 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r457831207 ## File path: parser/src/java/org/apache/hadoop/hive/ql/parse/FromClauseParser.g ## @@ -145,6 +145,7 @@ joinToken | KW_RIGHT (KW_OUTER)? KW_JOIN -> TOK_RIGHTOUTERJOIN | KW_FULL (KW_OUTER)? KW_JOIN -> TOK_FULLOUTERJOIN | KW_LEFT KW_SEMI KW_JOIN -> TOK_LEFTSEMIJOIN +| KW_ANTI KW_JOIN -> TOK_ANTIJOIN Review comment: Since we are exposing this and to prevent any ambiguity, should we use: `KW_LEFT KW_ANTI KW_JOIN -> TOK_LEFTANTISEMIJOIN` ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java ## @@ -509,11 +513,17 @@ protected void addToAliasFilterTags(byte alias, List object, boolean isN } } + private void createForwardJoinObjectForAntiJoin(boolean[] skip) throws HiveException { +boolean forward = fillFwdCache(skip); Review comment: nit. Fwd -> Forward ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptUtil.java ## @@ -747,7 +747,7 @@ public static RewritablePKFKJoinInfo isRewritablePKFKJoin(Join join, final RelNode nonFkInput = leftInputPotentialFK ? join.getRight() : join.getLeft(); final RewritablePKFKJoinInfo nonRewritable = RewritablePKFKJoinInfo.of(false, null); -if (joinType != JoinRelType.INNER && !join.isSemiJoin()) { +if (joinType != JoinRelType.INNER && !join.isSemiJoin() && joinType != JoinRelType.ANTI) { Review comment: This is interesting. An antijoin of a PK-FK join returns no rows? Can we create a JIRA for such optimization based on integrity constraints? 
## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinConstraintsRule.java ## @@ -100,7 +100,8 @@ public void onMatch(RelOptRuleCall call) { // These boolean values represent corresponding left, right input which is potential FK boolean leftInputPotentialFK = topRefs.intersects(leftBits); boolean rightInputPotentialFK = topRefs.intersects(rightBits); -if (leftInputPotentialFK && rightInputPotentialFK && (joinType == JoinRelType.INNER || joinType == JoinRelType.SEMI)) { +if (leftInputPotentialFK && rightInputPotentialFK && +(joinType == JoinRelType.INNER || joinType == JoinRelType.SEMI || joinType == JoinRelType.ANTI)) { Review comment: This is not correct and needs further thinking. If we have a PK-FK join that is only appending columns to the FK side, it basically means it is not filtering anything (everything is matching). If that is the case, then ANTIJOIN result would be empty? We could detect this at planning time and trigger the rewriting. Could we bail out from the rule if it is an ANTIJOIN and create a follow-up JIRA to tackle this and introduce further tests? ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java ## @@ -0,0 +1,149 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.ql.optimizer.calcite.rules; + +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptUtil; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.core.Filter; +import org.apache.calcite.rel.core.Join; +import org.apache.calcite.rel.core.JoinRelType; +import org.apache.calcite.rel.core.Project; +import org.apache.calcite.rel.type.RelDataTypeField; +import org.apache.calcite.rex.RexInputRef; +import org.apache.calcite.rex.RexNode; +import org.apache.calcite.sql.SqlKind; +import org.apache.calcite.util.ImmutableBitSet; +import org.slf4j.Logger; +import
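The reviewer's observation about `HiveJoinConstraintsRule` can be checked with a toy example (plain Java, not the rule's code): when a PK-FK join filters nothing because every foreign-key value is guaranteed to match a primary key, the anti join of the same inputs must be empty, which is why the rule cannot simply treat ANTI like INNER or SEMI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration (not Hive code) of the PK-FK point in the review above:
// if every foreign-key value on the left matches a primary key on the right
// (the join filters nothing), the anti join of the same inputs is empty.
public class PkFkAntiJoin {
    static List<Long> antiJoin(List<Long> leftKeys, Set<Long> rightKeys) {
        List<Long> out = new ArrayList<>();
        for (Long k : leftKeys) {
            if (!rightKeys.contains(k)) {
                out.add(k);
            }
        }
        return out;
    }
}
```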
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461650&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461650 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 15:53 Start Date: 21/Jul/20 15:53 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458205456 ## File path: ql/src/java/org/apache/hadoop/hive/ql/ppd/SyntheticJoinPredicate.java ## @@ -339,6 +339,12 @@ String getFuncText(String funcText, final int srcPos) { vector.add(right, left); break; case JoinDesc.LEFT_OUTER_JOIN: +case JoinDesc.ANTI_JOIN: +//TODO : In case of anti join, bloom filter can be created on left side also ("IN (keylist right table)"). +// But the filter should be "not-in" ("NOT IN (keylist right table)") as we want to select the records from +// left side which are not present in the right side. But it may cause wrong result as +// bloom filter may have false positive and thus simply adding not is not correct, +// special handling is required for "NOT IN". Review comment: Makes sense; for this particular purpose, in the future we could use something like "the opposite of a Bloom filter" to support such cases: https://github.com/jmhodges/opposite_of_a_bloom_filter/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 461650) Time Spent: 5h (was: 4h 50m) -- This message was sent by Atlassian Jira (v8.3.4#803005)
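The TODO quoted above rests on an asymmetry of Bloom filters that is easy to demonstrate: `mightContain` can return true for a key that was never inserted (a false positive) but never returns false for an inserted key. Pushing down `IN` for a semi join therefore only lets extra rows through, which the real join later discards, while negating the filter for `NOT IN` could wrongly drop rows. A toy filter (illustrative only, not Hive's implementation):

```java
// Toy 64-bit Bloom filter (illustrative only, not Hive's implementation).
// It has no false negatives, so an "IN (right keys)" pushdown is safe: a few
// non-matching rows may slip through and are filtered by the actual join.
// Negating mightContain() for "NOT IN" is unsafe, because a false positive
// would make the anti join drop a row it is required to keep.
public class ToyBloom {
    private static final int BITS = 64;
    private long bits = 0L;

    void add(long key) {
        bits |= (1L << hash1(key)) | (1L << hash2(key));
    }

    // May return true for keys never added (false positive); never returns
    // false for an added key (no false negatives).
    boolean mightContain(long key) {
        long mask = (1L << hash1(key)) | (1L << hash2(key));
        return (bits & mask) == mask;
    }

    private static int hash1(long k) { return Math.floorMod(Long.hashCode(k), BITS); }
    private static int hash2(long k) { return Math.floorMod(Long.hashCode(k * 31 + 17), BITS); }
}
```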
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461651&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461651 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 15:53 Start Date: 21/Jul/20 15:53 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458205958 ## File path: ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/TestMapJoinOperator.java ## @@ -1792,6 +1794,8 @@ private void executeTest(MapJoinTestDescription testDesc, MapJoinTestData testDa case FULL_OUTER: executeTestFullOuter(testDesc, testData, title); break; +case ANTI: //TODO Review comment: Shall we open a ticket to track this? What is the main challenge here? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 461651) Time Spent: 5h 10m (was: 5h) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461643&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461643 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 15:33 Start Date: 21/Jul/20 15:33 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458190642 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java ## @@ -74,7 +78,14 @@ public HiveJoinAddNotNullRule(Class clazz, @Override public void onMatch(RelOptRuleCall call) { Join join = call.rel(0); -if (join.getJoinType() == JoinRelType.FULL || join.getCondition().isAlwaysTrue()) { + +// For anti join case add the not null on right side if the condition is Review comment: Not sure I understand the issue here -- is the problem the fact that ANTI-join matches with NULL rows on the right side? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 461643) Time Spent: 4h 50m (was: 4h 40m) > Support Anti Join in Hive > -- > > Key: HIVE-23716 > URL: https://issues.apache.org/jira/browse/HIVE-23716 > Project: Hive > Issue Type: Bug >Reporter: mahesh kumar behera >Assignee: mahesh kumar behera >Priority: Major > Labels: pull-request-available > Attachments: HIVE-23716.01.patch > > Time Spent: 4h 50m > Remaining Estimate: 0h > > Currently hive does not support Anti join. The query for anti join is > converted to left outer join and null filter on right side join key is added > to get the desired result. This is causing > # Extra computation — The left outer join projects the redundant columns > from right side. 
Along with that, filtering is done to remove the redundant > rows. This can be avoided with an anti join, since an anti join projects > only the required columns and rows from the left-side table. > # Extra shuffle — In case of anti join, the duplicate records moved to the join > node can be eliminated at the child node. This can save a significant amount > of data movement when the number of distinct rows (join keys) is small relative to the total. > # Extra Memory Usage - In case of map-based anti join, a hash set is > sufficient, as only the key is needed to check whether a record matches the > join condition. In case of left join, we need the key and the non-key columns > as well, and thus a hash table is required. > For a query like > {code:java} > select wr_order_number FROM web_returns LEFT JOIN web_sales ON > wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code} > The number of distinct ws_order_number values in the web_sales table of a typical 10TB > TPCDS setup is just 10% of the total records. So when we convert this query to an > anti join, only 600 million rows are moved to the join > node instead of 7 billion. > In the current patch, just one conversion is done: the pattern > project->filter->left-join is converted to project->anti-join. This takes > care of subqueries with a “not exists” clause. Queries with “not exists” > are first converted to filter + left-join and then converted to anti > join. Queries with “not in” are not handled in the current patch. > From the execution side, both merge join and map join with vectorized execution > are supported for anti join. -- This message was sent by Atlassian Jira (v8.3.4#803005)
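The two plans contrasted in the issue description can be sketched with plain Java collections. This is a toy model, not Hive's operator code, and all class and method names here are made up for illustration: the anti join needs only a hash *set* of right-side keys and emits left rows whose key is absent, while the left-outer-join rewrite first materializes a right-side key column per row and then filters on it being NULL.

```java
import java.util.*;

public class AntiJoinSketch {
    // Anti join: probe a hash set of right-side join keys; a left row
    // (key in column 0) survives only when its key is absent from the set.
    static List<long[]> antiJoin(List<long[]> left, Set<Long> rightKeys) {
        List<long[]> out = new ArrayList<>();
        for (long[] row : left) {
            if (!rightKeys.contains(row[0])) {
                out.add(row); // left columns only, no right-side projection
            }
        }
        return out;
    }

    // The rewrite described above: LEFT OUTER JOIN, then "right key IS NULL".
    static List<long[]> leftOuterThenNullFilter(List<long[]> left, Set<Long> rightKeys) {
        // Step 1: the outer join materializes a right-side key column
        // (null when unmatched) that the anti join never needs.
        List<Object[]> joined = new ArrayList<>();
        for (long[] row : left) {
            Long rightKey = rightKeys.contains(row[0]) ? row[0] : null;
            joined.add(new Object[]{row, rightKey});
        }
        // Step 2: filter on the null key and drop the extra column again.
        List<long[]> out = new ArrayList<>();
        for (Object[] j : joined) {
            if (j[1] == null) {
                out.add((long[]) j[0]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> webReturns = Arrays.asList(
                new long[]{1, 10}, new long[]{2, 20}, new long[]{3, 30});
        Set<Long> wsOrderNumbers = new HashSet<>(Arrays.asList(1L, 3L));
        System.out.println(antiJoin(webReturns, wsOrderNumbers).size());               // 1
        System.out.println(leftOuterThenNullFilter(webReturns, wsOrderNumbers).size()); // 1
    }
}
```

Both paths return the same single row (key 2); the difference the patch targets is the work done on the way there, not the result.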
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461637&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461637 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 15:19 Start Date: 21/Jul/20 15:19 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458180264 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinStringOperator.java ## @@ -0,0 +1,371 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.ql.exec.vector.mapjoin; + +import org.apache.hadoop.hive.ql.CompilationOpContext; +import org.apache.hadoop.hive.ql.exec.JoinUtil; +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector; +import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.exec.vector.expressions.StringExpr; +import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinBytesHashSet; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.ql.plan.OperatorDesc; +import org.apache.hadoop.hive.ql.plan.VectorDesc; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Arrays; + +// Single-Column String hash table import. +// Single-Column String specific imports. + +// TODO : Duplicate codes need to merge with semi join. +/* + * Specialized class for doing a vectorized map join that is an anti join on a Single-Column String + * using a hash set. + */ +public class VectorMapJoinAntiJoinStringOperator extends VectorMapJoinAntiJoinGenerateResultOperator { + + private static final long serialVersionUID = 1L; + + // + + private static final String CLASS_NAME = VectorMapJoinAntiJoinStringOperator.class.getName(); + private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME); + + protected String getLoggingPrefix() { +return super.getLoggingPrefix(CLASS_NAME); + } + + // + + // (none) + + // The above members are initialized by the constructor and must not be + // transient. + //--- + + // The hash map for this specialized class. + private transient VectorMapJoinBytesHashSet hashSet; + + //--- + // Single-Column String specific members. + // + + // The column number for this one column join specialization. + private transient int singleJoinColumn; + + //--- + // Pass-thru constructors. 
+ // + + /** Kryo ctor. */ + protected VectorMapJoinAntiJoinStringOperator() { +super(); + } + + public VectorMapJoinAntiJoinStringOperator(CompilationOpContext ctx) { +super(ctx); + } + + public VectorMapJoinAntiJoinStringOperator(CompilationOpContext ctx, OperatorDesc conf, + VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException { +super(ctx, conf, vContext, vectorDesc); + } + + //--- + // Process Single-Column String anti Join on a vectorized row batch. + // + + @Override + protected void commonSetup() throws HiveException { +super.commonSetup(); + +/* + * Initialize Single-Column String members for this specialized class. + */ + +singleJoinColumn = bigTableKeyColumnMap[0]; + } + + @Override + public void hashTableSetup() throws HiveException { +super.hashTableSetup(); + +/* + * Get our Single-Column String hash set information for this
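The string operator above probes a `VectorMapJoinBytesHashSet` per batch. A minimal sketch of what such a probe loop does (an assumed shape for illustration, not the actual Hive code — `probeBatch` and its parameters are made-up names) is: look each key up in the set built from the small table and select only the rows whose key is absent.

```java
import java.util.*;

public class VectorAntiJoinProbeSketch {
    // Toy probe loop for a single-column string anti join: walk the key
    // column of a batch and record the indices of rows with NO match.
    static int probeBatch(String[] keyColumn, int size, Set<String> hashSet, int[] selectedOut) {
        int n = 0;
        for (int i = 0; i < size; i++) {
            String key = keyColumn[i];
            // Under NOT EXISTS semantics, a NULL left key cannot equal any
            // right key, so the row qualifies for the anti join.
            if (key == null || !hashSet.contains(key)) {
                selectedOut[n++] = i;
            }
        }
        return n; // number of rows that survive the anti join
    }

    public static void main(String[] args) {
        String[] batch = {"a", "b", null, "c"};
        Set<String> smallTableKeys = new HashSet<>(Arrays.asList("a", "c"));
        int[] selected = new int[batch.length];
        int n = probeBatch(batch, batch.length, smallTableKeys, selected);
        System.out.println(n + " " + selected[0] + " " + selected[1]); // 2 1 2
    }
}
```

Note that only set membership is consulted — no values are fetched — which is exactly why a hash set suffices here where other join types need a hash map.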
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461636&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461636 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 15:17 Start Date: 21/Jul/20 15:17 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458178786 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinMultiKeyOperator.java ## @@ -0,0 +1,400 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.ql.exec.vector.mapjoin; + +import org.apache.hadoop.hive.ql.CompilationOpContext; +import org.apache.hadoop.hive.ql.exec.JoinUtil; +import org.apache.hadoop.hive.ql.exec.vector.VectorSerializeRow; +import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinBytesHashSet; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.ql.plan.OperatorDesc; +import org.apache.hadoop.hive.ql.plan.VectorDesc; +import org.apache.hadoop.hive.serde2.ByteStream.Output; +import org.apache.hadoop.hive.serde2.binarysortable.fast.BinarySortableSerializeWrite; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Arrays; + +// Multi-Key hash table import. +// Multi-Key specific imports. + +// TODO : Duplicate codes need to merge with semi join. +/* + * Specialized class for doing a vectorized map join that is an anti join on Multi-Key + * using hash set. + */ +public class VectorMapJoinAntiJoinMultiKeyOperator extends VectorMapJoinAntiJoinGenerateResultOperator { + + private static final long serialVersionUID = 1L; + + // + + private static final String CLASS_NAME = VectorMapJoinAntiJoinMultiKeyOperator.class.getName(); + private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME); + + protected String getLoggingPrefix() { +return super.getLoggingPrefix(CLASS_NAME); + } + + // + + // (none) + + // The above members are initialized by the constructor and must not be + // transient. + //--- + + // The hash map for this specialized class. + private transient VectorMapJoinBytesHashSet hashSet; + + //--- + // Multi-Key specific members. + // + + // Object that can take a set of columns in row in a vectorized row batch and serialized it. 
+ // Known to not have any nulls. + private transient VectorSerializeRow keyVectorSerializeWrite; + + // The BinarySortable serialization of the current key. + private transient Output currentKeyOutput; + + // The BinarySortable serialization of the saved key for a possible series of equal keys. + private transient Output saveKeyOutput; + + //--- + // Pass-thru constructors. + // + + /** Kryo ctor. */ + protected VectorMapJoinAntiJoinMultiKeyOperator() { +super(); + } + + public VectorMapJoinAntiJoinMultiKeyOperator(CompilationOpContext ctx) { +super(ctx); + } + + public VectorMapJoinAntiJoinMultiKeyOperator(CompilationOpContext ctx, OperatorDesc conf, + VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException { +super(ctx, conf, vContext, vectorDesc); + } + + //--- + // Process Multi-Key Anti Join on a vectorized row batch. + // + + @Override + protected void commonSetup() throws H
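The multi-key operator above serializes several key columns into one byte key (via `BinarySortableSerializeWrite` in the real code) so that a bytes-keyed hash set can be probed with a single lookup. A simplified stand-in (assumed behavior for illustration; strings instead of the real binary-sortable encoding, and made-up method names):

```java
import java.util.*;

public class MultiKeyAntiJoinSketch {
    // Pack all key columns of one row into a single string so a set keyed
    // on the serialized form can answer membership with one lookup.
    static String serializeKey(Object[] row, int[] keyColumns) {
        StringBuilder sb = new StringBuilder();
        for (int c : keyColumns) {
            sb.append(row[c]).append('\0'); // '\0' as a field separator
        }
        return sb.toString();
    }

    // Anti join: the row qualifies when its serialized key is absent.
    static boolean antiJoinQualifies(Object[] row, int[] keyColumns, Set<String> serializedRightKeys) {
        return !serializedRightKeys.contains(serializeKey(row, keyColumns));
    }

    public static void main(String[] args) {
        int[] keyCols = {0, 1};
        Set<String> rightKeys = new HashSet<>();
        rightKeys.add(serializeKey(new Object[]{"x", 1}, keyCols));
        System.out.println(antiJoinQualifies(new Object[]{"x", 1, "payload"}, keyCols, rightKeys)); // false
        System.out.println(antiJoinQualifies(new Object[]{"x", 2, "payload"}, keyCols, rightKeys)); // true
    }
}
```

The real encoding must additionally be order-preserving and null-aware, which is what the BinarySortable format provides; the separator trick above is only enough to show the single-lookup idea.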
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461628&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461628 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 14:58 Start Date: 21/Jul/20 14:58 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458164592 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ## @@ -0,0 +1,315 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.ql.exec.vector.mapjoin; + +import org.apache.hadoop.hive.ql.CompilationOpContext; +import org.apache.hadoop.hive.ql.exec.JoinUtil; +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector; +import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.ql.plan.OperatorDesc; +import org.apache.hadoop.hive.ql.plan.VectorDesc; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Arrays; + +// TODO : Duplicate codes need to merge with semi join. +// Single-Column Long hash table import. +// Single-Column Long specific imports. + +/* + * Specialized class for doing a vectorized map join that is an anti join on a Single-Column Long + * using a hash set. + */ +public class VectorMapJoinAntiJoinLongOperator extends VectorMapJoinAntiJoinGenerateResultOperator { + + private static final long serialVersionUID = 1L; + private static final String CLASS_NAME = VectorMapJoinAntiJoinLongOperator.class.getName(); + private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME); + protected String getLoggingPrefix() { +return super.getLoggingPrefix(CLASS_NAME); + } + + // The above members are initialized by the constructor and must not be + // transient. + + // The hash map for this specialized class. + private transient VectorMapJoinLongHashSet hashSet; + + // Single-Column Long specific members. + // For integers, we have optional min/max filtering. + private transient boolean useMinMax; + private transient long min; + private transient long max; + + // The column number for this one column join specialization. + private transient int singleJoinColumn; + + // Pass-thru constructors. 
+ /** Kryo ctor. */ + protected VectorMapJoinAntiJoinLongOperator() { +super(); + } + + public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) { +super(ctx); + } + + public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, OperatorDesc conf, + VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException { +super(ctx, conf, vContext, vectorDesc); + } + + // Process Single-Column Long Anti Join on a vectorized row batch. + @Override + protected void commonSetup() throws HiveException { +super.commonSetup(); + +// Initialize Single-Column Long members for this specialized class. +singleJoinColumn = bigTableKeyColumnMap[0]; + } + + @Override + public void hashTableSetup() throws HiveException { +super.hashTableSetup(); + +// Get our Single-Column Long hash set information for this specialized class. +hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable; +useMinMax = hashSet.useMinMax(); +if (useMinMax) { + min = hashSet.min(); + max = hashSet.max(); +} + } + + @Override + public void processBatch(VectorizedRowBatch batch) throws HiveException { + +try { + // (Currently none) + // antiPerBatchSetup(batch); + + // For anti joins, we may apply the filter(s) now. + for(VectorExpression ve : bigTableFilterExpressions) { +ve.evaluate(batch); + } +
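The `useMinMax`/`min`/`max` members in the long-key operator above hint at a range short-circuit. A sketch of that idea (assumed behavior, with made-up names — not the actual operator logic): a key outside `[min, max]` cannot possibly be in the hash set, so for an anti join the row qualifies immediately, without a hash lookup at all.

```java
import java.util.*;

public class MinMaxAntiJoinSketch {
    // Min/max pre-filter for a long-keyed anti join: keys outside the
    // [min, max] range of the small table are guaranteed non-matches.
    static boolean qualifies(long key, boolean useMinMax, long min, long max, Set<Long> hashSet) {
        if (useMinMax && (key < min || key > max)) {
            return true; // no lookup needed: definitely unmatched
        }
        return !hashSet.contains(key);
    }

    public static void main(String[] args) {
        Set<Long> smallTableKeys = new HashSet<>(Arrays.asList(5L, 9L));
        System.out.println(qualifies(100L, true, 5L, 9L, smallTableKeys)); // true: outside [5, 9]
        System.out.println(qualifies(5L, true, 5L, 9L, smallTableKeys));   // false: present in the set
        System.out.println(qualifies(7L, true, 5L, 9L, smallTableKeys));   // true: in range but absent
    }
}
```

Notice the asymmetry with semi join: there an out-of-range key lets the row be *dropped* early, while for anti join the same test lets the row be *emitted* early.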
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461633&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461633 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 15:05 Start Date: 21/Jul/20 15:05 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458169457 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ## @@ -0,0 +1,315 @@
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461634&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461634 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 15:05 Start Date: 21/Jul/20 15:05 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458169723 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinMultiKeyOperator.java ## @@ -0,0 +1,400 @@
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461630&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461630 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 14:58 Start Date: 21/Jul/20 14:58 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458167104 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ## @@ -0,0 +1,315 @@
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461632&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461632 ]

ASF GitHub Bot logged work on HIVE-23716:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 21/Jul/20 15:02
    Start Date: 21/Jul/20 15:02
    Worklog Time Spent: 10m

Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458167515

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ##

@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+
+  protected String getLoggingPrefix() {
+    return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+    super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+    super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, OperatorDesc conf,
+      VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException {
+    super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+
+  @Override
+  protected void commonSetup() throws HiveException {
+    super.commonSetup();
+
+    // Initialize Single-Column Long members for this specialized class.
+    singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+    super.hashTableSetup();
+
+    // Get our Single-Column Long hash set information for this specialized class.
+    hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+    useMinMax = hashSet.useMinMax();
+    if (useMinMax) {
+      min = hashSet.min();
+      max = hashSet.max();
+    }
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+    try {
+      // (Currently none)
+      // antiPerBatchSetup(batch);
+
+      // For anti joins, we may apply the filter(s) now.
+      for (VectorExpression ve : bigTableFilterExpressions) {
+        ve.evaluate(batch);
+      }
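The `hashTableSetup`/`processBatch` pair above relies on an optional min/max pre-filter for long keys: a probe key outside the build side's observed key range can be rejected without touching the hash set at all. The following is a rough illustration in plain Java with hypothetical names (it is not the Hive `VectorMapJoinLongHashSet` API), showing why the operator caches `min` and `max` once per hash-table setup.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a long-keyed set with a min/max range pre-filter,
// analogous in spirit to the useMinMax/min/max members cached above.
public class MinMaxHashSetProbe {
    private final Set<Long> keys = new HashSet<>();
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    // Build side: record each key and track the observed key range.
    void add(long key) {
        keys.add(key);
        min = Math.min(min, key);
        max = Math.max(max, key);
    }

    // Probe side: for an anti join, a row is emitted only when this is false.
    boolean contains(long key) {
        if (key < min || key > max) {
            return false; // cheap range check, no hashing needed
        }
        return keys.contains(key);
    }

    public static void main(String[] args) {
        MinMaxHashSetProbe probe = new MinMaxHashSetProbe();
        probe.add(10L);
        probe.add(20L);
        System.out.println(probe.contains(5L));   // false via the range check alone
        System.out.println(probe.contains(10L));  // true via the hash lookup
        System.out.println(probe.contains(15L));  // false via the hash lookup
    }
}
```

The range check is purely an optimization: it never changes the result, it only short-circuits lookups for keys that cannot possibly be in the set.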
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461629&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461629 ]

ASF GitHub Bot logged work on HIVE-23716:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 21/Jul/20 15:00
    Start Date: 21/Jul/20 15:00
    Worklog Time Spent: 10m

Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458165844

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ##
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461627&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461627 ]

ASF GitHub Bot logged work on HIVE-23716:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 21/Jul/20 14:57
    Start Date: 21/Jul/20 14:57
    Worklog Time Spent: 10m

Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458162559

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ##
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461626&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461626 ]

ASF GitHub Bot logged work on HIVE-23716:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 21/Jul/20 14:56
    Start Date: 21/Jul/20 14:56
    Worklog Time Spent: 10m

Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458162559

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ##
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461622&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461622 ]

ASF GitHub Bot logged work on HIVE-23716:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 21/Jul/20 14:52
    Start Date: 21/Jul/20 14:52
    Worklog Time Spent: 10m

Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458158828

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ##
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461620&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461620 ]

ASF GitHub Bot logged work on HIVE-23716:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 21/Jul/20 14:51
    Start Date: 21/Jul/20 14:51
    Worklog Time Spent: 10m

Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458158828

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ##
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461619&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461619 ]

ASF GitHub Bot logged work on HIVE-23716:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 21/Jul/20 14:49
    Start Date: 21/Jul/20 14:49
    Worklog Time Spent: 10m

Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458157216

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java ##

+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+    try {
+      // (Currently none)

Review comment:
    leftover?

This is an automated message from the Apache Git Service.
To respond to the message, please log
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461616&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461616 ]

ASF GitHub Bot logged work on HIVE-23716:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 21/Jul/20 14:42
    Start Date: 21/Jul/20 14:42
    Worklog Time Spent: 10m

Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458151829

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java ##

@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+
+// TODO : This class is duplicate of semi join. Need to do a refactoring to merge it with semi join.
+/**
+ * This class has methods for generating vectorized join results for Anti joins.
+ * The big difference between inner joins and anti joins is existence testing.
+ * Inner joins use a hash map to lookup the 1 or more small table values.
+ * Anti joins are a specialized join for outputting big table rows whose key exists

Review comment:
    nit: whose key DOES NOT exist

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 461616) Time Spent: 2.5h (was: 2h 20m) > Support Anti Join in Hive > -- > > Key: HIVE-23716 > URL: https://issues.apache.org/jira/browse/HIVE-23716 > Project: Hive > Issue Type: Bug >Reporter: mahesh kumar behera >Assignee: mahesh kumar behera >Priority: Major > Labels: pull-request-available > Attachments: HIVE-23716.01.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > Currently Hive does not support anti join. A query requiring an anti join is > converted to a left outer join, and a null filter on the right-side join key is added > to get the desired result. This causes: > # Extra computation — The left outer join projects redundant columns > from the right side, and additional filtering is done to remove the redundant > rows. This can be avoided with an anti join, which projects > only the required columns and rows from the left-side table. > # Extra shuffle — With an anti join, duplicate records can be dropped at the child node > instead of being moved to the join node. This can save a significant amount > of data movement when the number of distinct rows (join keys) is significant. > # Extra memory usage — For a map-based anti join, a hash set is > sufficient, since only the key is needed to check whether a record matches the > join condition. For a left join, we need the key and the non-key columns > as well, and thus a hash table is required. > For a query like > {code:java} > select wr_order_number FROM web_returns LEFT JOIN web_sales ON > wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code} > the number of distinct ws_order_number values in the web_sales table in a typical 10TB > TPCDS setup is just 10% of the total records. So when we convert this query to an > anti join, instead of 7 billion rows, only 600 million rows are moved to the join > node. > In the current patch, just one conversion is done: the pattern > project->filter->left-join is converted to project->anti-join. This takes > care of subqueries with a "not exists" clause. Queries with "not exists" > are converted first to filter + left-join and then to anti > join. Queries with "not in" are not handled in the current patch. > On the execution side, both merge join and map join with vectorized execution > are supported for anti join.
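To make point 3 above concrete, here is a minimal, self-contained Java sketch (illustrative code, not Hive's implementation) showing that an anti join driven by a hash set of right-side join keys returns the same rows as the LEFT JOIN plus IS NULL filter, while never storing right-side payload columns:

```java
import java.util.*;

// Hypothetical sketch: LEFT JOIN + IS NULL vs. anti join over integer join keys.
// The anti join needs only a hash SET of right-side keys; the left-join variant
// keeps a hash MAP carrying a right-side payload column it never really uses.
public class AntiJoinSketch {

    // LEFT JOIN on key, then keep rows where the right side is NULL.
    static List<Integer> leftJoinIsNull(List<Integer> leftKeys, List<Integer> rightKeys) {
        Map<Integer, Integer> rightMap = new HashMap<>();   // key -> some right column
        for (Integer k : rightKeys) {
            // Duplicate right keys collapse here; that does not change the
            // IS NULL-filtered result, because matched left rows are dropped anyway.
            rightMap.put(k, k);
        }
        List<Integer> out = new ArrayList<>();
        for (Integer k : leftKeys) {
            Integer right = rightMap.get(k);                // null when unmatched
            if (right == null) {
                out.add(k);                                 // WHERE right_key IS NULL
            }
        }
        return out;
    }

    // Anti join: a set of right keys is enough to decide existence.
    static List<Integer> antiJoin(List<Integer> leftKeys, List<Integer> rightKeys) {
        Set<Integer> rightSet = new HashSet<>(rightKeys);   // set, no payload columns
        List<Integer> out = new ArrayList<>();
        for (Integer k : leftKeys) {
            if (!rightSet.contains(k)) {
                out.add(k);
            }
        }
        return out;
    }
}
```

Both methods project only left-side values, which is why the anti-join form can also drop right-side columns from the shuffle entirely.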
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461615&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461615 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 14:37 Start Date: 21/Jul/20 14:37 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458147849 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java ## @@ -0,0 +1,218 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.ql.exec.vector.mapjoin; + +import org.apache.hadoop.hive.ql.CompilationOpContext; +import org.apache.hadoop.hive.ql.exec.JoinUtil; +import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult; +import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.ql.plan.OperatorDesc; +import org.apache.hadoop.hive.ql.plan.VectorDesc; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; + +// TODO : This class is duplicate of semi join. Need to do a refactoring to merge it with semi join. +/** + * This class has methods for generating vectorized join results for Anti joins. + * The big difference between inner joins and anti joins is existence testing. + * Inner joins use a hash map to lookup the 1 or more small table values. + * Anti joins are a specialized join for outputting big table rows whose key exists + * in the small table. + * + * No small table values are needed for anti since they would be empty. So, + * we use a hash set as the hash table. Hash sets just report whether a key exists. This + * is a big performance optimization. + */ +public abstract class VectorMapJoinAntiJoinGenerateResultOperator +extends VectorMapJoinGenerateResultOperator { + + private static final long serialVersionUID = 1L; + private static final Logger LOG = LoggerFactory.getLogger(VectorMapJoinAntiJoinGenerateResultOperator.class.getName()); + + // Anti join specific members. 
+ + // An array of hash set results so we can do lookups on the whole batch before output result + // generation. + protected transient VectorMapJoinHashSetResult hashSetResults[]; + + // Pre-allocated member for storing the (physical) batch index of matching row (single- or + // multi-small-table-valued) indexes during a process call. + protected transient int[] allMatchs; + + // Pre-allocated member for storing the (physical) batch index of rows that need to be spilled. + protected transient int[] spills; + + // Pre-allocated member for storing index into the hashSetResults for each spilled row. + protected transient int[] spillHashMapResultIndices; + + /** Kryo ctor. */ + protected VectorMapJoinAntiJoinGenerateResultOperator() { +super(); + } + + public VectorMapJoinAntiJoinGenerateResultOperator(CompilationOpContext ctx) { +super(ctx); + } + + public VectorMapJoinAntiJoinGenerateResultOperator(CompilationOpContext ctx, OperatorDesc conf, + VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException { +super(ctx, conf, vContext, vectorDesc); + } + + /* + * Setup our anti join specific members. + */ + protected void commonSetup() throws HiveException { +super.commonSetup(); + +// Anti join specific. +VectorMapJoinHashSet baseHashSet = (VectorMapJoinHashSet) vectorMapJoinHashTable; + +hashSetResults = new VectorMapJoinHashSetResult[VectorizedRowBatch.DEFAULT_SIZE]; +for (int i = 0; i < hashSetRe
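The batch-oriented existence test that the quoted operator performs over a whole VectorizedRowBatch can be sketched in plain Java. This is a hedged illustration with invented names (`probeBatch`, `smallTableKeys`), not the actual Hive API:

```java
import java.util.*;

// Rough sketch of the batch-probe idea: test every row of a batch against the
// small-table hash set first, then generate output only for rows whose key was
// NOT found. The pre-allocated index array mirrors members like allMatchs above.
public class AntiJoinBatchProbe {

    // Returns the physical batch indices the anti join should forward
    // (rows whose key does not exist in the small table).
    static int[] probeBatch(long[] batchKeys, Set<Long> smallTableKeys) {
        int[] selected = new int[batchKeys.length];   // pre-allocated per batch
        int count = 0;
        for (int i = 0; i < batchKeys.length; i++) {
            if (!smallTableKeys.contains(batchKeys[i])) {
                selected[count++] = i;                // remember the batch index
            }
        }
        return Arrays.copyOf(selected, count);
    }
}
```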
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461613&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461613 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 14:31 Start Date: 21/Jul/20 14:31 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458143378 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java ## @@ -523,11 +533,19 @@ private boolean createForwardJoinObject(boolean[] skip) throws HiveException { forward = true; } } +return forward; + } + + // returns whether a record was forwarded + private boolean createForwardJoinObject(boolean[] skip, boolean antiJoin) throws HiveException { +boolean forward = fillFwdCache(skip); if (forward) { if (needsPostEvaluation) { forward = !JoinUtil.isFiltered(forwardCache, residualJoinFilters, residualJoinFiltersOIs); } - if (forward) { + + // For anti join, check all right side and if nothing is matched then only forward. Review comment: Not sure I fully understand the comment here -- !forward (false) and antijoin (true) will still skip the object Issue Time Tracking --- Worklog Id: (was: 461613) Time Spent: 2h 10m (was: 2h) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461604&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461604 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 14:22 Start Date: 21/Jul/20 14:22 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458135999 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java ## @@ -638,6 +657,12 @@ private void genObject(int aliasNum, boolean allLeftFirst, boolean allLeftNull) // skipping the rest of the rows in the rhs table of the semijoin done = !needsPostEvaluation; } + } else if (type == JoinDesc.ANTI_JOIN) { +if (innerJoin(skip, left, right)) { + // if anti join found a match then the condition is not matched for anti join, so we can skip rest of the Review comment: nit: if inner join found a match. Issue Time Tracking --- Worklog Id: (was: 461604) Time Spent: 2h (was: 1h 50m)
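The short-circuit that the quoted genObject change relies on can be illustrated standalone: once any right-side row matches, the anti-join verdict for the left row is final (do not forward), so the remaining right-side rows can be skipped. A minimal sketch with invented method names, not Hive code:

```java
// Illustrative short-circuit for row-mode anti join: stop probing the right
// side as soon as a match is found, since the left row is then dropped.
public class AntiJoinShortCircuit {

    // Returns true when the left row should be forwarded; probedOut[0] reports
    // how many right rows were actually inspected before the decision.
    static boolean processLeftRow(long leftKey, long[] rightKeys, int[] probedOut) {
        int probed = 0;
        for (long rk : rightKeys) {
            probed++;
            if (rk == leftKey) {
                probedOut[0] = probed;
                return false;      // match found: anti join drops the row, skip the rest
            }
        }
        probedOut[0] = probed;
        return true;               // no match anywhere: forward the left row
    }
}
```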
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461562&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461562 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 13:09 Start Date: 21/Jul/20 13:09 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458082243 ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. \n" + "If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the\n" + "specified size, the join is directly converted to a mapjoin (there is no conditional task)."), - +HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false, Review comment: Agree with the above, I believe we should enable anti-join by default as 1) this feature should always improve runtime 2) can help us find possible issues and further optimize existing implementation Issue Time Tracking --- Worklog Id: (was: 461562) Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461564&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461564 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 13:09 Start Date: 21/Jul/20 13:09 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r458082243 ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. \n" + "If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the\n" + "specified size, the join is directly converted to a mapjoin (there is no conditional task)."), - +HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false, Review comment: Agree with the above, I believe we should enable anti-join by default as 1) this feature should always improve runtime 2) can help us find possible issues and 3) further optimize existing implementation based on future scenarios Issue Time Tracking --- Worklog Id: (was: 461564) Time Spent: 1h 50m (was: 1h 40m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461406&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461406 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 21/Jul/20 04:32 Start Date: 21/Jul/20 04:32 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r457829820 ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. \n" + "If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the\n" + "specified size, the join is directly converted to a mapjoin (there is no conditional task)."), - +HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false, Review comment: Is there any reason why we should not enable this by default in master? It seems it is always beneficial to execute the antijoin since we already have a vectorized implementation too. That would increase the test coverage for the feature. Issue Time Tracking --- Worklog Id: (was: 461406) Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461160&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461160 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 20/Jul/20 16:37 Start Date: 20/Jul/20 16:37 Worklog Time Spent: 10m Work Description: maheshk114 commented on a change in pull request #1147: URL: https://github.com/apache/hive/pull/1147#discussion_r457545081 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java ## @@ -0,0 +1,149 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.ql.optimizer.calcite.rules; + +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptUtil; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.core.Filter; +import org.apache.calcite.rel.core.Join; +import org.apache.calcite.rel.core.JoinRelType; +import org.apache.calcite.rel.core.Project; +import org.apache.calcite.rel.type.RelDataTypeField; +import org.apache.calcite.rex.RexInputRef; +import org.apache.calcite.rex.RexNode; +import org.apache.calcite.sql.SqlKind; +import org.apache.calcite.util.ImmutableBitSet; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +/** + * Planner rule that converts a join plus filter to anti join. + */ +public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule { + protected static final Logger LOG = LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class); + public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new HiveJoinWithFilterToAntiJoinRule(); + + //HiveProject(fld=[$0]) + // HiveFilter(condition=[IS NULL($1)]) + //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available]) + // + // TO + // + //HiveProject(fld_tbl=[$0]) + // HiveAntiJoin(condition=[=($0, $1)], joinType=[anti]) + // + public HiveJoinWithFilterToAntiJoinRule() { +super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any(, +"HiveJoinWithFilterToAntiJoinRule:filter"); + } + + // is null filter over a left join. 
+ public void onMatch(final RelOptRuleCall call) { +final Project project = call.rel(0); +final Filter filter = call.rel(1); +final Join join = call.rel(2); +perform(call, project, filter, join); + } + + protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) { +LOG.debug("Matched HiveAntiJoinRule"); Review comment: sure ..will do that Issue Time Tracking --- Worklog Id: (was: 461160) Time Spent: 1h 20m (was: 1h 10m)
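The plan shape this rule matches (a project over an IS NULL filter over a left join, per the comment in the quoted rule) can be sketched as a plain-Java structural check. Everything below is illustrative only — the class and field names are invented, not Calcite or Hive code:

```java
// Toy plan nodes and a structural check for the pattern
// project -> filter(IS NULL on a right-side column) -> left join,
// which is the only shape the current patch converts to an anti join.
public class AntiJoinPatternSketch {

    enum JoinType { INNER, LEFT, ANTI }

    static class PlanNode {
        String kind;                        // "project", "filter", or "join"
        JoinType joinType;                  // meaningful only for joins
        boolean filterIsNullOnRightColumn;  // meaningful only for filters
        PlanNode input;
        PlanNode(String kind, PlanNode input) { this.kind = kind; this.input = input; }
    }

    // The conversion fires only for project -> filter(IS NULL on right col) -> left join.
    static boolean matchesAntiJoinPattern(PlanNode root) {
        if (!"project".equals(root.kind) || root.input == null) return false;
        PlanNode filter = root.input;
        if (!"filter".equals(filter.kind) || !filter.filterIsNullOnRightColumn) return false;
        PlanNode join = filter.input;
        return join != null && "join".equals(join.kind) && join.joinType == JoinType.LEFT;
    }
}
```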
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461139&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461139 ] ASF GitHub Bot logged work on HIVE-23716: - Author: ASF GitHub Bot Created on: 20/Jul/20 15:51 Start Date: 20/Jul/20 15:51 Worklog Time Spent: 10m Work Description: ramesh0201 commented on pull request #1147: URL: https://github.com/apache/hive/pull/1147#issuecomment-661124391 Runtime changes look good to me +1. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 461139) Time Spent: 1h 10m (was: 1h) > Support Anti Join in Hive > -- > > Key: HIVE-23716 > URL: https://issues.apache.org/jira/browse/HIVE-23716 > Project: Hive > Issue Type: Bug >Reporter: mahesh kumar behera >Assignee: mahesh kumar behera >Priority: Major > Labels: pull-request-available > Attachments: HIVE-23716.01.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Currently hive does not support Anti join. The query for anti join is > converted to left outer join and null filter on right side join key is added > to get the desired result. This is causing > # Extra computation — The left outer join projects the redundant columns > from right side. Along with that, filtering is done to remove the redundant > rows. This is can be avoided in case of anti join as anti join will project > only the required columns and rows from the left side table. > # Extra shuffle — In case of anti join the duplicate records moved to join > node can be avoided from the child node. This can reduce significant amount > of data movement if the number of distinct rows( join keys) is significant. 
> # Extra memory usage — for a map-based anti join, a hash set is sufficient,
> since only the key is needed to check whether a record matches the join
> condition. For a left join, both the key and the non-key columns are needed,
> so a full hash table is required.
> For a query like
> {code:java}
> select wr_order_number FROM web_returns LEFT JOIN web_sales ON
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in the web_sales table of a
> typical 10TB TPCDS setup is just 10% of the total records. So when this query
> is converted to an anti join, only 600 million rows are moved to the join
> node instead of 7 billion.
> In the current patch, just one conversion is done: the pattern
> project->filter->left-join is converted to project->anti-join. This takes
> care of subqueries with a "not exists" clause, which are first converted to
> filter + left-join and then to anti join. Queries with "not in" are not
> handled in the current patch.
> On the execution side, both merge join and map join with vectorized
> execution are supported for anti join.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
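The memory and projection argument above can be illustrated with a toy model. This is plain Python, not Hive or Calcite code; the row layout and column names are invented for the example:

```python
def left_join_null_filter(left, right, key="key"):
    # Pre-patch plan: build a full hash table of right rows (key + payload),
    # left-outer-join, then keep only rows whose right side is NULL (None).
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)
    joined = []
    for l in left:
        matches = table.get(l[key], [None])
        for r in matches:
            joined.append((l, r))  # right-side columns carried redundantly
    return [l for (l, r) in joined if r is None]


def anti_join(left, right, key="key"):
    # Anti-join plan: a hash *set* of right-side keys is enough, and only
    # left-side columns are ever projected.
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] not in right_keys]
```

Both functions return the same rows, but the anti-join version never materializes right-side payload columns and never produces intermediate joined rows that must be filtered away, which mirrors the computation, shuffle, and memory savings described in the issue.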
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=460872&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-460872 ]
ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 20/Jul/20 03:14
Start Date: 20/Jul/20 03:14
Worklog Time Spent: 10m
Work Description: ramesh0201 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r457008983

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
## @@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class);
+  public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new HiveJoinWithFilterToAntiJoinRule();
+
+  //  HiveProject(fld=[$0])
+  //    HiveFilter(condition=[IS NULL($1)])
+  //      HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available])
+  //
+  //  TO
+  //
+  //  HiveProject(fld_tbl=[$0])
+  //    HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveJoinWithFilterToAntiJoinRule() {
+    super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+        "HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+    final Project project = call.rel(0);
+    final Filter filter = call.rel(1);
+    final Join join = call.rel(2);
+    perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) {
+    LOG.debug("Matched HiveAntiJoinRule");

Review comment: I think this can be moved down below all the condition checks and return statements, and wrapped in an isDebugEnabled check?

Issue Time Tracking
---
Worklog Id: (was: 460872) Time Spent: 1h (was: 50m)
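The shape of the rewrite this rule performs can be sketched on a toy plan tree. This is illustrative Python, not the Calcite API; the node classes and the string-based condition check are invented for the example:

```python
from dataclasses import dataclass


@dataclass
class Join:            # toy stand-in for a join node
    join_type: str     # "left", "inner", ...
    condition: str     # e.g. "=($0, $1)"


@dataclass
class AntiJoin:        # result node of the rewrite
    condition: str


@dataclass
class Filter:          # filter over the join, e.g. IS NULL on the right key
    condition: str
    input: object


@dataclass
class Project:         # projection on top of the plan
    fields: list
    input: object


def try_rewrite(project):
    """Mimic the rule's match: Project over Filter(IS NULL on the right-side
    key) over a left Join is rewritten to Project over AntiJoin; any other
    shape is left unchanged."""
    filt = project.input
    join = filt.input
    if join.join_type != "left" or not filt.condition.startswith("IS NULL"):
        return project  # pattern does not match
    return Project(project.fields, AntiJoin(join.condition))
```

In the real rule, matching on the `operand(Project, operand(Filter, operand(Join, any())))` tree is done by the Calcite planner, and the condition checks are performed on `RexNode` expressions rather than strings, but the overall transformation is the same: the IS NULL filter and the left join collapse into a single anti-join node.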
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=460429&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-460429 ]
ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 17/Jul/20 18:02
Start Date: 17/Jul/20 18:02
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r456594608

## File path: ql/src/test/results/clientpositive/llap/antijoin.q.out
## @@ -0,0 +1,1007 @@
+PREHOOK: query: create table t1_n55 as select cast(key as int) key, value from src where key <= 10
+PREHOOK: type: CREATETABLE_AS_SELECT
+PREHOOK: Input: default@src
+PREHOOK: Output: database:default
+PREHOOK: Output: default@t1_n55
+POSTHOOK: query: create table t1_n55 as select cast(key as int) key, value from src where key <= 10
+POSTHOOK: type: CREATETABLE_AS_SELECT
+POSTHOOK: Input: default@src
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@t1_n55
+POSTHOOK: Lineage: t1_n55.key EXPRESSION [(src)src.FieldSchema(name:key, type:string, comment:default), ]
+POSTHOOK: Lineage: t1_n55.value SIMPLE [(src)src.FieldSchema(name:value, type:string, comment:default), ]
+PREHOOK: query: select * from t1_n55 sort by key
+PREHOOK: type: QUERY
+PREHOOK: Input: default@t1_n55
+ A masked pattern was here
+POSTHOOK: query: select * from t1_n55 sort by key
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@t1_n55
+ A masked pattern was here
+0 val_0

Review comment: All of these new test cases were added from the failing tests of a dry run with anti join enabled. I have manually verified that the resulting records are the same and that the plan differences match the expected behavior.
Issue Time Tracking
---
Worklog Id: (was: 460429) Time Spent: 50m (was: 40m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=460427&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-460427 ]
ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 17/Jul/20 18:01
Start Date: 17/Jul/20 18:01
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r456593908

## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
## @@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal
         "Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. \n" +
         "If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the\n" +
         "specified size, the join is directly converted to a mapjoin (there is no conditional task)."),
-
+    HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false,

Review comment: Yes, I had triggered a ptest run with this config enabled to true by default. There were some 26 failures. I analyzed them, and fixes were made to ensure that the results are the same in both cases and that the plan differences are as expected.

Issue Time Tracking
---
Worklog Id: (was: 460427) Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=460416&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-460416 ]
ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 17/Jul/20 17:52
Start Date: 17/Jul/20 17:52
Worklog Time Spent: 10m
Work Description: vineetgarg02 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r456588923

## File path: ql/src/test/results/clientpositive/llap/antijoin.q.out
## @@ -0,0 +1,1007 @@
+PREHOOK: query: create table t1_n55 as select cast(key as int) key, value from src where key <= 10
+PREHOOK: type: CREATETABLE_AS_SELECT
+PREHOOK: Input: default@src
+PREHOOK: Output: database:default
+PREHOOK: Output: default@t1_n55
+POSTHOOK: query: create table t1_n55 as select cast(key as int) key, value from src where key <= 10
+POSTHOOK: type: CREATETABLE_AS_SELECT
+POSTHOOK: Input: default@src
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@t1_n55
+POSTHOOK: Lineage: t1_n55.key EXPRESSION [(src)src.FieldSchema(name:key, type:string, comment:default), ]
+POSTHOOK: Lineage: t1_n55.value SIMPLE [(src)src.FieldSchema(name:value, type:string, comment:default), ]
+PREHOOK: query: select * from t1_n55 sort by key
+PREHOOK: type: QUERY
+PREHOOK: Input: default@t1_n55
+ A masked pattern was here
+POSTHOOK: query: select * from t1_n55 sort by key
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@t1_n55
+ A masked pattern was here
+0 val_0

Review comment: How was the correctness of the results verified?
Issue Time Tracking
---
Worklog Id: (was: 460416) Time Spent: 0.5h (was: 20m)
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive
[ https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=460414&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-460414 ]
ASF GitHub Bot logged work on HIVE-23716:
-
Author: ASF GitHub Bot
Created on: 17/Jul/20 17:51
Start Date: 17/Jul/20 17:51
Worklog Time Spent: 10m
Work Description: vineetgarg02 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r456588241

## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
## @@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal
         "Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. \n" +
         "If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the\n" +
         "specified size, the join is directly converted to a mapjoin (there is no conditional task)."),
-
+    HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false,

Review comment: @maheshk114 Have you run all the tests with this feature set to true by default? This change touches existing logic/code, and we should definitely run all the existing tests with this set to TRUE.

Issue Time Tracking
---
Worklog Id: (was: 460414) Time Spent: 20m (was: 10m)