[jira] [Commented] (HIVE-13693) Multi-insert query drops Filter before file output when there is a.val <> b.val

2020-10-21 Thread zhaolong (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218703#comment-17218703
 ] 

zhaolong commented on HIVE-13693:
-

[~jcamachorodriguez] hello  I found this patch may cause a semi join to left 
outer join?  NOT EXISTS query result when cbo disabled is not correct. 

the left is 3.1.0 branch,  the right is 1.2.1 branch, and query like select ... 
from a,b,c,d .. where a.xx=b.xx and c.a='' and c.b='' NOT EXIST(select 1 from d 
where d.xx='')。 the semi join are convert to left join, and some filter 
condition are missed

!image-2020-10-22-10-46-04-195.png!

> Multi-insert query drops Filter before file output when there is a.val <> 
> b.val
> ---
>
> Key: HIVE-13693
> URL: https://issues.apache.org/jira/browse/HIVE-13693
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 1.3.0, 2.0.0, 2.1.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
> Fix For: 2.1.0
>
> Attachments: HIVE-13693.01.patch, HIVE-13693.01.patch, 
> HIVE-13693.02.patch, HIVE-13693.patch
>
>
> To reproduce:
> {noformat}
> CREATE TABLE T_A ( id STRING, val STRING ); 
> CREATE TABLE T_B ( id STRING, val STRING ); 
> CREATE TABLE join_result_1 ( ida STRING, vala STRING, idb STRING, valb STRING 
> ); 
> CREATE TABLE join_result_3 ( ida STRING, vala STRING, idb STRING, valb STRING 
> ); 
> INSERT INTO TABLE T_A 
> VALUES ('Id_1', 'val_101'), ('Id_2', 'val_102'), ('Id_3', 'val_103'); 
> INSERT INTO TABLE T_B 
> VALUES ('Id_1', 'val_103'), ('Id_2', 'val_104'); 
> explain
> FROM T_A a LEFT JOIN T_B b ON a.id = b.id
> INSERT OVERWRITE TABLE join_result_1
> SELECT a.*, b.*
> WHERE b.id = 'Id_1' AND b.val = 'val_103'
> INSERT OVERWRITE TABLE join_result_3
> SELECT a.*, b.*
> WHERE b.val = 'val_104' AND b.id = 'Id_2' AND a.val <> b.val;
> {noformat}
> The (wrong) plan is the following:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-3 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-3
>   Stage-4 depends on stages: Stage-0
>   Stage-1 depends on stages: Stage-3
>   Stage-5 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Tez
>   DagId: haha_20160504140944_174465c9-5d1a-42f9-9665-fae02eeb2767:2
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
>   DagName: 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: a
>   Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE 
> Column stats: NONE
>   Reduce Output Operator
> key expressions: id (type: string)
> sort order: +
> Map-reduce partition columns: id (type: string)
> Statistics: Num rows: 3 Data size: 36 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: val (type: string)
> Map 3 
> Map Operator Tree:
> TableScan
>   alias: b
>   Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE 
> Column stats: NONE
>   Reduce Output Operator
> key expressions: id (type: string)
> sort order: +
> Map-reduce partition columns: id (type: string)
> Statistics: Num rows: 2 Data size: 24 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: val (type: string)
> Reducer 2 
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Left Outer Join0 to 1
> keys:
>   0 id (type: string)
>   1 id (type: string)
> outputColumnNames: _col0, _col1, _col6
> Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE 
> Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string), _col1 (type: string), 
> 'Id_1' (type: string), 'val_103' (type: string)
>   outputColumnNames: _col0, _col1, _col2, _col3
>   Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE 
> Column stats: NONE
>   File Output Operator
> compressed: false
> Statistics: Num rows: 3 Data size: 39 Basic stats: 
> COMPLETE Column stats: NONE
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde: 
> 

[jira] [Commented] (HIVE-13693) Multi-insert query drops Filter before file output when there is a.val <> b.val

2016-05-16 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284790#comment-15284790
 ] 

Hive QA commented on HIVE-13693:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12804133/HIVE-13693.01.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 24 failed/errored test(s), 9998 tests 
executed
*Failed tests:*
{noformat}
TestHWISessionManager - did not produce a TEST-*.xml file
TestMiniLlapCliDriver - did not produce a TEST-*.xml file
TestMiniTezCliDriver-enforce_order.q-vector_partition_diff_num_cols.q-unionDistinct_1.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-explainuser_4.q-update_after_multiple_inserts.q-mapreduce2.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-groupby2.q-tez_dynpart_hashjoin_1.q-custom_input_output_format.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-join1.q-mapjoin_decimal.q-union5.q-and-12-more - did not 
produce a TEST-*.xml file
TestMiniTezCliDriver-mapjoin_mapjoin.q-insert_into1.q-vector_decimal_2.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-smb_cache.q-transform_ppr2.q-vector_outer_join0.q-and-5-more
 - did not produce a TEST-*.xml file
TestSparkCliDriver-ppd_transform.q-union_remove_7.q-date_udf.q-and-12-more - 
did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ivyDownload
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_external_table_ppd
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_hbase_ppd_key_range
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_hbase_pushdown
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_hbase_timestamp
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_ppd_key_ranges
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_multi_single_reducer
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_multi_insert_gby
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_subquery_multiinsert
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_udf_min
org.apache.hadoop.hive.llap.tez.TestConverters.testFragmentSpecToTaskSpec
org.apache.hadoop.hive.llap.tezplugins.TestLlapTaskCommunicator.testFinishableStateUpdateFailure
org.apache.hadoop.hive.llap.tezplugins.TestLlapTaskSchedulerService.testDelayedLocalityNodeCommErrorImmediateAllocation
org.apache.hive.service.cli.session.TestHiveSessionImpl.testLeakOperationHandle
{noformat}

Test results: 
http://ec2-54-177-240-2.us-west-1.compute.amazonaws.com/job/PreCommit-HIVE-MASTER-Build/298/testReport
Console output: 
http://ec2-54-177-240-2.us-west-1.compute.amazonaws.com/job/PreCommit-HIVE-MASTER-Build/298/console
Test logs: 
http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-298/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 24 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12804133 - PreCommit-HIVE-MASTER-Build

> Multi-insert query drops Filter before file output when there is a.val <> 
> b.val
> ---
>
> Key: HIVE-13693
> URL: https://issues.apache.org/jira/browse/HIVE-13693
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 1.3.0, 2.0.0, 2.1.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Attachments: HIVE-13693.01.patch, HIVE-13693.01.patch, 
> HIVE-13693.patch
>
>
> To reproduce:
> {noformat}
> CREATE TABLE T_A ( id STRING, val STRING ); 
> CREATE TABLE T_B ( id STRING, val STRING ); 
> CREATE TABLE join_result_1 ( ida STRING, vala STRING, idb STRING, valb STRING 
> ); 
> CREATE TABLE join_result_3 ( ida STRING, vala STRING, idb STRING, valb STRING 
> ); 
> INSERT INTO TABLE T_A 
> VALUES ('Id_1', 'val_101'), ('Id_2', 'val_102'), ('Id_3', 'val_103'); 
> INSERT INTO TABLE T_B 
> VALUES ('Id_1', 'val_103'), ('Id_2', 'val_104'); 
> explain
> FROM T_A a LEFT JOIN T_B b ON a.id = b.id
> INSERT OVERWRITE TABLE join_result_1
> SELECT a.*, b.*
> WHERE b.id = 'Id_1' AND b.val = 'val_103'
> INSERT OVERWRITE TABLE join_result_3
> SELECT a.*, b.*
> WHERE b.val = 'val_104' AND b.id = 'Id_2' AND a.val <> b.val;
> {noformat}
> The (wrong) plan is the following:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-3 depends on stages: 

[jira] [Commented] (HIVE-13693) Multi-insert query drops Filter before file output when there is a.val <> b.val

2016-05-07 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275235#comment-15275235
 ] 

Hive QA commented on HIVE-13693:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12802464/HIVE-13693.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 77 failed/errored test(s), 9939 tests 
executed
*Failed tests:*
{noformat}
TestCliDriver-gen_udf_example_add10.q-ppd_join4.q-union27.q-and-12-more - did 
not produce a TEST-*.xml file
TestCliDriver-partition_timestamp.q-ppd_random.q-vector_outer_join5.q-and-12-more
 - did not produce a TEST-*.xml file
TestCliDriver-ptf_general_queries.q-unionDistinct_1.q-groupby1_noskew.q-and-12-more
 - did not produce a TEST-*.xml file
TestHWISessionManager - did not produce a TEST-*.xml file
TestMiniLlapCliDriver - did not produce a TEST-*.xml file
TestMiniTezCliDriver-enforce_order.q-vector_partition_diff_num_cols.q-unionDistinct_1.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-insert_values_non_partitioned.q-join1.q-schema_evol_orc_nonvec_mapwork_part.q-and-12-more
 - did not produce a TEST-*.xml file
TestSparkCliDriver-bucketsortoptimize_insert_7.q-smb_mapjoin_15.q-mapreduce1.q-and-12-more
 - did not produce a TEST-*.xml file
TestSparkCliDriver-skewjoinopt3.q-union27.q-multigroupby_singlemr.q-and-12-more 
- did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_allcolref_in_udf
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cluster
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_multi_single_reducer
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join26
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join38
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join_map_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join_reorder2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join_reorder3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_masking_disablecbo_1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_masking_disablecbo_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_masking_disablecbo_3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_multi_insert_gby
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_clusterby
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_semijoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_semijoin2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_smb_mapjoin9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_smb_mapjoin_14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_smb_mapjoin_6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sort_merge_join_desc_1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sort_merge_join_desc_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sort_merge_join_desc_3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sort_merge_join_desc_4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sort_merge_join_desc_8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_multiinsert
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_leftsemi_mapjoin
org.apache.hadoop.hive.cli.TestContribNegativeCliDriver.testNegativeCliDriver_case_with_row_sequence
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_dynamic_partition_pruning
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_explainuser_1
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_explainuser_2
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_hybridgrace_hashjoin_2
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_llap_nullscan
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_skewjoin
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_tez_union_group_by
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_leftsemi_mapjoin
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_mapjoin_reduce
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vectorized_dynamic_partition_pruning
org.apache.hadoop.hive.cli.TestPerfCliDriver.testPerfCliDriver_query12
org.apache.hadoop.hive.cli.TestPerfCliDriver.testPerfCliDriver_query17
org.apache.hadoop.hive.cli.TestPerfCliDriver.testPerfCliDriver_query18
org.apache.hadoop.hive.cli.TestPerfCliDriver.testPerfCliDriver_query19
org.apache.hadoop.hive.cli.TestPerfCliDriver.testPerfCliDriver_query25
org.apache.hadoop.hive.cli.TestPerfCliDriver.testPerfCliDriver_query29

[jira] [Commented] (HIVE-13693) Multi-insert query drops Filter before file output when there is a.val <> b.val

2016-05-05 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272943#comment-15272943
 ] 

Ashutosh Chauhan commented on HIVE-13693:
-

+1 pending tests

> Multi-insert query drops Filter before file output when there is a.val <> 
> b.val
> ---
>
> Key: HIVE-13693
> URL: https://issues.apache.org/jira/browse/HIVE-13693
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 1.3.0, 2.0.0, 2.1.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Attachments: HIVE-13693.patch
>
>
> To reproduce:
> {noformat}
> CREATE TABLE T_A ( id STRING, val STRING ); 
> CREATE TABLE T_B ( id STRING, val STRING ); 
> CREATE TABLE join_result_1 ( ida STRING, vala STRING, idb STRING, valb STRING 
> ); 
> CREATE TABLE join_result_3 ( ida STRING, vala STRING, idb STRING, valb STRING 
> ); 
> INSERT INTO TABLE T_A 
> VALUES ('Id_1', 'val_101'), ('Id_2', 'val_102'), ('Id_3', 'val_103'); 
> INSERT INTO TABLE T_B 
> VALUES ('Id_1', 'val_103'), ('Id_2', 'val_104'); 
> explain
> FROM T_A a LEFT JOIN T_B b ON a.id = b.id
> INSERT OVERWRITE TABLE join_result_1
> SELECT a.*, b.*
> WHERE b.id = 'Id_1' AND b.val = 'val_103'
> INSERT OVERWRITE TABLE join_result_3
> SELECT a.*, b.*
> WHERE b.val = 'val_104' AND b.id = 'Id_2' AND a.val <> b.val;
> {noformat}
> The (wrong) plan is the following:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-3 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-3
>   Stage-4 depends on stages: Stage-0
>   Stage-1 depends on stages: Stage-3
>   Stage-5 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Tez
>   DagId: haha_20160504140944_174465c9-5d1a-42f9-9665-fae02eeb2767:2
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
>   DagName: 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: a
>   Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE 
> Column stats: NONE
>   Reduce Output Operator
> key expressions: id (type: string)
> sort order: +
> Map-reduce partition columns: id (type: string)
> Statistics: Num rows: 3 Data size: 36 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: val (type: string)
> Map 3 
> Map Operator Tree:
> TableScan
>   alias: b
>   Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE 
> Column stats: NONE
>   Reduce Output Operator
> key expressions: id (type: string)
> sort order: +
> Map-reduce partition columns: id (type: string)
> Statistics: Num rows: 2 Data size: 24 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: val (type: string)
> Reducer 2 
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Left Outer Join0 to 1
> keys:
>   0 id (type: string)
>   1 id (type: string)
> outputColumnNames: _col0, _col1, _col6
> Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE 
> Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string), _col1 (type: string), 
> 'Id_1' (type: string), 'val_103' (type: string)
>   outputColumnNames: _col0, _col1, _col2, _col3
>   Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE 
> Column stats: NONE
>   File Output Operator
> compressed: false
> Statistics: Num rows: 3 Data size: 39 Basic stats: 
> COMPLETE Column stats: NONE
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> name: bugtest2.join_result_1
> Filter Operator
>   predicate: (_col1 <> _col6) (type: boolean)
>   Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE 
> Column stats: NONE
>   Select Operator
> expressions: _col0 (type: string), _col1 (type: string), 
> 'Id_2' (type: string), 'val_104' (type: string)
> outputColumnNames: _col0, _col1, _col2, _col3
>  

[jira] [Commented] (HIVE-13693) Multi-insert query drops Filter before file output when there is a.val <> b.val

2016-05-05 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272898#comment-15272898
 ] 

Ashutosh Chauhan commented on HIVE-13693:
-

Thanks Jesus for detailed explanation. Can you create a RB for this? 

> Multi-insert query drops Filter before file output when there is a.val <> 
> b.val
> ---
>
> Key: HIVE-13693
> URL: https://issues.apache.org/jira/browse/HIVE-13693
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 1.3.0, 2.0.0, 2.1.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Attachments: HIVE-13693.patch
>
>
> To reproduce:
> {noformat}
> CREATE TABLE T_A ( id STRING, val STRING ); 
> CREATE TABLE T_B ( id STRING, val STRING ); 
> CREATE TABLE join_result_1 ( ida STRING, vala STRING, idb STRING, valb STRING 
> ); 
> CREATE TABLE join_result_3 ( ida STRING, vala STRING, idb STRING, valb STRING 
> ); 
> INSERT INTO TABLE T_A 
> VALUES ('Id_1', 'val_101'), ('Id_2', 'val_102'), ('Id_3', 'val_103'); 
> INSERT INTO TABLE T_B 
> VALUES ('Id_1', 'val_103'), ('Id_2', 'val_104'); 
> explain
> FROM T_A a LEFT JOIN T_B b ON a.id = b.id
> INSERT OVERWRITE TABLE join_result_1
> SELECT a.*, b.*
> WHERE b.id = 'Id_1' AND b.val = 'val_103'
> INSERT OVERWRITE TABLE join_result_3
> SELECT a.*, b.*
> WHERE b.val = 'val_104' AND b.id = 'Id_2' AND a.val <> b.val;
> {noformat}
> The (wrong) plan is the following:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-3 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-3
>   Stage-4 depends on stages: Stage-0
>   Stage-1 depends on stages: Stage-3
>   Stage-5 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Tez
>   DagId: haha_20160504140944_174465c9-5d1a-42f9-9665-fae02eeb2767:2
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
>   DagName: 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: a
>   Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE 
> Column stats: NONE
>   Reduce Output Operator
> key expressions: id (type: string)
> sort order: +
> Map-reduce partition columns: id (type: string)
> Statistics: Num rows: 3 Data size: 36 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: val (type: string)
> Map 3 
> Map Operator Tree:
> TableScan
>   alias: b
>   Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE 
> Column stats: NONE
>   Reduce Output Operator
> key expressions: id (type: string)
> sort order: +
> Map-reduce partition columns: id (type: string)
> Statistics: Num rows: 2 Data size: 24 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: val (type: string)
> Reducer 2 
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Left Outer Join0 to 1
> keys:
>   0 id (type: string)
>   1 id (type: string)
> outputColumnNames: _col0, _col1, _col6
> Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE 
> Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string), _col1 (type: string), 
> 'Id_1' (type: string), 'val_103' (type: string)
>   outputColumnNames: _col0, _col1, _col2, _col3
>   Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE 
> Column stats: NONE
>   File Output Operator
> compressed: false
> Statistics: Num rows: 3 Data size: 39 Basic stats: 
> COMPLETE Column stats: NONE
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> name: bugtest2.join_result_1
> Filter Operator
>   predicate: (_col1 <> _col6) (type: boolean)
>   Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE 
> Column stats: NONE
>   Select Operator
> expressions: _col0 (type: string), _col1 (type: string), 
> 'Id_2' (type: string), 'val_104' (type: string)
> 

[jira] [Commented] (HIVE-13693) Multi-insert query drops Filter before file output when there is a.val <> b.val

2016-05-05 Thread Jesus Camacho Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272699#comment-15272699
 ] 

Jesus Camacho Rodriguez commented on HIVE-13693:


Problem is in PPD and it only happens for multi-insert.

HIVE-9695 introduced partial removal of predicates in Filter operators. It was 
a fairly simple patch: given a Filter operator, a new Filter will replace the 
original one with the parts of the condition that cannot be pushed down, while 
the other parts of the condition will be pushed.
The patch relied on createFilter for the logic of Filter creation, that would 
replace/remove the original Filter immediately after the predicate was pushed.

The problem comes with operators with multiple outputs i.e. multi-insert, for 
which PPD does not pushed predicates through. In particular, when we are 
executing PPD and we detect an operator with multiple outputs, we just jump out 
to its input and thus we do not continue pushing the predicates. However, due 
to the logic introduced in HIVE-9695, the original Filter already got removed...

Attached a fix that recreates the Filter operator(s) if necessary on top of the 
multi-output operator.

> Multi-insert query drops Filter before file output when there is a.val <> 
> b.val
> ---
>
> Key: HIVE-13693
> URL: https://issues.apache.org/jira/browse/HIVE-13693
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 1.3.0, 2.0.0, 2.1.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>
> To reproduce:
> {noformat}
> CREATE TABLE T_A ( id STRING, val STRING ); 
> CREATE TABLE T_B ( id STRING, val STRING ); 
> CREATE TABLE join_result_1 ( ida STRING, vala STRING, idb STRING, valb STRING 
> ); 
> CREATE TABLE join_result_3 ( ida STRING, vala STRING, idb STRING, valb STRING 
> ); 
> INSERT INTO TABLE T_A 
> VALUES ('Id_1', 'val_101'), ('Id_2', 'val_102'), ('Id_3', 'val_103'); 
> INSERT INTO TABLE T_B 
> VALUES ('Id_1', 'val_103'), ('Id_2', 'val_104'); 
> explain
> FROM T_A a LEFT JOIN T_B b ON a.id = b.id
> INSERT OVERWRITE TABLE join_result_1
> SELECT a.*, b.*
> WHERE b.id = 'Id_1' AND b.val = 'val_103'
> INSERT OVERWRITE TABLE join_result_3
> SELECT a.*, b.*
> WHERE b.val = 'val_104' AND b.id = 'Id_2' AND a.val <> b.val;
> {noformat}
> The (wrong) plan is the following:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-3 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-3
>   Stage-4 depends on stages: Stage-0
>   Stage-1 depends on stages: Stage-3
>   Stage-5 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Tez
>   DagId: haha_20160504140944_174465c9-5d1a-42f9-9665-fae02eeb2767:2
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
>   DagName: 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: a
>   Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE 
> Column stats: NONE
>   Reduce Output Operator
> key expressions: id (type: string)
> sort order: +
> Map-reduce partition columns: id (type: string)
> Statistics: Num rows: 3 Data size: 36 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: val (type: string)
> Map 3 
> Map Operator Tree:
> TableScan
>   alias: b
>   Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE 
> Column stats: NONE
>   Reduce Output Operator
> key expressions: id (type: string)
> sort order: +
> Map-reduce partition columns: id (type: string)
> Statistics: Num rows: 2 Data size: 24 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: val (type: string)
> Reducer 2 
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Left Outer Join0 to 1
> keys:
>   0 id (type: string)
>   1 id (type: string)
> outputColumnNames: _col0, _col1, _col6
> Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE 
> Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string), _col1 (type: string), 
> 'Id_1' (type: string), 'val_103' (type: string)
>   outputColumnNames: _col0, _col1, _col2, _col3
>   Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE 
> Column stats: NONE
>