[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same

2016-01-19 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated HIVE-12736:
-
Attachment: HIVE-12736.5-spark.patch

[~xuefuz], yes, it's related; I missed something here. Group By before MapJoin 
is not allowed, and in MR mode {{ReduceSinkOperator}} is used to check whether 
there is a Group By before the MapJoin. As mentioned before, that check 
conflicts with Spark mode. Instead of validating MapJoin compatibility with 
other operators through {{opAllowedBeforeMapJoin()}} and 
{{opAllowedAfterMapJoin()}}, it should be easier and more proper to implement 
the validation through pattern matching. I didn't rewrite the validation for MR 
mode; I just added new validation logic for Spark mode based on pattern 
matching.
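
To illustrate what a pattern-match style validation could look like, here is a minimal Python sketch (not the Hive code, and not the contents of the patch). The {{Op}} class, {{FORBIDDEN_BEFORE_MAPJOIN}}, {{forbidden_upstream}} and {{validate_map_join}} are hypothetical names invented for this example. It also assumes that any Group By or Union anywhere upstream of the MapJoin should be rejected, which may be stricter than what the real validation does.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    """Toy stand-in for a Hive operator; `parents` are its upstream operators."""
    kind: str
    parents: List["Op"] = field(default_factory=list)


# Assumption for this sketch: these operator kinds must not feed a hinted MapJoin.
FORBIDDEN_BEFORE_MAPJOIN = {"GroupBy", "Union"}


def forbidden_upstream(op: Op) -> List[str]:
    """Return the forbidden operator kinds found anywhere upstream of `op`."""
    found = []
    stack = list(op.parents)
    while stack:
        cur = stack.pop()
        if cur.kind in FORBIDDEN_BEFORE_MAPJOIN:
            found.append(cur.kind)
        stack.extend(cur.parents)
    return found


def validate_map_join(map_join: Op) -> None:
    """Raise if the operator tree above `map_join` matches a forbidden pattern."""
    bad = forbidden_upstream(map_join)
    if bad:
        raise ValueError(f"mapjoin hint not supported after: {sorted(set(bad))}")


# Tree shaped like the query in this issue: staff joined with a UNION ALL subquery.
union = Op("Union", [Op("Select", [Op("TableScan")]),
                     Op("Select", [Op("TableScan")])])
union_join = Op("MapJoin", [Op("ReduceSink", [Op("TableScan")]),
                            Op("ReduceSink", [union])])

# A plain two-table join; the ReduceSink operators alone do not trigger a failure.
plain_join = Op("MapJoin", [Op("ReduceSink", [Op("TableScan")]),
                            Op("ReduceSink", [Op("TableScan")])])
```

With these toy trees, {{validate_map_join(plain_join)}} returns normally, while {{validate_map_join(union_join)}} raises, because the check keys on the shape of the tree rather than on a per-operator flag.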

> It seems that result of Hive on Spark be mistaken and result of Hive and Hive 
> on Spark are not the same
> ---
>
> Key: HIVE-12736
> URL: https://issues.apache.org/jira/browse/HIVE-12736
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.2.1
>Reporter: JoneZhang
>Assignee: Chengxiang Li
> Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, 
> HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch, HIVE-12736.5-spark.patch
>
>
> {code}
> select  * from staff;
> 1   jone    22  1
> 2   lucy    21  1
> 3   hmm     22  2
> 4   james   24  3
> 5   xiaoliu 23  3
> select id,date_ from trade union all select id,"test" from trade ;
> 1 201510210908
> 2 201509080234
> 2 201509080235
> 1 test
> 2 test
> 2 test
> set hive.execution.engine=spark;
> set spark.master=local;
> select /*+mapjoin(t)*/ * from staff s join 
> (select id,date_ from trade union all select id,"test" from trade ) t on 
> s.id=t.id;
> 1   jone    22  1   1   201510210908
> 2   lucy    21  1   2   201509080234
> 2   lucy    21  1   2   201509080235
> set hive.execution.engine=mr;
> select /*+mapjoin(t)*/ * from staff s join 
> (select id,date_ from trade union all select id,"test" from trade ) t on 
> s.id=t.id;
> FAILED: SemanticException [Error 10227]: Not all clauses are supported with 
> mapjoin hint. Please remove mapjoin hint.
> {code}
> I have two questions:
> 1. Why does the result of Hive on Spark not include the following records?
> {code}
> 1   jone    22  1   1   test
> 2   lucy    21  1   2   test
> 2   lucy    21  1   2   test
> {code}
> 2. Why are there two different ways of handling the same query?
> explain 1:
> {code}
> set hive.execution.engine=spark;
> set spark.master=local;
> explain 
> select id,date_ from trade union all select id,"test" from trade;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: trade
>                   Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE
>                   Select Operator
>                     expressions: id (type: int), date_ (type: string)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE
>                     File Output Operator
>                       compressed: false
>                       Statistics: Num rows: 12 Data size: 96 Basic stats: COMPLETE Column stats: NONE
>                       table:
>                           input format: org.apache.hadoop.mapred.TextInputFormat
>                           output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                           serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Map 2 
>             Map Operator Tree:
>                 TableScan
>                   alias: trade
>                   Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE
>                   Select Operator
>                     expressions: id (type: int), 'test' (type: string)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE
>                     File Output Operator
>                       compressed: false
>                       Statistics: Num rows: 12 Data size: 96 Basic stats: COMPLETE Column stats: NONE
>                       table:
>                           input format: org.apache.hadoop.mapred.TextInputFormat
>                           output format: 

[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same

2016-01-19 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated HIVE-12736:
-
Attachment: HIVE-12736.5-spark.patch

I can't reproduce the failed mapjoin_memcheck.q locally; uploading the patch 
again to verify.


[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same

2016-01-18 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated HIVE-12736:
-
Attachment: HIVE-12736.3-spark.patch

Yes, [~xuefuz], {{Operator::opAllowedBeforeMapJoin()}} and 
{{Operator::opAllowedAfterMapJoin()}} are only used by 
{{MapJoinProcessor::validateMapJoinTypes()}}. In MR mode, if there is a 
{{ReduceSinkOperator}} before the {{MapJoinOperator}}, the 
{{ReduceSinkOperator}} is removed from the operator tree, so 
{{ReduceSinkOperator::opAllowedBeforeMapJoin()}} is never called in MR mode. 
In Spark mode, only one of the two {{ReduceSinkOperator}}s before the 
{{MapJoinOperator}} is removed, so if 
{{ReduceSinkOperator::opAllowedBeforeMapJoin()}} returned false, every mapjoin 
with a hint would fail in Spark mode. That does not make sense; it should fail 
only when a {{UnionOperator}} precedes the {{MapJoinOperator}}. So the change 
does not influence MR mode, and it is required by Spark mode.
Besides, I added a negative test for mapjoin with hint.
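
The argument above can be sketched with a toy model in Python (this is not Hive code). The classes below only borrow the operator names, and {{allowed_before_map_join}}, {{hook_based_check}}, {{spark_tree}} and {{check}} are hypothetical names for this illustration. Which join branch keeps its ReduceSink in Spark mode is also an assumption of the sketch.

```python
class Operator:
    """Toy operator; `allowed_before_map_join` mimics a per-type validation hook."""
    allowed_before_map_join = True

    def __init__(self, *parents):
        self.parents = list(parents)


class TableScan(Operator):
    pass


class ReduceSink(Operator):
    pass


class Union(Operator):
    # A Union feeding a hinted MapJoin is the case that should be rejected.
    allowed_before_map_join = False


class MapJoin(Operator):
    pass


def hook_based_check(map_join):
    """Return True if every operator upstream of `map_join` reports it is allowed."""
    stack = list(map_join.parents)
    while stack:
        op = stack.pop()
        if not op.allowed_before_map_join:
            return False
        stack.extend(op.parents)
    return True


def spark_tree(with_union):
    """Spark-mode shape assumed here: one ReduceSink is still upstream of the MapJoin."""
    small_side = TableScan()
    big_side = Union(TableScan(), TableScan()) if with_union else TableScan()
    return MapJoin(small_side, ReduceSink(big_side))


def check(with_union, reduce_sink_allowed):
    """Run the hook-based check with the ReduceSink flag temporarily set."""
    ReduceSink.allowed_before_map_join = reduce_sink_allowed
    try:
        return hook_based_check(spark_tree(with_union))
    finally:
        ReduceSink.allowed_before_map_join = True
```

In this model, {{check(False, True)}} passes and {{check(True, True)}} fails, which is the desired behaviour. Once the ReduceSink hook returns false, {{check(False, False)}} fails as well, so an ordinary hinted mapjoin without any Union is rejected too.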


[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same

2016-01-18 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated HIVE-12736:
-
Attachment: HIVE-12736.4-spark.patch


[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same

2016-01-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-12736:
---
Description: 
{code}
select  * from staff;
1   jone    22  1
2   lucy    21  1
3   hmm     22  2
4   james   24  3
5   xiaoliu 23  3

select id,date_ from trade union all select id,"test" from trade ;
1   201510210908
2   201509080234
2   201509080235
1   test
2   test
2   test

set hive.execution.engine=spark;
set spark.master=local;
select /*+mapjoin(t)*/ * from staff s join 
(select id,date_ from trade union all select id,"test" from trade ) t on 
s.id=t.id;
1   jone    22  1   1   201510210908
2   lucy    21  1   2   201509080234
2   lucy    21  1   2   201509080235

set hive.execution.engine=mr;
select /*+mapjoin(t)*/ * from staff s join 
(select id,date_ from trade union all select id,"test" from trade ) t on 
s.id=t.id;
FAILED: SemanticException [Error 10227]: Not all clauses are supported with 
mapjoin hint. Please remove mapjoin hint.
{code}
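
For reference, the rows one would expect from this join can be worked out by hand from the sample data above. The following is a small Python sketch of that calculation, not anything Hive runs. It reads each staff row as id, name, age and a fourth unnamed column, which is an assumption about the garbled sample output, and the helper names {{union_all_subquery}} and {{inner_join}} are made up for the example.

```python
# Sample rows copied from the description; the staff column meanings are assumed.
staff = [
    (1, "jone", 22, 1),
    (2, "lucy", 21, 1),
    (3, "hmm", 22, 2),
    (4, "james", 24, 3),
    (5, "xiaoliu", 23, 3),
]
trade = [
    (1, "201510210908"),
    (2, "201509080234"),
    (2, "201509080235"),
]


def union_all_subquery(trade_rows):
    """select id,date_ from trade union all select id,"test" from trade"""
    first_branch = [(i, d) for i, d in trade_rows]
    second_branch = [(i, "test") for i, _ in trade_rows]
    return first_branch + second_branch


def inner_join(staff_rows, t_rows):
    """staff s join t on s.id = t.id, computed as a plain nested loop."""
    return [s + t for s in staff_rows for t in t_rows if s[0] == t[0]]


t = union_all_subquery(trade)
expected = inner_join(staff, t)
test_rows = [row for row in expected if row[-1] == "test"]
```

Under these assumptions {{expected}} holds six rows, three of which carry the literal {{test}}, so a result with only three rows is missing the whole second branch of the UNION ALL.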
I have two questions:
1. Why does the result of Hive on Spark not include the following records?
{code}
1   jone    22  1   1   test
2   lucy    21  1   2   test
2   lucy    21  1   2   test
{code}
2. Why are there two different ways of handling the same query?

explain 1:
{code}
set hive.execution.engine=spark;
set spark.master=local;
explain 
select id,date_ from trade union all select id,"test" from trade;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: trade
                  Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: id (type: int), date_ (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 12 Data size: 96 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: trade
                  Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: id (type: int), 'test' (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 12 Data size: 96 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}
explain 2:
{code}
set hive.execution.engine=spark;
set spark.master=local;
explain 
select /*+mapjoin(t)*/ * from staff s join 
(select id,date_ from trade union all select id,"test" from trade ) t on 
s.id=t.id;
OK
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: trade
                  Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: id (type: int), date_ (type: string)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                      Spark HashTable Sink Operator

[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same

2016-01-12 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated HIVE-12736:
-
Attachment: HIVE-12736.2-spark.patch


[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same

2016-01-11 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated HIVE-12736:
-
Attachment: HIVE-12736.1-spark.patch

{{SparkMapJoinProcessor}} misses some validation during {{convertMapJoin}}; in 
Spark mode, the query should behave the same way as in MR.


[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same

2015-12-23 Thread JoneZhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JoneZhang updated HIVE-12736:
-
Summary: It seems that result of Hive on Spark be mistaken and result of 
Hive and Hive on Spark are not the same  (was: It seems that result of Hive on 
Spark is mistake And result of Hive and Hive on Spark are not the same)
