[jira] [Updated] (HIVE-12205) Spark: unify spark statistics aggregation between local and remote spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12205: - Resolution: Fixed Status: Resolved (was: Patch Available) > Spark: unify spark statistics aggregation between local and remote spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch, HIVE-12205.2.patch, > HIVE-12205.3.patch > > > In class {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
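The duplicated aggregation logic described in the issue could be factored behind a single accumulator that both job-status classes feed. A minimal sketch of the idea; {{SparkStatisticsBuilder}} here is an illustrative name, not Hive's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a unified aggregation path. Both
// LocalSparkJobStatus and RemoteSparkJobStatus could feed task metrics
// into one builder instead of each summing counters in its own code path.
public class SparkStatisticsBuilder {
    private final Map<String, Long> totals = new HashMap<>();

    // Merge one task's counter into the job-level total.
    public SparkStatisticsBuilder add(String name, long value) {
        totals.merge(name, value, Long::sum);
        return this;
    }

    public Map<String, Long> build() {
        return new HashMap<>(totals);
    }

    public static void main(String[] args) {
        SparkStatisticsBuilder b = new SparkStatisticsBuilder();
        b.add("ExecutorRunTime", 120).add("ExecutorRunTime", 80);
        System.out.println(b.build().get("ExecutorRunTime")); // prints 200
    }
}
```

With a shared builder, the local and remote paths differ only in how they fetch the raw Spark metrics, not in how they aggregate them.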
[jira] [Commented] (HIVE-12205) Spark: unify spark statistics aggregation between local and remote spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150268#comment-15150268 ] Chengxiang Li commented on HIVE-12205: -- Merged to Spark branch; thanks, Chinna, for this contribution. > Spark: unify spark statistics aggregation between local and remote spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch, HIVE-12205.2.patch, > HIVE-12205.3.patch > > > In class {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12205) Spark: unify spark statistics aggregation between local and remote spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146312#comment-15146312 ] Chengxiang Li commented on HIVE-12205: -- +1 > Spark: unify spark statistics aggregation between local and remote spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch, HIVE-12205.2.patch > > > In class {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12205) Spark: unify spark statistics aggregation between local and remote spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135776#comment-15135776 ] Chengxiang Li commented on HIVE-12205: -- Thanks, Chinna. Sent from my iPhone > Spark: unify spark statistics aggregation between local and remote spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch, HIVE-12205.2.patch > > > In class {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12205) Spark: unify spark statistics aggregation between local and remote spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135777#comment-15135777 ] Chengxiang Li commented on HIVE-12205: -- Thanks, Chinna. I'm on vacation now; I will review this when I'm back in a week. From chengxiang's iPhone > Spark: unify spark statistics aggregation between local and remote spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch, HIVE-12205.2.patch > > > In class {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-7142) Hive multi serialization encoding support
[ https://issues.apache.org/jira/browse/HIVE-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127486#comment-15127486 ] Chengxiang Li commented on HIVE-7142: - If you want to store data in {{UTF-16}} or {{UTF-32}}, you should set {{serialization.encoding}} to {{UTF-16}} or {{UTF-32}}. > Hive multi serialization encoding support > - > > Key: HIVE-7142 > URL: https://issues.apache.org/jira/browse/HIVE-7142 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Chengxiang Li >Assignee: Chengxiang Li > Fix For: 0.14.0 > > Attachments: HIVE-7142.1.patch.txt, HIVE-7142.2.patch, > HIVE-7142.3.patch, HIVE-7142.4.patch > > > Currently Hive can only serialize data into UTF-8 bytes or > deserialize from UTF-8 bytes; real-world users may want to load other > kinds of encoded data into Hive directly. This jira is dedicated to supporting > serialization/deserialization of all kinds of encoded data in the SerDe layer. > Users only need to configure the serialization encoding at the table level by > setting it through a serde parameter, for example: > {code:sql} > CREATE TABLE person(id INT, name STRING, desc STRING)ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH > SERDEPROPERTIES("serialization.encoding"='GBK'); > {code} > or > {code:sql} > ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK'); > {code} > LIMITATIONS: Only LazySimpleSerDe supports the "serialization.encoding" property > in this patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
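As a rough illustration of what the {{serialization.encoding}} setting controls, the same string maps to different byte sequences under different charsets, so the SerDe must know which charset to use when encoding and decoding row text. A minimal Java sketch (plain JDK charset calls, not Hive code):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration of why a declared encoding matters: the byte representation
// of the same text differs between GBK and UTF-8.
public class EncodingDemo {
    public static void main(String[] args) {
        String s = "\u4e2d\u6587"; // "Zhongwen", two Chinese characters
        byte[] gbk = s.getBytes(Charset.forName("GBK"));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(gbk.length);  // GBK uses 2 bytes per character -> 4
        System.out.println(utf8.length); // UTF-8 uses 3 bytes per character here -> 6
        // Decoding with the declared charset round-trips cleanly; decoding
        // GBK bytes as UTF-8 would produce mojibake instead.
        System.out.println(new String(gbk, Charset.forName("GBK")).equals(s));
    }
}
```

This is exactly the mismatch the serde property avoids: GBK-encoded files read with a UTF-8 assumption would decode into garbage.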
[jira] [Updated] (HIVE-12888) TestSparkNegativeCliDriver does not run in Spark mode[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12888: - Attachment: HIVE-12888.1-spark.patch TestSparkNegativeCliDriver does not add the test hive conf dir to its classpath. > TestSparkNegativeCliDriver does not run in Spark mode[Spark Branch] > --- > > Key: HIVE-12888 > URL: https://issues.apache.org/jira/browse/HIVE-12888 > Project: Hive > Issue Type: Bug > Components: Spark >Affects Versions: 1.2.1 >Reporter: Chengxiang Li >Assignee: Chengxiang Li > Attachments: HIVE-12888.1-spark.patch > > > During testing, I found that TestSparkNegativeCliDriver actually runs in MR mode; it > should be fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.5-spark.patch [~xuefuz], yes, it's related; I missed something here. Group By before MapJoin is not allowed, and in MR mode the presence of a {{ReduceSinkOperator}} is used to check whether there is a Group By before the MapJoin; that conflicts with Spark mode, as mentioned before. Instead of validating MapJoin compatibility with other operators through {{opAllowedBeforeMapJoin()}} and {{opAllowedAfterMapJoin()}}, it should be easier and more appropriate to implement this through pattern matching. I didn't rewrite the validation for MR mode, just added new validation logic for Spark mode based on pattern matching. > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch, HIVE-12736.5-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. 
Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? > explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: >
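The pattern-match validation discussed in the comment above can be sketched as follows. This is a hypothetical illustration, not Hive's actual {{MapJoinProcessor}} code; the operator names and the {{parentsAllowed}} helper are stand-ins:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the pattern-match idea: instead of asking every
// operator opAllowedBeforeMapJoin(), inspect the parents of the map join
// directly and reject only the known-bad pattern (a Union feeding a
// MapJoin). Operator names are illustrative stand-ins for Hive's classes.
public class MapJoinValidator {
    static boolean parentsAllowed(List<String> parentOps) {
        return parentOps.stream().noneMatch("UnionOperator"::equals);
    }

    public static void main(String[] args) {
        // A ReduceSink parent is acceptable in Spark mode...
        System.out.println(parentsAllowed(
            Arrays.asList("ReduceSinkOperator", "TableScanOperator")));
        // ...but a Union parent should fail validation.
        System.out.println(parentsAllowed(Arrays.asList("UnionOperator")));
    }
}
```

Matching the specific illegal shape avoids the blanket rejection that a false {{ReduceSinkOperator::opAllowedBeforeMapJoin()}} would cause for every hinted mapjoin in Spark mode.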
[jira] [Commented] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106575#comment-15106575 ] Chengxiang Li commented on HIVE-12736: -- Besides, during testing I found that TestSparkNegativeCliDriver actually runs in MR mode; I will create another JIRA to track it. > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch, HIVE-12736.5-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? 
> explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {code} > explain 2: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on >
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.5-spark.patch I can't reproduce the mapjoin_memcheck.q failure locally; uploading the patch again to verify. > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch, HIVE-12736.5-spark.patch, > HIVE-12736.5-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? 
> explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {code} > explain 2: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > 
s.id=t.id;
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.3-spark.patch Yes, [~xuefuz], {{Operator::opAllowedBeforeMapJoin()}} and {{Operator::opAllowedAfterMapJoin()}} are only used by {{MapJoinProcessor::validateMapJoinTypes()}}. In MR mode, if there is a {{ReduceSinkOperator}} before the {{MapJoinOperator}}, the {{ReduceSinkOperator}} is removed from the operator tree, so {{ReduceSinkOperator::opAllowedBeforeMapJoin()}} would never be reached in MR mode. In Spark mode, only one of the two {{ReduceSinkOperator}}s before the {{MapJoinOperator}} is removed, so if {{ReduceSinkOperator::opAllowedBeforeMapJoin()}} returned false, every mapjoin with a hint would fail in Spark mode, which does not make sense; it should only fail when there is a {{UnionOperator}} before the {{MapJoinOperator}}. So the change does not affect MR mode, and it is required by Spark mode. Besides, I added a negative test for mapjoin with hint. 
> It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? 
> explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.4-spark.patch > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? 
> explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {code} > explain 2: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > 
s.id=t.id; > OK > STAGE DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS:
[jira] [Commented] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15104080#comment-15104080 ] Chengxiang Li commented on HIVE-12736: -- [~xuefuz], would you help to review this patch? > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch > > > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > I have two questions > 1.Why result of hive on spark not include the following record? > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > 2.Why there are two different ways of dealing same query? 
> explain 1: > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > explain 2: > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > OK > STAGE 
DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-2 > Spark > DagName:
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.2-spark.patch > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch > > > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > I have two questions > 1.Why result of hive on spark not include the following record? > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > 2.Why there are two different ways of dealing same query? 
> explain 1: > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > explain 2: > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > OK > STAGE 
DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-2 > Spark > DagName: jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3 >
[jira] [Updated] (HIVE-12736) It seems that the result of Hive on Spark is mistaken and the results of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.1-spark.patch {{SparkMapJoinProcessor}} misses some validation during {{convertMapJoin}}; in Spark mode, the query should work the same way as in MR. > It seems that the result of Hive on Spark is mistaken and the results of Hive > and Hive on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Attachments: HIVE-12736.1-spark.patch 
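A hypothetical Java sketch of the kind of check the update describes; the operator-list representation and method names are illustrative stand-ins, not Hive's actual `SparkMapJoinProcessor` API. Under a mapjoin hint, MR's planner rejects branches containing unsupported clauses such as UNION with SemanticException [Error 10227], and the Spark path should reject them the same way rather than silently dropping rows:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MapJoinHintCheck {
    // Operator kinds a hinted map-join branch may not contain (simplified;
    // "UNION" is the case from this ticket).
    private static final Set<String> UNSUPPORTED = new HashSet<>(Arrays.asList("UNION"));

    // Returns true if every operator in the branch is supported under a
    // mapjoin hint; in Hive this failure would surface as a SemanticException
    // rather than a boolean.
    public static boolean branchSupportsMapJoinHint(List<String> operators) {
        for (String op : operators) {
            if (UNSUPPORTED.contains(op)) {
                return false;
            }
        }
        return true;
    }
}
```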
[jira] [Commented] (HIVE-12736) It seems that the result of Hive on Spark is mistaken and the results of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088702#comment-15088702 ] Chengxiang Li commented on HIVE-12736: -- I will work on this issue. > It seems that the result of Hive on Spark is mistaken and the results of Hive > and Hive on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 
[jira] [Assigned] (HIVE-12736) It seems that the result of Hive on Spark is mistaken and the results of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li reassigned HIVE-12736: Assignee: Chengxiang Li (was: Xuefu Zhang) > It seems that the result of Hive on Spark is mistaken and the results of Hive > and Hive on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 
[jira] [Commented] (HIVE-12205) Spark: unify Spark statistics aggregation between local and remote Spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088795#comment-15088795 ] Chengxiang Li commented on HIVE-12205: -- [~chinnalalam], thanks for working on this. In your patch, the statistics aggregation is still computed separately in different methods (although in the same class now) for {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}. I suggest adding an initialize method in {{MetricsCollection}} with parameters {{String jobId, Map jobMetrics}}, so that {{LocalSparkJobStatus}} can reuse {{MetricsCollection}} to aggregate statistics as well. What do you think? Besides, could you create a ticket on RB for this? > Spark: unify Spark statistics aggregation between local and remote Spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark > Affects Versions: 1.1.0 > Reporter: Xuefu Zhang > Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch > > > In the classes {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, Spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
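The suggested refactoring could look roughly like the following; all names and signatures here are illustrative assumptions, not Hive's actual `MetricsCollection` API. The point is that the local client seeds the same collection the remote client already uses, so both share one aggregation path:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MetricsCollectionSketch {
    // metric name -> values reported by individual jobs
    private final Map<String, List<Long>> metricValues = new HashMap<>();

    // Proposed entry point for LocalSparkJobStatus: seed the collection
    // directly with metrics it already holds, instead of aggregating them
    // in a separate code path.
    public void initialize(String jobId, Map<String, Long> jobMetrics) {
        for (Map.Entry<String, Long> e : jobMetrics.entrySet()) {
            metricValues.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
    }

    // Single aggregation path, used regardless of where the metrics came from.
    public long total(String metric) {
        long sum = 0;
        for (long v : metricValues.getOrDefault(metric, Collections.emptyList())) {
            sum += v;
        }
        return sum;
    }
}
```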
[jira] [Commented] (HIVE-12569) Excessive console message from SparkClientImpl [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037187#comment-15037187 ] Chengxiang Li commented on HIVE-12569: -- As [~nemon] analyzed, the message comes from the Spark side; it seems Spark gets stuck in {{org.apache.spark.deploy.yarn.Client::monitorApplication}} because it never reaches an end state. Looks like a Spark issue. > Excessive console message from SparkClientImpl [Spark Branch] > - > > Key: HIVE-12569 > URL: https://issues.apache.org/jira/browse/HIVE-12569 > Project: Hive > Issue Type: Bug > Components: Spark > Affects Versions: 2.0.0 > Reporter: Xuefu Zhang > Assignee: Xuefu Zhang > Priority: Blocker > > {code} > 15/12/02 11:00:46 INFO client.SparkClientImpl: 15/12/02 11:00:46 INFO Client: > Application report for application_1442517343449_0038 (state: RUNNING) > 15/12/02 11:00:47 INFO client.SparkClientImpl: 15/12/02 11:00:47 INFO Client: > Application report for application_1442517343449_0038 (state: RUNNING) > 15/12/02 11:00:48 INFO client.SparkClientImpl: 15/12/02 11:00:48 INFO Client: > Application report for application_1442517343449_0038 (state: RUNNING) > 15/12/02 11:00:49 INFO client.SparkClientImpl: 15/12/02 11:00:49 INFO Client: > Application report for application_1442517343449_0038 (state: RUNNING) > 15/12/02 11:00:50 INFO client.SparkClientImpl: 15/12/02 11:00:50 INFO Client: > Application report for application_1442517343449_0038 (state: RUNNING) > {code} > I see this using the Hive CLI after a Spark job is launched, and it keeps going non-stop even after the job is finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12515) Clean up the SparkCounters-related code after removing counter-based stats collection [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037160#comment-15037160 ] Chengxiang Li commented on HIVE-12515: -- LGTM. BTW, if I recall right, the operator-level stats are not used anywhere except being printed to the console or log for user information. I think it's the right decision to keep this. > Clean up the SparkCounters-related code after removing counter-based stats > collection [Spark Branch] > -- > > Key: HIVE-12515 > URL: https://issues.apache.org/jira/browse/HIVE-12515 > Project: Hive > Issue Type: Improvement > Components: Spark > Reporter: Chengxiang Li > Assignee: Rui Li > Attachments: HIVE-12515.1-spark.patch, HIVE-12515.2-spark.patch > > > As SparkCounters is only used to collect stats, we do not need it anymore after HIVE-12411. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12569) Excessive console message from SparkClientImpl [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037299#comment-15037299 ] Chengxiang Li commented on HIVE-12569: -- Actually, it's not exactly an issue: although the Spark job finished, the Spark application is indeed still alive, so the reported state is correct; we just do not want to print it on the CLI console. Changing the log level should be the simplest solution. > Excessive console message from SparkClientImpl [Spark Branch] > - > > Key: HIVE-12569 > URL: https://issues.apache.org/jira/browse/HIVE-12569 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
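As a concrete sketch of that log-level suggestion (the logger name is taken from the messages quoted above; the file location is an assumption, e.g. Spark's conf/log4j.properties):

```properties
# Raise the level for the YARN client logger so the per-second
# "Application report ... (state: RUNNING)" lines no longer reach the console.
log4j.logger.org.apache.spark.deploy.yarn.Client=WARN
```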
[jira] [Commented] (HIVE-12515) Clean up the SparkCounters-related code after removing counter-based stats collection [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029680#comment-15029680 ] Chengxiang Li commented on HIVE-12515: -- {{SparkCounters}} is referenced in many classes in HoS, and I'm not sure how much code has changed since the last merge with master, so we may get many conflicts during merging if we remove {{SparkCounters}} on master. I think we can just do this in the spark branch; although {{org.apache.hadoop.hive.ql.stats.CounterStatsAggregatorSpark}} has been removed, it should be a quite simple conflict to resolve during the merge. > Clean up the SparkCounters-related code after removing counter-based stats > collection [Spark Branch] > -- > > Key: HIVE-12515 > URL: https://issues.apache.org/jira/browse/HIVE-12515 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12515) Clean up the SparkCounters-related code after removing counter-based stats collection [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029540#comment-15029540 ] Chengxiang Li commented on HIVE-12515: -- [~lirui], {{org.apache.hadoop.hive.ql.stats.CounterStatsAggregatorSpark}} is configured by class name, in a dynamic-injection style, so there is no compile-time dependency on it; it should be safe to remove. > Clean up the SparkCounters-related code after removing counter-based stats > collection [Spark Branch] > -- > > Key: HIVE-12515 > URL: https://issues.apache.org/jira/browse/HIVE-12515 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
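The no-compile-time-dependency point can be seen in a minimal reflection example: the aggregator is instantiated from a configured class-name string, so deleting the implementation class only makes the runtime lookup fail. Class names below are stand-ins, not Hive's actual configuration keys:

```java
public class DynamicInjectionDemo {
    // Returns an instance of the named class, or null if it cannot be loaded --
    // e.g. because the class was removed, as with CounterStatsAggregatorSpark.
    public static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            return null;
        }
    }
}
```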
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025999#comment-15025999 ] Chengxiang Li commented on HIVE-12466: -- SparkCounters is only used for stats collection now, so yes, I think we will not need SparkCounters anymore once counter-based stats collection is removed. As far as I know, no other Hive feature depends on SparkCounters. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 > Project: Hive > Issue Type: Bug > Components: Spark > Reporter: Rui Li > Assignee: Rui Li > Attachments: HIVE-12466.1-spark.patch > > > During a query, many of the following errors are found in the executor's log: > {noformat} > 03:47:28.759 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:28.762 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:30.707 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.tmp_tmp] has not initialized before. > 03:47:33.385 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:33.388 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:33.495 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:35.141 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > ... 
> {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026034#comment-15026034 ] Chengxiang Li commented on HIVE-12466: -- Yes, it does; at least at the time I implemented the counter-based stats collection for Spark, it did not relate to any other part of our work on HoS, so I assume it should work just as well now. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026081#comment-15026081 ] Chengxiang Li commented on HIVE-12466: -- Committed to the spark branch; thanks, Rui, for this contribution. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026089#comment-15026089 ] Chengxiang Li commented on HIVE-12466: -- HIVE-12515 has been created for the follow-up cleanup work. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024042#comment-15024042 ] Chengxiang Li commented on HIVE-12466: -- LGTM; waiting for the testing. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023574#comment-15023574 ] Chengxiang Li commented on HIVE-12466: -- Yes, [~lirui], the suffix is available in the operator conf. Since I haven't worked on HoS recently, it would take me some time to prepare a test environment; would you mind putting together a quick fix for this issue? I can do the review. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 > Project: Hive > Issue Type: Bug > Components: Spark >Reporter: Rui Li >Assignee: Xuefu Zhang > > During a query, lots of the following error found in executor's log: > {noformat} > 03:47:28.759 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:28.762 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:30.707 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.tmp_tmp] has not initialized before. > 03:47:33.385 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:33.388 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:33.495 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:35.141 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021611#comment-15021611 ] Chengxiang Li commented on HIVE-12466: -- [~xuefuz], due to a limitation of Spark accumulators, `SparkCounter` has to register each counter name before job execution. The error message shows that the specified counter name was not registered beforehand. By default, all the default Spark counters are collected via `SparkTask::getCounterPrefixes()`; `RECORDS_OUT_0`, `RECORDS_OUT_1_default.tmp_tmp` and `RECORDS_OUT_1_default.test_table` are not included. It seems the counter logic changed in `ReduceSinkOperator` and `FileSinkOperator`, so we need to update the logic of `SparkTask::getOperatorCounters`. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 > Project: Hive > Issue Type: Bug > Components: Spark >Reporter: Rui Li >Assignee: Xuefu Zhang > > During a query, lots of the following error found in executor's log: > {noformat} > 03:47:28.759 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:28.762 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:30.707 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.tmp_tmp] has not initialized before. > 03:47:33.385 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:33.388 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. 
> 03:47:33.495 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:35.141 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
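The register-before-use contract described in the comment above can be sketched as follows. This is a hypothetical illustration, not Hive's actual SparkCounters code: the names `CounterRegistry`, `register`, and `increment` are invented for this sketch. The point is only that incrementing a counter name that was never registered up front is reported as an error, which is exactly the symptom in the executor log.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the register-before-use contract: counter names
// must be registered before the job runs; incrementing an unregistered
// name fails, mirroring the "has not initialized before" errors above.
public class CounterRegistry {
    private final Map<String, Long> counters = new HashMap<>();

    // Called on the driver side, before job submission.
    public void register(String name) {
        counters.putIfAbsent(name, 0L);
    }

    // Called from tasks; returns false (where Hive would log an error)
    // when the counter was never registered.
    public boolean increment(String name, long delta) {
        Long current = counters.get(name);
        if (current == null) {
            return false; // "counter[...] has not initialized before."
        }
        counters.put(name, current + delta);
        return true;
    }

    public long value(String name) {
        return counters.getOrDefault(name, 0L);
    }
}
```

Under this sketch, the fix discussed in the thread amounts to making sure the name-generation logic on the registration side matches what the operators actually increment.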
[jira] [Commented] (HIVE-11533) Loop optimization for SIMD in integer comparisons
[ https://issues.apache.org/jira/browse/HIVE-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954723#comment-14954723 ] Chengxiang Li commented on HIVE-11533: -- Committed to master branch, thanks for the contribution, [~teddy.choi]. > Loop optimization for SIMD in integer comparisons > - > > Key: HIVE-11533 > URL: https://issues.apache.org/jira/browse/HIVE-11533 > Project: Hive > Issue Type: Sub-task > Components: Vectorization >Reporter: Teddy Choi >Assignee: Teddy Choi >Priority: Minor > Attachments: HIVE-11533.1.patch, HIVE-11533.2.patch, > HIVE-11533.3.patch, HIVE-11533.4.patch, HIVE-11533.5.patch > > > Long*CompareLong* classes can be optimized with subtraction and bitwise > operators for better SIMD optimization. > {code} > for(int i = 0; i != n; i++) { > outputVector[i] = vector1[0] > vector2[i] ? 1 : 0; > } > {code} > This issue will cover following classes; > - LongColEqualLongColumn > - LongColNotEqualLongColumn > - LongColGreaterLongColumn > - LongColGreaterEqualLongColumn > - LongColLessLongColumn > - LongColLessEqualLongColumn > - LongScalarEqualLongColumn > - LongScalarNotEqualLongColumn > - LongScalarGreaterLongColumn > - LongScalarGreaterEqualLongColumn > - LongScalarLessLongColumn > - LongScalarLessEqualLongColumn > - LongColEqualLongScalar > - LongColNotEqualLongScalar > - LongColGreaterLongScalar > - LongColGreaterEqualLongScalar > - LongColLessLongScalar > - LongColLessEqualLongScalar -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11533) Loop optimization for SIMD in integer comparisons
[ https://issues.apache.org/jira/browse/HIVE-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11533: - Fix Version/s: 2.0.0 > Loop optimization for SIMD in integer comparisons > - > > Key: HIVE-11533 > URL: https://issues.apache.org/jira/browse/HIVE-11533 > Project: Hive > Issue Type: Sub-task > Components: Vectorization >Reporter: Teddy Choi >Assignee: Teddy Choi >Priority: Minor > Fix For: 2.0.0 > > Attachments: HIVE-11533.1.patch, HIVE-11533.2.patch, > HIVE-11533.3.patch, HIVE-11533.4.patch, HIVE-11533.5.patch > > > Long*CompareLong* classes can be optimized with subtraction and bitwise > operators for better SIMD optimization. > {code} > for(int i = 0; i != n; i++) { > outputVector[i] = vector1[0] > vector2[i] ? 1 : 0; > } > {code} > This issue will cover following classes; > - LongColEqualLongColumn > - LongColNotEqualLongColumn > - LongColGreaterLongColumn > - LongColGreaterEqualLongColumn > - LongColLessLongColumn > - LongColLessEqualLongColumn > - LongScalarEqualLongColumn > - LongScalarNotEqualLongColumn > - LongScalarGreaterLongColumn > - LongScalarGreaterEqualLongColumn > - LongScalarLessLongColumn > - LongScalarLessEqualLongColumn > - LongColEqualLongScalar > - LongColNotEqualLongScalar > - LongColGreaterLongScalar > - LongColGreaterEqualLongScalar > - LongColLessLongScalar > - LongColLessEqualLongScalar -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11533) Loop optimization for SIMD in integer comparisons
[ https://issues.apache.org/jira/browse/HIVE-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952519#comment-14952519 ] Chengxiang Li commented on HIVE-11533: -- +1 > Loop optimization for SIMD in integer comparisons > - > > Key: HIVE-11533 > URL: https://issues.apache.org/jira/browse/HIVE-11533 > Project: Hive > Issue Type: Sub-task > Components: Vectorization >Reporter: Teddy Choi >Assignee: Teddy Choi >Priority: Minor > Attachments: HIVE-11533.1.patch, HIVE-11533.2.patch, > HIVE-11533.3.patch, HIVE-11533.4.patch, HIVE-11533.5.patch > > > Long*CompareLong* classes can be optimized with subtraction and bitwise > operators for better SIMD optimization. > {code} > for(int i = 0; i != n; i++) { > outputVector[i] = vector1[0] > vector2[i] ? 1 : 0; > } > {code} > This issue will cover following classes; > - LongColEqualLongColumn > - LongColNotEqualLongColumn > - LongColGreaterLongColumn > - LongColGreaterEqualLongColumn > - LongColLessLongColumn > - LongColLessEqualLongColumn > - LongScalarEqualLongColumn > - LongScalarNotEqualLongColumn > - LongScalarGreaterLongColumn > - LongScalarGreaterEqualLongColumn > - LongScalarLessLongColumn > - LongScalarLessEqualLongColumn > - LongColEqualLongScalar > - LongColNotEqualLongScalar > - LongColGreaterLongScalar > - LongColGreaterEqualLongScalar > - LongColLessLongScalar > - LongColLessEqualLongScalar -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11533) Loop optimization for SIMD in integer comparisons
[ https://issues.apache.org/jira/browse/HIVE-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947998#comment-14947998 ] Chengxiang Li commented on HIVE-11533: -- Very nice job; the patch looks good. Just one thing to note: I guess the performance data was measured with "selectedInUse" false. When "selectedInUse" is true, the loop cannot benefit from SIMD instructions, and in my previous experience the optimization can sometimes even degrade performance in that case. Have you verified that? > Loop optimization for SIMD in integer comparisons > - > > Key: HIVE-11533 > URL: https://issues.apache.org/jira/browse/HIVE-11533 > Project: Hive > Issue Type: Sub-task > Components: Vectorization >Reporter: Teddy Choi >Assignee: Teddy Choi >Priority: Minor > Attachments: HIVE-11533.1.patch, HIVE-11533.2.patch, > HIVE-11533.3.patch, HIVE-11533.4.patch > > > Long*CompareLong* classes can be optimized with subtraction and bitwise > operators for better SIMD optimization. > {code} > for(int i = 0; i != n; i++) { > outputVector[i] = vector1[0] > vector2[i] ? 1 : 0; > } > {code} > This issue will cover following classes; > - LongColEqualLongColumn > - LongColNotEqualLongColumn > - LongColGreaterLongColumn > - LongColGreaterEqualLongColumn > - LongColLessLongColumn > - LongColLessEqualLongColumn > - LongScalarEqualLongColumn > - LongScalarNotEqualLongColumn > - LongScalarGreaterLongColumn > - LongScalarGreaterEqualLongColumn > - LongScalarLessLongColumn > - LongScalarLessEqualLongColumn > - LongColEqualLongScalar > - LongColNotEqualLongScalar > - LongColGreaterLongScalar > - LongColGreaterEqualLongScalar > - LongColLessLongScalar > - LongColLessEqualLongScalar -- This message was sent by Atlassian JIRA (v6.3.4#6332)
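The transformation this issue describes, replacing the `? 1 : 0` conditional with subtraction and a shift so the loop has no data-dependent branch and the JIT can auto-vectorize it, can be sketched like this. This is an illustrative sketch, not the committed patch: the class and method names are made up, and it assumes the operands are far enough from Long.MIN_VALUE/MAX_VALUE that the subtraction cannot overflow (the real vectorized expressions carry the same caveat).

```java
// Branch-free long comparisons: the sign bit of (b - a), extracted with the
// unsigned right shift >>> 63, is 1 exactly when a > b (absent overflow).
// Loops written this way contain no branch, which helps SIMD generation.
public class SimdCompareSketch {
    // out[i] = scalar > vec[i] ? 1 : 0, without a conditional.
    public static long[] scalarGreaterCol(long scalar, long[] vec) {
        long[] out = new long[vec.length];
        for (int i = 0; i != vec.length; i++) {
            out[i] = (vec[i] - scalar) >>> 63;
        }
        return out;
    }

    // out[i] = a[i] == b[i] ? 1 : 0:
    // (a-b) | (b-a) has its sign bit set iff a != b, so shift and flip.
    public static long[] colEqualCol(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i != a.length; i++) {
            out[i] = (((a[i] - b[i]) | (b[i] - a[i])) >>> 63) ^ 1;
        }
        return out;
    }
}
```

The other comparison variants listed in the issue description follow the same pattern with the operand order or the final flip changed.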
[jira] [Commented] (HIVE-10238) Loop optimization for SIMD in IfExprColumnColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706287#comment-14706287 ] Chengxiang Li commented on HIVE-10238: -- +1 Loop optimization for SIMD in IfExprColumnColumn.txt Key: HIVE-10238 URL: https://issues.apache.org/jira/browse/HIVE-10238 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Teddy Choi Priority: Minor Attachments: HIVE-10238.2.patch, HIVE-10238.patch The ?: operator as following could not be vectorized in loop, we may transfer it into mathematical expression. {code:java} for(int j = 0; j != n; j++) { int i = sel[j]; outputVector[i] = (vector1[i] == 1 ? vector2[i] : vector3[i]); outputIsNull[i] = (vector1[i] == 1 ? arg2ColVector.isNull[i] : arg3ColVector.isNull[i]); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10179) Optimization for SIMD instructions in Hive
[ https://issues.apache.org/jira/browse/HIVE-10179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681520#comment-14681520 ] Chengxiang Li commented on HIVE-10179: -- Yes, [~teddy.choi], contributions are welcome; feel free to create new subtasks and contribute. Optimization for SIMD instructions in Hive -- Key: HIVE-10179 URL: https://issues.apache.org/jira/browse/HIVE-10179 Project: Hive Issue Type: Improvement Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: optimization [SIMD|http://en.wikipedia.org/wiki/SIMD] instructions could be found in most of current CPUs, such as Intel's SSE2, SSE3, SSE4.x, AVX and AVX2, and it would help Hive to outperform if we can vectorize the mathematical manipulation part of Hive. This umbrella JIRA may contain, but is not limited to, subtasks like: # Code schema adaption, current JVM is quite strictly on the code schema which could be transformed into SIMD instructions during execution. # New implementation of mathematical manipulation part of Hive which designed to be optimized for SIMD instructions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10238) Loop optimization for SIMD in IfExprColumnColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681506#comment-14681506 ] Chengxiang Li commented on HIVE-10238: -- Thanks for looking at this, [~teddy.choi]. I tried using bitwise operators before as well and found they did not perform better, but I'm not sure whether my approach was the same as yours; you could add new benchmark tests to verify that. If you are interested in this issue, please feel free to reassign it to yourself and keep working on it. Loop optimization for SIMD in IfExprColumnColumn.txt Key: HIVE-10238 URL: https://issues.apache.org/jira/browse/HIVE-10238 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor The ?: operator as following could not be vectorized in loop, we may transfer it into mathematical expression. {code:java} for(int j = 0; j != n; j++) { int i = sel[j]; outputVector[i] = (vector1[i] == 1 ? vector2[i] : vector3[i]); outputIsNull[i] = (vector1[i] == 1 ? arg2ColVector.isNull[i] : arg3ColVector.isNull[i]); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
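The rewrite the HIVE-10238 description asks for, turning the `?:` in the inner loop into straight-line bitwise arithmetic, can be sketched as below. This is an illustrative sketch rather than the committed patch; the class and method names are invented, and it assumes the selector column holds only 0/1 values, as the boolean input column does in the original loop.

```java
// Branch-free select: -sel is all-ones when sel == 1 and all-zeros when
// sel == 0, so (thenVal & mask) | (elseVal & ~mask) picks one operand
// without the ?: branch that blocks vectorization.
public class IfExprSketch {
    public static long[] select(long[] selector, long[] thenVec, long[] elseVec) {
        long[] out = new long[selector.length];
        for (int i = 0; i != selector.length; i++) {
            long mask = -selector[i]; // assumes selector[i] is 0 or 1
            out[i] = (thenVec[i] & mask) | (elseVec[i] & ~mask);
        }
        return out;
    }
}
```

The same masking trick applies to the `outputIsNull` assignment in the quoted loop, since null flags can be widened to 0/1 integers first.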
[jira] [Commented] (HIVE-10238) Loop optimization for SIMD in IfExprColumnColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692798#comment-14692798 ] Chengxiang Li commented on HIVE-10238: -- Hi, [~teddy.choi], could you upload the patch to Review Board for better review? Loop optimization for SIMD in IfExprColumnColumn.txt Key: HIVE-10238 URL: https://issues.apache.org/jira/browse/HIVE-10238 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Teddy Choi Priority: Minor Attachments: HIVE-10238.2.patch, HIVE-10238.patch The ?: operator as following could not be vectorized in loop, we may transfer it into mathematical expression. {code:java} for(int j = 0; j != n; j++) { int i = sel[j]; outputVector[i] = (vector1[i] == 1 ? vector2[i] : vector3[i]); outputIsNull[i] = (vector1[i] == 1 ? arg2ColVector.isNull[i] : arg3ColVector.isNull[i]); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11276) Optimization around job submission and adding jars [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630682#comment-14630682 ] Chengxiang Li commented on HIVE-11276: -- [~xuefuz], I reviewed the code in RemoteHiveSparkClient. The reason it needs to invoke refreshLocalResources() for every job submission is that a Hive user may use the ADD \[FILE|JAR|ARCHIVE\] command to add resources at runtime, so the Spark client needs to upload these resources to the Spark cluster before job execution. RemoteHiveSparkClient keeps a list recording all the resources it has already uploaded to the Spark cluster and uses it to filter out already-uploaded jars during refreshLocalResources(); only newly added jars are uploaded, and the list should stay quite small most of the time, so I don't think there is a performance issue here. Optimization around job submission and adding jars [Spark Branch] - Key: HIVE-11276 URL: https://issues.apache.org/jira/browse/HIVE-11276 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Chengxiang Li It seems that Hive on Spark has some room for performance improvement on job submission. Specifically, we are calling refreshLocalResources() for every job submission despite there is are no changes in the jar list. Since Hive on Spark is reusing the containers in the whole user session, we might be able to optimize that. We do need to take into consideration the case of dynamic allocation, in which new executors might be added. This task is some RD in this area. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
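The bookkeeping described in the comment above, a record of already-uploaded resources that filters each refresh down to only the new entries, can be sketched like this. The class and method names (`ResourceTracker`, `filterNew`) are invented for illustration; RemoteHiveSparkClient's actual implementation differs in the details.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: track what has already been shipped to the cluster so
// each refreshLocalResources() call only uploads resources not seen before.
public class ResourceTracker {
    private final Set<String> uploaded = new HashSet<>();

    // Returns the resources that still need uploading and marks them as sent.
    public List<String> filterNew(List<String> requested) {
        List<String> toUpload = new ArrayList<>();
        for (String r : requested) {
            if (uploaded.add(r)) { // Set.add() returns false if already present
                toUpload.add(r);
            }
        }
        return toUpload;
    }
}
```

With a small set like this, the per-submission cost is a handful of hash lookups, which supports the comment's argument that the repeated refresh is cheap when the jar list is unchanged.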
[jira] [Commented] (HIVE-11276) Optimization around job submission and adding jars [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630791#comment-14630791 ] Chengxiang Li commented on HIVE-11276: -- That makes sense to me; launching the Spark cluster during the first query execution would mislead users into thinking Hive on Spark is slower than it actually is. Besides, we could also open the Spark session when the user sets hive.execution.engine to spark. Optimization around job submission and adding jars [Spark Branch] - Key: HIVE-11276 URL: https://issues.apache.org/jira/browse/HIVE-11276 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Chengxiang Li It seems that Hive on Spark has some room for performance improvement on job submission. Specifically, we are calling refreshLocalResources() for every job submission despite there is are no changes in the jar list. Since Hive on Spark is reusing the containers in the whole user session, we might be able to optimize that. We do need to take into consideration the case of dynamic allocation, in which new executors might be added. This task is some RD in this area. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HIVE-11276) Optimization around job submission and adding jars [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li reassigned HIVE-11276: Assignee: Chengxiang Li Optimization around job submission and adding jars [Spark Branch] - Key: HIVE-11276 URL: https://issues.apache.org/jira/browse/HIVE-11276 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Chengxiang Li It seems that Hive on Spark has some room for performance improvement on job submission. Specifically, we are calling refreshLocalResources() for every job submission despite there is are no changes in the jar list. Since Hive on Spark is reusing the containers in the whole user session, we might be able to optimize that. We do need to take into consideration the case of dynamic allocation, in which new executors might be added. This task is some RD in this area. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11276) Optimization around job submission and adding jars [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630689#comment-14630689 ] Chengxiang Li commented on HIVE-11276: -- Besides, for the case of dynamic allocation, I'm not sure whether it would be influenced by this. From my point of view, since we use Spark APIs like SparkContext::addJar()/addFile() to upload resources to the Spark cluster, it should afterwards be Spark's responsibility to make sure its executor JVMs load these resources. In my previous tests of dynamic allocation, everything worked well. Optimization around job submission and adding jars [Spark Branch] - Key: HIVE-11276 URL: https://issues.apache.org/jira/browse/HIVE-11276 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Chengxiang Li It seems that Hive on Spark has some room for performance improvement on job submission. Specifically, we are calling refreshLocalResources() for every job submission despite there is are no changes in the jar list. Since Hive on Spark is reusing the containers in the whole user session, we might be able to optimize that. We do need to take into consideration the case of dynamic allocation, in which new executors might be added. This task is some RD in this area. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11267) Combine equavilent leaf works in SparkWork[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629069#comment-14629069 ] Chengxiang Li commented on HIVE-11267: -- [~xuefuz], I took a look at the FileSinkOperator implementation before; the write logic is quite complicated, and writing multiple times would break several of its design rules. I don't want to change FileSinkOperator a lot for this special-case optimization. Fetching twice would be just a few lines of code change and more efficient (the SparkWork only writes once). Actually, we can check for the existence of a FetchTask; if it does not exist, we can skip this optimization. Combine equavilent leaf works in SparkWork[Spark Branch] Key: HIVE-11267 URL: https://issues.apache.org/jira/browse/HIVE-11267 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor There could be multiple leaf works in a SparkWork, as in a self-union query. If the subqueries are the same, we may combine them, execute only once, and then fetch twice in the FetchTask. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11082: - Attachment: HIVE-11082.3-spark.patch fix nit format issue. Support multi edge between nodes in SparkPlan[Spark Branch] --- Key: HIVE-11082 URL: https://issues.apache.org/jira/browse/HIVE-11082 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-11082.1-spark.patch, HIVE-11082.2-spark.patch, HIVE-11082.3-spark.patch For Dynamic RDD caching optimization, we found SparkPlan::connect throw exception while we try to combine 2 works with same child, support multi edge between nodes in SparkPlan would help to enable dynamic RDD caching in more use cases, like self join and self union. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11082: - Fix Version/s: spark-branch Support multi edge between nodes in SparkPlan[Spark Branch] --- Key: HIVE-11082 URL: https://issues.apache.org/jira/browse/HIVE-11082 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Fix For: spark-branch Attachments: HIVE-11082.1-spark.patch, HIVE-11082.2-spark.patch, HIVE-11082.3-spark.patch For Dynamic RDD caching optimization, we found SparkPlan::connect throw exception while we try to combine 2 works with same child, support multi edge between nodes in SparkPlan would help to enable dynamic RDD caching in more use cases, like self join and self union. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11204) Research on recent failed qtests[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627641#comment-14627641 ] Chengxiang Li commented on HIVE-11204: -- OK, I will leave this issue open until the next merge from master, and verify it after the merge. Research on recent failed qtests[Spark Branch] -- Key: HIVE-11204 URL: https://issues.apache.org/jira/browse/HIVE-11204 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-11204.1-spark.patch, HIVE-11204.1-spark.patch Found some strange failed qtests in HIVE-11053 Hive QA, as it's pretty sure that failed qtests are not related to HIVE-11053 patch, so just reproduce and research it here. Failed tests: org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_bigdata org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_resolution org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_sort_1_23 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join_literals org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_mapreduce1 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_15 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_19 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_4 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_8 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_view -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11082: - Attachment: HIVE-11082.2-spark.patch update related qtest output. Support multi edge between nodes in SparkPlan[Spark Branch] --- Key: HIVE-11082 URL: https://issues.apache.org/jira/browse/HIVE-11082 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-11082.1-spark.patch, HIVE-11082.2-spark.patch For Dynamic RDD caching optimization, we found SparkPlan::connect throw exception while we try to combine 2 works with same child, support multi edge between nodes in SparkPlan would help to enable dynamic RDD caching in more use cases, like self join and self union. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11204) Research on recent failed qtests[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11204: - Attachment: HIVE-11204.1-spark.patch Research on recent failed qtests[Spark Branch] -- Key: HIVE-11204 URL: https://issues.apache.org/jira/browse/HIVE-11204 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-11204.1-spark.patch, HIVE-11204.1-spark.patch Found some strange failed qtests in HIVE-11053 Hive QA, as it's pretty sure that failed qtests are not related to HIVE-11053 patch, so just reproduce and research it here. Failed tests: org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_bigdata org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_resolution org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_sort_1_23 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join_literals org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_mapreduce1 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_15 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_19 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_4 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_8 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_view -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11082: - Attachment: HIVE-11082.1-spark.patch SparkPlan supports multiple edges between nodes by default; just remove the check in SparkPlan::connect. But self join/union does not actually benefit from RDD caching with this patch, as self join/union assigns different alias names to the source table, which makes the ReduceSinkOperators in the different MapWorks not equal to each other. Support multi edge between nodes in SparkPlan[Spark Branch] --- Key: HIVE-11082 URL: https://issues.apache.org/jira/browse/HIVE-11082 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-11082.1-spark.patch For Dynamic RDD caching optimization, we found SparkPlan::connect throw exception while we try to combine 2 works with same child, support multi edge between nodes in SparkPlan would help to enable dynamic RDD caching in more use cases, like self join and self union. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627414#comment-14627414 ] Chengxiang Li commented on HIVE-11082: -- There seem to be some failed tests, [~xuefuz]; I will check what's going on. Support multi edge between nodes in SparkPlan[Spark Branch] --- Key: HIVE-11082 URL: https://issues.apache.org/jira/browse/HIVE-11082 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-11082.1-spark.patch For Dynamic RDD caching optimization, we found SparkPlan::connect throw exception while we try to combine 2 works with same child, support multi edge between nodes in SparkPlan would help to enable dynamic RDD caching in more use cases, like self join and self union. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11204) Research on recent failed qtests[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627413#comment-14627413 ] Chengxiang Li commented on HIVE-11204: -- [~xuefuz], all of the above initializationErrors are due to some kind of missing-file issue, like: {code:java} java.io.FileNotFoundException: /data/hive-ptest/working/apache-git-source-source/itests/qtest/target/generated-test-sources/java/org/apache/hadoop/hive/cli/TestCliDriverQFileNames.txt {code} These files should be generated during the Maven generate-test-sources phase. I cannot reproduce these issues in my local environment, and it does not look like a Hive logic error; do you have any idea why these issues happen? Research on recent failed qtests[Spark Branch] -- Key: HIVE-11204 URL: https://issues.apache.org/jira/browse/HIVE-11204 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-11204.1-spark.patch, HIVE-11204.1-spark.patch Found some strange failed qtests in HIVE-11053 Hive QA, as it's pretty sure that failed qtests are not related to HIVE-11053 patch, so just reproduce and research it here. 
> Failed tests:
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_bigdata
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_resolution
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_sort_1_23
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join_literals
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_mapreduce1
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt2
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_15
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_19
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_4
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_8
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_view
[jira] [Commented] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627405#comment-14627405 ] Chengxiang Li commented on HIVE-11082: -- It would be easy to ignore the alias name during the comparison; what stops me from doing that is the execution logic afterward. The downstream operators distinguish different inputs by alias name, since they are logically different tables, so we would lose the alias information if we combined the MapWorks. One possible optimization is to cut the ReduceSinkOperator into a separate MapWork, so that we could cache the previous MapWork, which includes the operator chain before the ReduceSinkOperator. This optimization requires Hive on Spark to support appendable MapWorks, like MapWork -- MapWork -- ReduceWork, or MapWork -- ReduceWork -- MapWork.
> Support multi edge between nodes in SparkPlan[Spark Branch]
> Key: HIVE-11082
> URL: https://issues.apache.org/jira/browse/HIVE-11082
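To illustrate the SparkPlan::connect failure discussed above, here is a minimal, hypothetical sketch (not Hive's actual SparkPlan API; class and method names are illustrative). It contrasts a connect() that rejects a duplicate (parent, child) pair with one that stores edges in a list and therefore tolerates multiple edges between the same two nodes, which is what a combined work with a shared child needs, e.g. in a self-join:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Hive's real SparkPlan. A strict plan rejects a
// second edge between the same parent and child; a multi-edge plan keeps
// edges in a list, so connecting the same pair twice is allowed.
class MiniPlan {
    static final class Edge {
        final String parent, child;
        Edge(String parent, String child) { this.parent = parent; this.child = child; }
    }

    private final List<Edge> edges = new ArrayList<>();
    private final boolean allowMultiEdge;

    MiniPlan(boolean allowMultiEdge) { this.allowMultiEdge = allowMultiEdge; }

    void connect(String parent, String child) {
        if (!allowMultiEdge) {
            for (Edge e : edges) {
                if (e.parent.equals(parent) && e.child.equals(child)) {
                    // This mirrors the exception seen when two combined
                    // works share the same child.
                    throw new IllegalStateException(
                        "edge already exists: " + parent + " -> " + child);
                }
            }
        }
        edges.add(new Edge(parent, child));
    }

    int edgeCount(String parent, String child) {
        int n = 0;
        for (Edge e : edges) {
            if (e.parent.equals(parent) && e.child.equals(child)) n++;
        }
        return n;
    }
}
```

In a self-join, after the equivalent MapWorks are combined, both join inputs come from the same MapWork, so connect is called twice for the same (map, reducer) pair; the edge-list representation simply records two edges.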
[jira] [Assigned] (HIVE-11204) Research on recent failed qtests[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li reassigned HIVE-11204: Assignee: Chengxiang Li
> Research on recent failed qtests[Spark Branch]
> Key: HIVE-11204
> URL: https://issues.apache.org/jira/browse/HIVE-11204
[jira] [Updated] (HIVE-11204) Research on recent failed qtests[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11204: - Attachment: HIVE-11204.1-spark.patch Cannot reproduce this locally now; uploading an empty patch to re-run verification.
> Research on recent failed qtests[Spark Branch]
> Key: HIVE-11204
> URL: https://issues.apache.org/jira/browse/HIVE-11204
[jira] [Assigned] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li reassigned HIVE-11082: Assignee: Chengxiang Li
> Support multi edge between nodes in SparkPlan[Spark Branch]
> Key: HIVE-11082
> URL: https://issues.apache.org/jira/browse/HIVE-11082
[jira] [Commented] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618401#comment-14618401 ] Chengxiang Li commented on HIVE-11053: -- Committed to the Spark branch; thanks [~gallenvara_bg] for the contribution.
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Affects Versions: spark-branch
> Reporter: Chengxiang Li
> Assignee: GaoLun
> Priority: Minor
> Fix For: spark-branch
> Attachments: HIVE-11053.1-spark.patch, HIVE-11053.2-spark.patch, HIVE-11053.3-spark.patch, HIVE-11053.4-spark.patch, HIVE-11053.5-spark.patch, HIVE-11053.5-spark.patch
>
> Add some test cases for self-union, self-join, CTE, and repeated sub-queries to verify the work of combining equivalent works in HIVE-10844.
[jira] [Updated] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11053: - Fix Version/s: spark-branch
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Commented] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618235#comment-14618235 ] Chengxiang Li commented on HIVE-11053: -- The failed Spark tests should not be related to this patch; I will create another JIRA to track them.
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Commented] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618236#comment-14618236 ] Chengxiang Li commented on HIVE-11053: -- +1
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Updated] (HIVE-10850) Followup for HIVE-10550, check performance w.r.t. persistence level [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10850: - Assignee: GaoLun (was: Chengxiang Li)
> Followup for HIVE-10550, check performance w.r.t. persistence level [Spark Branch]
> Key: HIVE-10850
> URL: https://issues.apache.org/jira/browse/HIVE-10850
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Affects Versions: 1.2.0, 1.1.0
> Reporter: Xuefu Zhang
> Assignee: GaoLun
>
> In HIVE-10550, there was a discussion on the persistence level and whether we need to give users some control over it. This JIRA is to investigate further, especially by measuring performance under different conditions, and to continue the discussion.
[jira] [Updated] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11053: - Attachment: HIVE-11053.5-spark.patch Uploaded the patch again to relaunch the unit tests.
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607606#comment-14607606 ] Chengxiang Li commented on HIVE-11095: -- Hi [~xiaowei], after getting a +1, a patch needs to wait 24 hours before being committed, to make sure others have an opportunity to review as well; that is just the way the community works. The patch looks good.
> SerDeUtils another bug ,when Text is reused
> Key: HIVE-11095
> URL: https://issues.apache.org/jira/browse/HIVE-11095
> Project: Hive
> Issue Type: Bug
> Components: API, CLI
> Affects Versions: 0.14.0, 1.0.0, 1.2.0
> Environment: Hadoop 2.3.0-cdh5.0.0, Hive 0.14
> Reporter: xiaowei wang
> Assignee: xiaowei wang
> Fix For: 2.0.0
> Attachments: HIVE-11095.1.patch.txt, HIVE-11095.2.patch.txt, HIVE-11095.3.patch.txt
>
> {noformat} The method transformTextFromUTF8 invokes a problematic method of Text, getBytes(). Text.getBytes() returns the raw bytes; however, only data up to Text.length is valid. A better way is to use copyBytes() if you need the returned array to be precisely the length of the data, but copyBytes() was only added after hadoop1. {noformat}
> How I found this bug: when I queried data from an LZO table, I found in the results that the length of the current row was always larger than that of the previous row, and sometimes the current row contained the contents of the previous row. For example, I executed the SQL {code:sql} select * from web_searchhub where logdate=2015061003 {code} and the result is below. Notice that the second row's content contains the first row's content. {noformat} INFO [03:00:05.589] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254 2015061003 INFO [03:00:05.594] 18941e66-9962-44ad-81bc-3519f47ba274 session=901,thread=223ession=3151,thread=254 2015061003 {noformat} The content of the original LZO file is below, just 2 rows.
> {noformat} INFO [03:00:05.635] b88e0473-7530-494c-82d8-e2d2ebd2666c_forweb session=3148,thread=285 INFO [03:00:05.635] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285 {noformat} I think this error is caused by the Text reuse, and I found a solution. Additionally, the table create SQL is: {code:sql} CREATE EXTERNAL TABLE `web_searchhub`( `line` string) PARTITIONED BY ( `logdate` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\U' WITH SERDEPROPERTIES ( 'serialization.encoding'='GBK') STORED AS INPUTFORMAT com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat; LOCATION 'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub' {code}
[jira] [Commented] (HIVE-11138) Query fails when there isn't a comparator for an operator [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607574#comment-14607574 ] Chengxiang Li commented on HIVE-11138: -- +1, patch LGTM.
> Query fails when there isn't a comparator for an operator [Spark Branch]
> Key: HIVE-11138
> URL: https://issues.apache.org/jira/browse/HIVE-11138
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Rui Li
> Assignee: Rui Li
> Attachments: HIVE-11138.1-spark.patch
>
> In such cases, OperatorComparatorFactory should default to false instead of throwing an exception.
[jira] [Commented] (HIVE-10983) SerDeUtils bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-10983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602551#comment-14602551 ] Chengxiang Li commented on HIVE-10983: -- Nice find; thanks for working on this issue, [~xiaowei]. For the patch, do you think we can just use {code:java} return new Text(new String(text.getBytes(), 0, text.getLength(), previousCharset)) {code} so that we do not need the extra memory copy introduced in the patch?
> SerDeUtils bug ,when Text is reused
> Key: HIVE-10983
> URL: https://issues.apache.org/jira/browse/HIVE-10983
> Project: Hive
> Issue Type: Bug
> Components: API, CLI
> Affects Versions: 0.14.0, 1.0.0, 1.2.0
> Environment: Hadoop 2.3.0-cdh5.0.0, Hive 0.14
> Reporter: xiaowei wang
> Assignee: xiaowei wang
> Labels: patch
> Fix For: 0.14.1, 1.2.0
> Attachments: HIVE-10983.1.patch.txt, HIVE-10983.2.patch.txt
>
> {noformat} The method transformTextToUTF8 invokes a problematic method of Text, getBytes(). Text.getBytes() returns the raw bytes; however, only data up to Text.length is valid. A better way is to use copyBytes() if you need the returned array to be precisely the length of the data, but copyBytes() was only added after hadoop1. {noformat}
> When I queried data from an LZO table, I found in the results that the length of the current row was always larger than that of the previous row, and sometimes the current row contained the contents of the previous row. For example, I executed the SQL {code:sql} select * from web_searchhub where logdate=2015061003 {code} and the result is below. Notice that the second row's content contains the first row's content. {noformat} INFO [03:00:05.589] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254 2015061003 INFO [03:00:05.594] 18941e66-9962-44ad-81bc-3519f47ba274 session=901,thread=223ession=3151,thread=254 2015061003 {noformat} The content of the original LZO file is below, just 2 rows.
> {noformat} INFO [03:00:05.635] b88e0473-7530-494c-82d8-e2d2ebd2666c_forweb session=3148,thread=285 INFO [03:00:05.635] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285 {noformat} I think this error is caused by the Text reuse, and I found a solution. Additionally, the table create SQL is: {code:sql} CREATE EXTERNAL TABLE `web_searchhub`( `line` string) PARTITIONED BY ( `logdate` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\U' WITH SERDEPROPERTIES ( 'serialization.encoding'='GBK') STORED AS INPUTFORMAT com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat; LOCATION 'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub' {code}
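The getBytes()/getLength() pitfall described above can be reproduced without Hadoop. The sketch below uses a hypothetical stand-in for org.apache.hadoop.io.Text (ReusableText is not the real class, just enough to show the bug): the backing array only grows, so after a short record is read into a reused instance, getBytes() still contains trailing bytes of the previous, longer record, and only the first getLength() bytes are valid.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Hadoop's Text, reproducing the reuse bug:
// the buffer grows but never shrinks, and getBytes() returns the raw
// buffer rather than a copy trimmed to the valid length.
class ReusableText {
    private byte[] buf = new byte[0];
    private int length = 0;

    void set(String s, Charset cs) {
        byte[] b = s.getBytes(cs);
        if (b.length > buf.length) {
            buf = new byte[b.length];            // grow only; never shrink
        }
        System.arraycopy(b, 0, buf, 0, b.length); // stale tail bytes remain
        length = b.length;
    }

    byte[] getBytes() { return buf; }            // raw buffer, may exceed length
    int getLength()   { return length; }

    // Buggy transform: decodes the whole raw buffer, including stale bytes.
    String decodeBuggy(Charset cs) { return new String(getBytes(), cs); }

    // Fixed transform, per the suggestion in the comment above:
    // decode only the valid prefix [0, getLength()).
    String decodeFixed(Charset cs) {
        return new String(getBytes(), 0, getLength(), cs);
    }
}
```

After reusing the instance for a shorter record, decodeBuggy returns the short record with the tail of the previous record appended, exactly the symptom reported ("the current row contains the contents of the previous row"), while decodeFixed returns just the valid data.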
[jira] [Commented] (HIVE-10983) SerDeUtils bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-10983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602637#comment-14602637 ] Chengxiang Li commented on HIVE-10983: -- Great, [~xiaowei], let's wait for the unit test results. Besides, could you also verify it with your own test case?
> SerDeUtils bug ,when Text is reused
> Key: HIVE-10983
> URL: https://issues.apache.org/jira/browse/HIVE-10983
> Attachments: HIVE-10983.1.patch.txt, HIVE-10983.2.patch.txt, HIVE-10983.3.patch.txt, HIVE-10983.4.patch.txt
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602714#comment-14602714 ] Chengxiang Li commented on HIVE-11095: -- [~xiaowei], this should be the same issue as HIVE-10983; normally we prefer to handle it in a single JIRA. Would you like to merge this patch into HIVE-10983?
> SerDeUtils another bug ,when Text is reused
> Key: HIVE-11095
> URL: https://issues.apache.org/jira/browse/HIVE-11095
[jira] [Commented] (HIVE-10999) Upgrade Spark dependency to 1.4 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598935#comment-14598935 ] Chengxiang Li commented on HIVE-10999: -- The classpath update code change looks good to me; I'm +1 on this patch.
> Upgrade Spark dependency to 1.4 [Spark Branch]
> Key: HIVE-10999
> URL: https://issues.apache.org/jira/browse/HIVE-10999
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Xuefu Zhang
> Assignee: Rui Li
> Attachments: HIVE-10999.1-spark.patch, HIVE-10999.2-spark.patch, HIVE-10999.3-spark.patch, HIVE-10999.3-spark.patch
>
> Spark 1.4.0 is released. Let's update the dependency version from 1.3.1 to 1.4.0.
[jira] [Updated] (HIVE-10844) Combine equivalent Works for HoS[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10844: - Attachment: HIVE-10844.3-spark.patch
> Combine equivalent Works for HoS[Spark Branch]
> Key: HIVE-10844
> URL: https://issues.apache.org/jira/browse/HIVE-10844
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Chengxiang Li
> Assignee: Chengxiang Li
> Attachments: HIVE-10844.1-spark.patch, HIVE-10844.2-spark.patch, HIVE-10844.3-spark.patch
>
> Some Hive queries (like [TPCDS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql]) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork; combining these equivalent Works into a single one would help them benefit from the subsequent dynamic RDD caching optimization.
[jira] [Updated] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11053: - Assignee: GAOLUN
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Commented] (HIVE-10999) Upgrade Spark dependency to 1.4 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598755#comment-14598755 ] Chengxiang Li commented on HIVE-10999: -- It seems the latest uploaded patch passes all the tests except org.apache.hadoop.hive.cli.TestCliDriver.initializationError. :)
> Upgrade Spark dependency to 1.4 [Spark Branch]
> Key: HIVE-10999
> URL: https://issues.apache.org/jira/browse/HIVE-10999
[jira] [Updated] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11053: - Assignee: (was: GAOLUN)
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Commented] (HIVE-10844) Combine equivalent Works for HoS[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591169#comment-14591169 ] Chengxiang Li commented on HIVE-10844: -- The failed test should be irrelevant, [~xuefuz]; the patch is ready for review now.
> Combine equivalent Works for HoS[Spark Branch]
> Key: HIVE-10844
> URL: https://issues.apache.org/jira/browse/HIVE-10844
[jira] [Updated] (HIVE-10844) Combine equivalent Works for HoS[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10844: - Attachment: HIVE-10844.2-spark.patch Combine equivalent Works for HoS[Spark Branch] -- Key: HIVE-10844 URL: https://issues.apache.org/jira/browse/HIVE-10844 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10844.1-spark.patch, HIVE-10844.2-spark.patch Some Hive queries (like [TPCDS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql]) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork; combining these equivalent Works into a single one would help them benefit from the subsequent dynamic RDD caching optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9370) SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567055#comment-14567055 ] Chengxiang Li commented on HIVE-9370: - Thanks for asking, [~leftylev]. We print the error message to the CLI console, so I don't think we need to call this out specially in the documentation. SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch] -- Key: HIVE-9370 URL: https://issues.apache.org/jira/browse/HIVE-9370 Project: Hive Issue Type: Sub-task Components: Spark Reporter: yuyun.chen Assignee: Chengxiang Li Fix For: 1.1.0 Attachments: HIVE-9370.1-spark.patch Enabled Hive on Spark and ran BigBench Query 8, then got the following exception: 2015-01-14 11:43:46,057 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it. 2015-01-14 11:43:46,061 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it. 
2015-01-14 11:43:46,061 ERROR [main]: status.SparkJobMonitor (SessionState.java:printError(839)) - Status: Failed 2015-01-14 11:43:46,062 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=SparkRunJob start=1421206996052 end=1421207026062 duration=30010 from=org.apache.hadoop.hive.ql.exec.spark.status.SparkJobMonitor 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - 15/01/14 11:43:46 INFO RemoteDriver: Failed to run job 0a9a7782-0e0b-4561-8468-959a6d8df0a3 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - java.lang.InterruptedException 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at java.lang.Object.wait(Native Method) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at java.lang.Object.wait(Object.java:503) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1282) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1300) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1314) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) 
2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.rdd.RDD.collect(RDD.scala:780) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:262) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.RangePartitioner.init(Partitioner.scala:124) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:63) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:894) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:864) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.hadoop.hive.ql.exec.spark.SortByShuffler.shuffle(SortByShuffler.java:48) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at
[jira] [Updated] (HIVE-10844) Combine equivalent Works for HoS[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10844: - Attachment: HIVE-10844.1-spark.patch Combine equivalent Works for HoS[Spark Branch] -- Key: HIVE-10844 URL: https://issues.apache.org/jira/browse/HIVE-10844 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10844.1-spark.patch Some Hive queries (like [TPCDS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql]) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork; combining these equivalent Works into a single one would help them benefit from the subsequent dynamic RDD caching optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562312#comment-14562312 ] Chengxiang Li commented on HIVE-10550: -- Committed to the Spark branch; thanks [~xuefuz] for the review. Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch, HIVE-10550.4-spark.patch, HIVE-10550.5-spark.patch, HIVE-10550.6-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562241#comment-14562241 ] Chengxiang Li commented on HIVE-10550: -- Note: these configurations have been removed in the latest patch. Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch, HIVE-10550.4-spark.patch, HIVE-10550.5-spark.patch, HIVE-10550.6-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.6-spark.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch, HIVE-10550.4-spark.patch, HIVE-10550.5-spark.patch, HIVE-10550.6-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.5-spark.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch, HIVE-10550.4-spark.patch, HIVE-10550.5-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.4-spark.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch, HIVE-10550.4-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.2-spark.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: (was: HIVE-10550.2-spark.patch) Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.2-spark.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547807#comment-14547807 ] Chengxiang Li commented on HIVE-10550: -- I'm not sure why, but I keep failing to upload the patch to the hive-git repo on our RB; I will try again later. [~xuefuz], would you mind reviewing on GitHub (https://github.com/apache/hive/pull/36) first? Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541662#comment-14541662 ] Chengxiang Li commented on HIVE-10550: -- Newly added configurations:
||name||default value||
|hive.spark.dynamic.rdd.caching|true|
|hive.spark.dynamic.rdd.caching.threshold|100 * 1024 * 1024L (100M)|
Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.1.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
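The dynamic caching idea described in HIVE-10550 — compute a piece of equivalent work once and serve later requests from memory — can be sketched with a plain-Java memoization example. This is a conceptual illustration only; the class and signature names are invented and are not Hive's implementation:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch: the first request for a given "work signature" computes
// and stores the result; later requests with an equivalent signature reuse
// the cached copy instead of recomputing it (trading memory for CPU).
public class WorkResultCache {
    private final Map<String, List<Long>> cache = new HashMap<>();
    private int computations = 0; // counts how often real work was done

    public List<Long> getOrCompute(String signature, long base) {
        return cache.computeIfAbsent(signature, sig -> {
            computations++;
            // stand-in for an expensive table scan / RDD computation
            return Arrays.asList(base, base * 2, base * 3);
        });
    }

    public int computations() { return computations; }

    public static void main(String[] args) {
        WorkResultCache c = new WorkResultCache();
        List<Long> first = c.getOrCompute("scan(tableA)", 7L);
        List<Long> second = c.getOrCompute("scan(tableA)", 7L); // equivalent work
        System.out.println(first.equals(second)); // same data both times
        System.out.println(c.computations());     // computed only once
    }
}
```

The hypothetical `hive.spark.dynamic.rdd.caching.threshold` setting mentioned in the comments would correspond here to a size check before inserting into the map, so only sufficiently large results are cached.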
[jira] [Commented] (HIVE-10671) yarn-cluster mode offers a degraded performance from yarn-client [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543222#comment-14543222 ] Chengxiang Li commented on HIVE-10671: -- LGTM, +1 yarn-cluster mode offers a degraded performance from yarn-client [Spark Branch] --- Key: HIVE-10671 URL: https://issues.apache.org/jira/browse/HIVE-10671 Project: Hive Issue Type: Bug Components: Spark Reporter: Xuefu Zhang Assignee: Rui Li Attachments: HIVE-10671.1-spark.patch With Hive on Spark, users noticed that in certain cases spark.master=yarn-client offers 2x or 3x better performance than spark.master=yarn-cluster. However, yarn-cluster is what we recommend and support, so we should investigate and fix the problem. One such query is TPC-H query 22. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10548) Remove dependency to s3 repository in root pom
[ https://issues.apache.org/jira/browse/HIVE-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541373#comment-14541373 ] Chengxiang Li commented on HIVE-10548: -- Committed to master, thanks Szehon for review. Remove dependency to s3 repository in root pom -- Key: HIVE-10548 URL: https://issues.apache.org/jira/browse/HIVE-10548 Project: Hive Issue Type: Bug Components: Build Infrastructure Reporter: Szehon Ho Assignee: Chengxiang Li Attachments: HIVE-10548.2.patch, HIVE-10548.2.patch, HIVE-10548.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10548) Remove dependency to s3 repository in root pom
[ https://issues.apache.org/jira/browse/HIVE-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10548: - Attachment: HIVE-10548.2.patch Remove dependency to s3 repository in root pom -- Key: HIVE-10548 URL: https://issues.apache.org/jira/browse/HIVE-10548 Project: Hive Issue Type: Bug Components: Build Infrastructure Reporter: Szehon Ho Assignee: Chengxiang Li Attachments: HIVE-10548.2.patch, HIVE-10548.2.patch, HIVE-10548.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10548) Remove dependency to s3 repository in root pom
[ https://issues.apache.org/jira/browse/HIVE-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10548: - Attachment: HIVE-10548.2.patch Remove dependency to s3 repository in root pom -- Key: HIVE-10548 URL: https://issues.apache.org/jira/browse/HIVE-10548 Project: Hive Issue Type: Bug Components: Build Infrastructure Reporter: Szehon Ho Assignee: Szehon Ho Attachments: HIVE-10548.2.patch, HIVE-10548.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10235) Loop optimization for SIMD in ColumnDivideColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495522#comment-14495522 ] Chengxiang Li commented on HIVE-10235: -- Environment:
java version 1.8.0_40
Java(TM) SE Runtime Environment (build 1.8.0_40-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)
Intel(R) Core(TM) i3-2130 CPU @ 3.40GHz
Linux version 2.6.32-279.el6.x86_64
Loop optimization for SIMD in ColumnDivideColumn.txt Key: HIVE-10235 URL: https://issues.apache.org/jira/browse/HIVE-10235 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-10235.1.patch Found two loops that could be optimized for packed instruction sets during execution. 1. hasDivBy0 depends on the result of the previous iteration, which prevents the loop from being vectorized. {code:java} for(int i = 0; i != n; i++) { OperandType2 denom = vector2[i]; outputVector[i] = vector1[0] OperatorSymbol denom; hasDivBy0 = hasDivBy0 || (denom == 0); } {code} 2. As in HIVE-10180, the vector2\[0\] reference prevents the JVM from optimizing the loop into packed instructions. {code:java} for(int i = 0; i != n; i++) { outputVector[i] = vector1[i] OperatorSymbol vector2[0]; } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10235) Loop optimization for SIMD in ColumnDivideColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495524#comment-14495524 ] Chengxiang Li commented on HIVE-10235: -- The failed test is irrelevant. [~gopalv], could you help review this patch? Loop optimization for SIMD in ColumnDivideColumn.txt Key: HIVE-10235 URL: https://issues.apache.org/jira/browse/HIVE-10235 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-10235.1.patch Found two loops that could be optimized for packed instruction sets during execution. 1. hasDivBy0 depends on the result of the previous iteration, which prevents the loop from being vectorized. {code:java} for(int i = 0; i != n; i++) { OperandType2 denom = vector2[i]; outputVector[i] = vector1[0] OperatorSymbol denom; hasDivBy0 = hasDivBy0 || (denom == 0); } {code} 2. As in HIVE-10180, the vector2\[0\] reference prevents the JVM from optimizing the loop into packed instructions. {code:java} for(int i = 0; i != n; i++) { outputVector[i] = vector1[i] OperatorSymbol vector2[0]; } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10180) Loop optimization for SIMD in ColumnArithmeticColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491819#comment-14491819 ] Chengxiang Li commented on HIVE-10180: -- Committed to trunk; thanks Gopal for the review. Loop optimization for SIMD in ColumnArithmeticColumn.txt Key: HIVE-10180 URL: https://issues.apache.org/jira/browse/HIVE-10180 Project: Hive Issue Type: Sub-task Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-10180.1.patch, HIVE-10180.2.patch The JVM is quite strict about the code shape that can be executed with SIMD instructions; take a loop in DoubleColAddDoubleColumn.java as an example: {code:java} for (int i = 0; i != n; i++) { outputVector[i] = vector1[0] + vector2[i]; } {code} The vector1[0] reference prevents the JVM from executing this part of the code with vectorized instructions; we need to assign vector1[0] to a variable outside the loop and use that variable inside the loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
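The hoisting fix described in HIVE-10180 can be sketched in standalone Java. This is a minimal illustration of the pattern only; the method names are invented and this is not the generated Hive operator code:

```java
// Sketch of the scalar-hoisting pattern: reading vector1[0] inside the loop
// defeats HotSpot auto-vectorization, so the value is loaded into a local
// variable before the loop. Both methods compute identical results.
public class HoistScalar {
    // original shape: vector1[0] is dereferenced on every iteration
    static void addScalarSlow(double[] vector1, double[] vector2, double[] out, int n) {
        for (int i = 0; i != n; i++) {
            out[i] = vector1[0] + vector2[i];
        }
    }

    // SIMD-friendly shape: the repeated element is hoisted out of the loop
    static void addScalarFast(double[] vector1, double[] vector2, double[] out, int n) {
        double v = vector1[0];
        for (int i = 0; i != n; i++) {
            out[i] = v + vector2[i];
        }
    }

    public static void main(String[] args) {
        double[] v1 = {2.0};
        double[] v2 = {1.0, 2.0, 3.0};
        double[] a = new double[3], b = new double[3];
        addScalarSlow(v1, v2, a, 3);
        addScalarFast(v1, v2, b, 3);
        System.out.println(java.util.Arrays.equals(a, b)); // identical results
    }
}
```

The transformation changes nothing semantically; it only gives the JIT a loop body free of the repeated array load, which is the shape its auto-vectorizer recognizes.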
[jira] [Updated] (HIVE-10235) Loop optimization for SIMD in ColumnDivideColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10235: - Attachment: HIVE-10235.1.patch Tested with the JMH VectorizationBench via the following command: {code:bash} java -jar hive-jmh/target/benchmarks.jar org.apache.hive.benchmark.vectorization VectorizationBench -wi 3 -i 5 -f 1 -bm avgt -tu ms {code} The performance results look like:
||Expressions||/w patch(ms)||/w/o patch(ms)||
|DoubleColDivideDoubleColumn|4033|6654|
|DoubleColDivideRepeatingDoubleColumn|1563|3048|
|LongColDivideLongColumn|7354|7561|
|LongColDivideRepeatingColumn|3161|3163|
For double-array division in the loop, the packed instruction vdivpd is used instead of vdivsd with the patch applied; there is no such packed instruction for long division, so long-array division in the loop shows no improvement. Loop optimization for SIMD in ColumnDivideColumn.txt Key: HIVE-10235 URL: https://issues.apache.org/jira/browse/HIVE-10235 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-10235.1.patch Found two loops that could be optimized for packed instruction sets during execution. 1. hasDivBy0 depends on the result of the previous iteration, which prevents the loop from being vectorized. {code:java} for(int i = 0; i != n; i++) { OperandType2 denom = vector2[i]; outputVector[i] = vector1[0] OperatorSymbol denom; hasDivBy0 = hasDivBy0 || (denom == 0); } {code} 2. As in HIVE-10180, the vector2\[0\] reference prevents the JVM from optimizing the loop into packed instructions. {code:java} for(int i = 0; i != n; i++) { outputVector[i] = vector1[i] OperatorSymbol vector2[0]; } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
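One way to break the loop-carried dependency on hasDivBy0 described in point 1 of HIVE-10235 can be sketched as follows. This is a hypothetical illustration of the general pattern (accumulate a numeric flag with no short-circuit control flow, derive the boolean after the loop), not the exact shape of the committed patch:

```java
// Sketch: replace the short-circuiting "hasDivBy0 = hasDivBy0 || ..." update,
// which carries a control dependency across iterations, with a plain numeric
// accumulation that keeps the loop body in a vectorizable shape.
public class DivFlag {
    static boolean divideAll(double[] vector1, double[] vector2, double[] out, int n) {
        int zeroCount = 0;
        for (int i = 0; i != n; i++) {
            double denom = vector2[i];
            out[i] = vector1[i] / denom; // double division never throws (yields Infinity/NaN)
            zeroCount += (denom == 0d) ? 1 : 0; // no short-circuit evaluation
        }
        return zeroCount > 0; // hasDivBy0
    }

    public static void main(String[] args) {
        double[] num = {1.0, 2.0, 3.0};
        double[] den = {1.0, 0.0, 3.0};
        double[] out = new double[3];
        System.out.println(divideAll(num, den, out, 3)); // a zero denominator exists
    }
}
```

Whether the JIT actually emits packed instructions for a given shape depends on the JVM version and CPU, as the vdivpd/vdivsd observation above shows; the point of the sketch is only the removal of the iteration-to-iteration boolean dependency.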
[jira] [Commented] (HIVE-10189) Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization
[ https://issues.apache.org/jira/browse/HIVE-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491822#comment-14491822 ] Chengxiang Li commented on HIVE-10189: -- Committed to the trunk, thanks Ferdinand for this contribution. Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization Key: HIVE-10189 URL: https://issues.apache.org/jira/browse/HIVE-10189 Project: Hive Issue Type: Sub-task Reporter: Ferdinand Xu Assignee: Ferdinand Xu Attachments: HIVE-10189.1.patch, HIVE-10189.2.patch, HIVE-10189.patch, avx-64.docx We should show the performance gain from SIMD optimization. The current score is as follows:
||Benchmark||Mode||Samples||Score||Error||Units||
|o.a.h.b.v.VectorizationBench.DoubleAddDoubleExpr.bench|avgt|2|20719.882|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.DoubleAddLongExpr.bench|avgt|2|22216.747|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.DoubleDivideDoubleExpr.bench|avgt|2|54319.682|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.DoubleDivideLongExpr.bench|avgt|2|34774.870|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.LongAddDoubleExpr.bench|avgt|2|47144.954|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.LongAddLongExpr.bench|avgt|2|21483.787|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.LongDivideDoubleExpr.bench|avgt|2|49765.990|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.LongDivideLongExpr.bench|avgt|2|34117.538|± NaN|ns/op|
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10189) Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization
[ https://issues.apache.org/jira/browse/HIVE-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485243#comment-14485243 ] Chengxiang Li commented on HIVE-10189: -- +1

Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization

Key: HIVE-10189
URL: https://issues.apache.org/jira/browse/HIVE-10189
Project: Hive
Issue Type: Sub-task
Reporter: Ferdinand Xu
Assignee: Ferdinand Xu
Attachments: HIVE-10189.1.patch, HIVE-10189.2.patch, HIVE-10189.patch, avx-64.docx

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10180) Loop optimization for SIMD in ColumnArithmeticColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482759#comment-14482759 ] Chengxiang Li commented on HIVE-10180: -- Does your machine support the AVX2 instruction set? You can verify this at [http://ark.intel.com/]. Besides, the JVM option -XX:UseAVX=<n> controls which AVX instruction set is used during execution.

Loop optimization for SIMD in ColumnArithmeticColumn.txt

Key: HIVE-10180
URL: https://issues.apache.org/jira/browse/HIVE-10180
Project: Hive
Issue Type: Sub-task
Reporter: Chengxiang Li
Assignee: Chengxiang Li
Priority: Minor
Attachments: HIVE-10180.1.patch, HIVE-10180.2.patch

The JVM is quite strict about the code shape that can be executed with SIMD instructions. Take a loop in DoubleColAddDoubleColumn.java for example:
{code:java}
for (int i = 0; i != n; i++) {
  outputVector[i] = vector1[0] + vector2[i];
}
{code}
The vector1[0] reference prevents the JVM from executing this part of the code with vectorized instructions; we need to assign vector1[0] to a variable outside the loop and use that variable inside the loop.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
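A minimal sketch of the transformation the issue describes (the class name here is illustrative; the real change is in Hive's generated column-arithmetic classes). The invariant vector1[0] load is assigned to a final local before the loop, which is also what the later "Set new variables to final" revision refers to:

```java
public class ScalarHoistSketch {
    // Shape described in the issue: the repeated vector1[0] load inside the
    // loop keeps the JVM from emitting packed (SIMD) instructions.
    static void addNaive(double[] vector1, double[] vector2, double[] outputVector, int n) {
        for (int i = 0; i != n; i++) {
            outputVector[i] = vector1[0] + vector2[i];
        }
    }

    // The fix: hoist vector1[0] into a final local outside the loop so the
    // body contains no array load on the invariant operand.
    static void addHoisted(double[] vector1, double[] vector2, double[] outputVector, int n) {
        final double value = vector1[0];
        for (int i = 0; i != n; i++) {
            outputVector[i] = value + vector2[i];
        }
    }

    public static void main(String[] args) {
        double[] v1 = {1.5, 0, 0, 0};
        double[] v2 = {0.5, 1.5, 2.5, 3.5};
        double[] a = new double[4], b = new double[4];
        addNaive(v1, v2, a, 4);
        addHoisted(v1, v2, b, 4);
        for (int i = 0; i < 4; i++) {
            if (a[i] != b[i]) throw new AssertionError("mismatch at " + i);
        }
        System.out.println(java.util.Arrays.toString(b));
    }
}
```

The two methods compute identical results; only the loop body's memory-access pattern changes, which is what lets the JIT emit packed instructions.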
[jira] [Updated] (HIVE-10180) Loop optimization for SIMD in ColumnArithmeticColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10180: - Attachment: HIVE-10180.2.patch Set new variables to final.

Loop optimization for SIMD in ColumnArithmeticColumn.txt

Key: HIVE-10180
URL: https://issues.apache.org/jira/browse/HIVE-10180
Project: Hive
Issue Type: Sub-task
Reporter: Chengxiang Li
Assignee: Chengxiang Li
Priority: Minor
Attachments: HIVE-10180.1.patch, HIVE-10180.2.patch

-- This message was sent by Atlassian JIRA (v6.3.4#6332)