[jira] [Updated] (HIVE-12205) Spark: unify spark statistics aggregation between local and remote spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12205: - Resolution: Fixed Status: Resolved (was: Patch Available) > Spark: unify spark statistics aggregation between local and remote spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch, HIVE-12205.2.patch, > HIVE-12205.3.patch > > > In class {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
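The duplicated aggregation logic described in the issue could be factored behind a single accumulator that both job-status classes feed. A minimal sketch of the idea; {{SparkStatisticsBuilder}} here is an illustrative name, not Hive's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a unified aggregation path. Both
// LocalSparkJobStatus and RemoteSparkJobStatus could feed task metrics
// into one builder instead of each summing counters in its own code path.
public class SparkStatisticsBuilder {
    private final Map<String, Long> totals = new HashMap<>();

    // Merge one task's counter into the job-level total.
    public SparkStatisticsBuilder add(String name, long value) {
        totals.merge(name, value, Long::sum);
        return this;
    }

    public Map<String, Long> build() {
        return new HashMap<>(totals);
    }

    public static void main(String[] args) {
        SparkStatisticsBuilder b = new SparkStatisticsBuilder();
        b.add("ExecutorRunTime", 120).add("ExecutorRunTime", 80);
        System.out.println(b.build().get("ExecutorRunTime")); // prints 200
    }
}
```

With a shared builder, the local and remote paths differ only in how they fetch the raw Spark metrics, not in how they aggregate them.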
[jira] [Commented] (HIVE-12205) Spark: unify spark statistics aggregation between local and remote spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150268#comment-15150268 ] Chengxiang Li commented on HIVE-12205: -- Merged to Spark branch; thanks, Chinna, for this contribution. > Spark: unify spark statistics aggregation between local and remote spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch, HIVE-12205.2.patch, > HIVE-12205.3.patch > > > In class {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12205) Spark: unify spark statistics aggregation between local and remote spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146312#comment-15146312 ] Chengxiang Li commented on HIVE-12205: -- +1 > Spark: unify spark statistics aggregation between local and remote spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch, HIVE-12205.2.patch > > > In class {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12205) Spark: unify spark statistics aggregation between local and remote spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135776#comment-15135776 ] Chengxiang Li commented on HIVE-12205: -- Thanks, Chinna. Sent from my iPhone > Spark: unify spark statistics aggregation between local and remote spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch, HIVE-12205.2.patch > > > In class {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12205) Spark: unify spark statistics aggregation between local and remote spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135777#comment-15135777 ] Chengxiang Li commented on HIVE-12205: -- Thanks, Chinna. I'm on vacation now; I will review this when I'm back in a week. From chengxiang's iPhone > Spark: unify spark statistics aggregation between local and remote spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch, HIVE-12205.2.patch > > > In class {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-7142) Hive multi serialization encoding support
[ https://issues.apache.org/jira/browse/HIVE-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127486#comment-15127486 ] Chengxiang Li commented on HIVE-7142: - If you want to store data in {{UTF-16}} or {{UTF-32}}, you should set {{serialization.encoding}} to {{UTF-16}} or {{UTF-32}}. > Hive multi serialization encoding support > - > > Key: HIVE-7142 > URL: https://issues.apache.org/jira/browse/HIVE-7142 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Chengxiang Li >Assignee: Chengxiang Li > Fix For: 0.14.0 > > Attachments: HIVE-7142.1.patch.txt, HIVE-7142.2.patch, > HIVE-7142.3.patch, HIVE-7142.4.patch > > > Currently Hive can only serialize data into UTF-8 bytes or > deserialize from UTF-8 bytes; real-world users may want to load other > kinds of encoded data into Hive directly. This jira is dedicated to supporting > serialization/deserialization of all kinds of encoded data in the SerDe layer. > Users only need to configure the serialization encoding at the table level by > setting it through a serde parameter, for example: > {code:sql} > CREATE TABLE person(id INT, name STRING, desc STRING)ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH > SERDEPROPERTIES("serialization.encoding"='GBK'); > {code} > or > {code:sql} > ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK'); > {code} > LIMITATIONS: Only LazySimpleSerDe supports the "serialization.encoding" property > in this patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
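As a rough illustration of what the {{serialization.encoding}} setting controls, the same string maps to different byte sequences under different charsets, so the SerDe must know which charset to use when encoding and decoding row text. A minimal Java sketch (plain JDK charset calls, not Hive code):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration of why a declared encoding matters: the byte representation
// of the same text differs between GBK and UTF-8.
public class EncodingDemo {
    public static void main(String[] args) {
        String s = "\u4e2d\u6587"; // "Zhongwen", two Chinese characters
        byte[] gbk = s.getBytes(Charset.forName("GBK"));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(gbk.length);  // GBK uses 2 bytes per character -> 4
        System.out.println(utf8.length); // UTF-8 uses 3 bytes per character here -> 6
        // Decoding with the declared charset round-trips cleanly; decoding
        // GBK bytes as UTF-8 would produce mojibake instead.
        System.out.println(new String(gbk, Charset.forName("GBK")).equals(s));
    }
}
```

This is exactly the mismatch the serde property avoids: GBK-encoded files read with a UTF-8 assumption would decode into garbage.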
[jira] [Updated] (HIVE-12888) TestSparkNegativeCliDriver does not run in Spark mode[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12888: - Attachment: HIVE-12888.1-spark.patch TestSparkNegativeCliDriver does not add the test hive conf dir to its classpath. > TestSparkNegativeCliDriver does not run in Spark mode[Spark Branch] > --- > > Key: HIVE-12888 > URL: https://issues.apache.org/jira/browse/HIVE-12888 > Project: Hive > Issue Type: Bug > Components: Spark >Affects Versions: 1.2.1 >Reporter: Chengxiang Li >Assignee: Chengxiang Li > Attachments: HIVE-12888.1-spark.patch > > > During testing, I found that TestSparkNegativeCliDriver actually runs in MR mode; it > should be fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.5-spark.patch [~xuefuz], yes, it's related; I missed something here. Group By before MapJoin is not allowed, and in MR mode the presence of a {{ReduceSinkOperator}} is used to check whether there is a Group By before the MapJoin; that conflicts with Spark mode, as mentioned before. Instead of validating MapJoin compatibility with other operators through {{opAllowedBeforeMapJoin()}} and {{opAllowedAfterMapJoin()}}, it should be easier and more appropriate to implement this through pattern matching. I didn't rewrite the validation for MR mode, just added new validation logic for Spark mode based on pattern matching. > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch, HIVE-12736.5-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. 
Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? > explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: >
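The pattern-match validation discussed in the comment above can be sketched as follows. This is a hypothetical illustration, not Hive's actual {{MapJoinProcessor}} code; the operator names and the {{parentsAllowed}} helper are stand-ins:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the pattern-match idea: instead of asking every
// operator opAllowedBeforeMapJoin(), inspect the parents of the map join
// directly and reject only the known-bad pattern (a Union feeding a
// MapJoin). Operator names are illustrative stand-ins for Hive's classes.
public class MapJoinValidator {
    static boolean parentsAllowed(List<String> parentOps) {
        return parentOps.stream().noneMatch("UnionOperator"::equals);
    }

    public static void main(String[] args) {
        // A ReduceSink parent is acceptable in Spark mode...
        System.out.println(parentsAllowed(
            Arrays.asList("ReduceSinkOperator", "TableScanOperator")));
        // ...but a Union parent should fail validation.
        System.out.println(parentsAllowed(Arrays.asList("UnionOperator")));
    }
}
```

Matching the specific illegal shape avoids the blanket rejection that a false {{ReduceSinkOperator::opAllowedBeforeMapJoin()}} would cause for every hinted mapjoin in Spark mode.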
[jira] [Commented] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106575#comment-15106575 ] Chengxiang Li commented on HIVE-12736: -- Besides, during testing I found that TestSparkNegativeCliDriver actually runs in MR mode; I will create another JIRA to track it. > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch, HIVE-12736.5-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? 
> explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {code} > explain 2: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on >
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.5-spark.patch I can't reproduce the mapjoin_memcheck.q failure locally; uploading the patch again to verify. > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch, HIVE-12736.5-spark.patch, > HIVE-12736.5-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? 
> explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {code} > explain 2: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > 
s.id=t.id;
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.3-spark.patch Yes, [~xuefuz], {{Operator::opAllowedBeforeMapJoin()}} and {{Operator::opAllowedAfterMapJoin()}} are only used by {{MapJoinProcessor::validateMapJoinTypes()}}. In MR mode, if there is a {{ReduceSinkOperator}} before the {{MapJoinOperator}}, the {{ReduceSinkOperator}} is removed from the operator tree, so {{ReduceSinkOperator::opAllowedBeforeMapJoin()}} would never be reached in MR mode. In Spark mode, only one of the two {{ReduceSinkOperator}}s before the {{MapJoinOperator}} is removed, so if {{ReduceSinkOperator::opAllowedBeforeMapJoin()}} returned false, every mapjoin with a hint would fail in Spark mode, which does not make sense; it should only fail when there is a {{UnionOperator}} before the {{MapJoinOperator}}. So the change does not affect MR mode, and it is required by Spark mode. Besides, I added a negative test for mapjoin with hint. 
> It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? 
> explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.4-spark.patch > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? 
> explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {code} > explain 2: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > 
s.id=t.id; > OK > STAGE DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS:
[jira] [Commented] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15104080#comment-15104080 ] Chengxiang Li commented on HIVE-12736: -- [~xuefuz], would you help to review this patch? > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch > > > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > I have two questions > 1.Why result of hive on spark not include the following record? > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > 2.Why there are two different ways of dealing same query? 
> explain 1: > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > explain 2: > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > OK > STAGE 
DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-2 > Spark > DagName:
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.2-spark.patch > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch > > > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > I have two questions > 1.Why result of hive on spark not include the following record? > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > 2.Why there are two different ways of dealing same query? 
> explain 1: > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > explain 2: > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > OK > STAGE 
DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-2 > Spark > DagName: jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3 >
[jira] [Updated] (HIVE-12736) It seems that the result of Hive on Spark is mistaken and the results of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.1-spark.patch {{SparkMapJoinProcessor}} misses some validation during {{convertMapJoin}}; in Spark mode, the query should work the same way as in MR. > It seems that the result of Hive on Spark is mistaken and the results of Hive > and Hive on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Attachments: HIVE-12736.1-spark.patch 
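A hypothetical Java sketch of the kind of check the update describes; the operator-list representation and method names are illustrative stand-ins, not Hive's actual `SparkMapJoinProcessor` API. Under a mapjoin hint, MR's planner rejects branches containing unsupported clauses such as UNION with SemanticException [Error 10227], and the Spark path should reject them the same way rather than silently dropping rows:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MapJoinHintCheck {
    // Operator kinds a hinted map-join branch may not contain (simplified;
    // "UNION" is the case from this ticket).
    private static final Set<String> UNSUPPORTED = new HashSet<>(Arrays.asList("UNION"));

    // Returns true if every operator in the branch is supported under a
    // mapjoin hint; in Hive this failure would surface as a SemanticException
    // rather than a boolean.
    public static boolean branchSupportsMapJoinHint(List<String> operators) {
        for (String op : operators) {
            if (UNSUPPORTED.contains(op)) {
                return false;
            }
        }
        return true;
    }
}
```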
[jira] [Commented] (HIVE-12736) It seems that the result of Hive on Spark is mistaken and the results of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088702#comment-15088702 ] Chengxiang Li commented on HIVE-12736: -- I will work on this issue. > It seems that the result of Hive on Spark is mistaken and the results of Hive > and Hive on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 
[jira] [Assigned] (HIVE-12736) It seems that the result of Hive on Spark is mistaken and the results of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li reassigned HIVE-12736: Assignee: Chengxiang Li (was: Xuefu Zhang) > It seems that the result of Hive on Spark is mistaken and the results of Hive > and Hive on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 
[jira] [Commented] (HIVE-12205) Spark: unify Spark statistics aggregation between local and remote Spark client
[ https://issues.apache.org/jira/browse/HIVE-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088795#comment-15088795 ] Chengxiang Li commented on HIVE-12205: -- [~chinnalalam], thanks for working on this. In your patch, the statistics aggregation is still computed separately in different methods (although in the same class now) for {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}. I suggest adding an initialize method in {{MetricsCollection}} with parameters {{String jobId, Map jobMetrics}}, so that {{LocalSparkJobStatus}} can reuse {{MetricsCollection}} to aggregate statistics as well. What do you think? Besides, could you create a ticket on RB for this? > Spark: unify Spark statistics aggregation between local and remote Spark > client > -- > > Key: HIVE-12205 > URL: https://issues.apache.org/jira/browse/HIVE-12205 > Project: Hive > Issue Type: Task > Components: Spark > Affects Versions: 1.1.0 > Reporter: Xuefu Zhang > Assignee: Chinna Rao Lalam > Attachments: HIVE-12205.1.patch > > > In the classes {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, Spark > statistics aggregation is done similarly but in different code paths. Ideally, > we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
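The suggested refactoring could look roughly like the following; all names and signatures here are illustrative assumptions, not Hive's actual `MetricsCollection` API. The point is that the local client seeds the same collection the remote client already uses, so both share one aggregation path:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MetricsCollectionSketch {
    // metric name -> values reported by individual jobs
    private final Map<String, List<Long>> metricValues = new HashMap<>();

    // Proposed entry point for LocalSparkJobStatus: seed the collection
    // directly with metrics it already holds, instead of aggregating them
    // in a separate code path.
    public void initialize(String jobId, Map<String, Long> jobMetrics) {
        for (Map.Entry<String, Long> e : jobMetrics.entrySet()) {
            metricValues.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
    }

    // Single aggregation path, used regardless of where the metrics came from.
    public long total(String metric) {
        long sum = 0;
        for (long v : metricValues.getOrDefault(metric, Collections.emptyList())) {
            sum += v;
        }
        return sum;
    }
}
```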
[jira] [Commented] (HIVE-12569) Excessive console message from SparkClientImpl [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037187#comment-15037187 ] Chengxiang Li commented on HIVE-12569: -- As [~nemon] analyzed, the message comes from the Spark side; it seems Spark gets stuck in {{org.apache.spark.deploy.yarn.Client::monitorApplication}} because it never reaches an end state. Looks like a Spark issue. > Excessive console message from SparkClientImpl [Spark Branch] > - > > Key: HIVE-12569 > URL: https://issues.apache.org/jira/browse/HIVE-12569 > Project: Hive > Issue Type: Bug > Components: Spark > Affects Versions: 2.0.0 > Reporter: Xuefu Zhang > Assignee: Xuefu Zhang > Priority: Blocker > > {code} > 15/12/02 11:00:46 INFO client.SparkClientImpl: 15/12/02 11:00:46 INFO Client: > Application report for application_1442517343449_0038 (state: RUNNING) > 15/12/02 11:00:47 INFO client.SparkClientImpl: 15/12/02 11:00:47 INFO Client: > Application report for application_1442517343449_0038 (state: RUNNING) > 15/12/02 11:00:48 INFO client.SparkClientImpl: 15/12/02 11:00:48 INFO Client: > Application report for application_1442517343449_0038 (state: RUNNING) > 15/12/02 11:00:49 INFO client.SparkClientImpl: 15/12/02 11:00:49 INFO Client: > Application report for application_1442517343449_0038 (state: RUNNING) > 15/12/02 11:00:50 INFO client.SparkClientImpl: 15/12/02 11:00:50 INFO Client: > Application report for application_1442517343449_0038 (state: RUNNING) > {code} > I see this using the Hive CLI after a Spark job is launched, and it keeps going non-stop even after the job is finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12515) Clean up the SparkCounters-related code after removing counter-based stats collection [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037160#comment-15037160 ] Chengxiang Li commented on HIVE-12515: -- LGTM. BTW, if I recall right, the operator-level stats are not used anywhere except being printed to the console or log for user information. I think it's the right decision to keep this. > Clean up the SparkCounters-related code after removing counter-based stats > collection [Spark Branch] > -- > > Key: HIVE-12515 > URL: https://issues.apache.org/jira/browse/HIVE-12515 > Project: Hive > Issue Type: Improvement > Components: Spark > Reporter: Chengxiang Li > Assignee: Rui Li > Attachments: HIVE-12515.1-spark.patch, HIVE-12515.2-spark.patch > > > As SparkCounters is only used to collect stats, we do not need it anymore after HIVE-12411. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12569) Excessive console message from SparkClientImpl [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037299#comment-15037299 ] Chengxiang Li commented on HIVE-12569: -- Actually, it's not exactly an issue: although the Spark job finished, the Spark application is indeed still alive, so the reported state is correct; we just do not want to print it on the CLI console. Changing the log level should be the simplest solution. > Excessive console message from SparkClientImpl [Spark Branch] > - > > Key: HIVE-12569 > URL: https://issues.apache.org/jira/browse/HIVE-12569 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
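As a concrete sketch of that log-level suggestion (the logger name is taken from the messages quoted above; the file location is an assumption, e.g. Spark's conf/log4j.properties):

```properties
# Raise the level for the YARN client logger so the per-second
# "Application report ... (state: RUNNING)" lines no longer reach the console.
log4j.logger.org.apache.spark.deploy.yarn.Client=WARN
```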
[jira] [Commented] (HIVE-12515) Clean up the SparkCounters-related code after removing counter-based stats collection [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029680#comment-15029680 ] Chengxiang Li commented on HIVE-12515: -- {{SparkCounters}} is referenced in many classes in HoS, and I'm not sure how much code has changed since the last merge with master, so we may get many conflicts during merging if we remove {{SparkCounters}} on master. I think we can just do this in the spark branch; although {{org.apache.hadoop.hive.ql.stats.CounterStatsAggregatorSpark}} has been removed, it should be a quite simple conflict to resolve during the merge. > Clean up the SparkCounters-related code after removing counter-based stats > collection [Spark Branch] > -- > > Key: HIVE-12515 > URL: https://issues.apache.org/jira/browse/HIVE-12515 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12515) Clean up the SparkCounters-related code after removing counter-based stats collection [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-12515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029540#comment-15029540 ] Chengxiang Li commented on HIVE-12515: -- [~lirui], {{org.apache.hadoop.hive.ql.stats.CounterStatsAggregatorSpark}} is configured by class name, in a dynamic-injection style, so there is no compile-time dependency on it; it should be safe to remove. > Clean up the SparkCounters-related code after removing counter-based stats > collection [Spark Branch] > -- > > Key: HIVE-12515 > URL: https://issues.apache.org/jira/browse/HIVE-12515 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
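The no-compile-time-dependency point can be seen in a minimal reflection example: the aggregator is instantiated from a configured class-name string, so deleting the implementation class only makes the runtime lookup fail. Class names below are stand-ins, not Hive's actual configuration keys:

```java
public class DynamicInjectionDemo {
    // Returns an instance of the named class, or null if it cannot be loaded --
    // e.g. because the class was removed, as with CounterStatsAggregatorSpark.
    public static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            return null;
        }
    }
}
```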
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025999#comment-15025999 ] Chengxiang Li commented on HIVE-12466: -- SparkCounters is only used for stats collection now, so yes, I think we will not need SparkCounters anymore once counter-based stats collection is removed. As far as I know, no other Hive feature depends on SparkCounters. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 > Project: Hive > Issue Type: Bug > Components: Spark > Reporter: Rui Li > Assignee: Rui Li > Attachments: HIVE-12466.1-spark.patch > > > During a query, many of the following errors are found in the executor's log: > {noformat} > 03:47:28.759 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:28.762 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:30.707 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.tmp_tmp] has not initialized before. > 03:47:33.385 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:33.388 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:33.495 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:35.141 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > ... 
> {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026034#comment-15026034 ] Chengxiang Li commented on HIVE-12466: -- Yes, it does; at least at the time I implemented the counter-based stats collection for Spark, it did not relate to any other part of our work on HoS, so I assume it should work just as well now. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026081#comment-15026081 ] Chengxiang Li commented on HIVE-12466: -- Committed to the spark branch; thanks, Rui, for this contribution. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026089#comment-15026089 ] Chengxiang Li commented on HIVE-12466: -- HIVE-12515 has been created for the follow-up cleanup work. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024042#comment-15024042 ] Chengxiang Li commented on HIVE-12466: -- LGTM; waiting for the testing. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023574#comment-15023574 ] Chengxiang Li commented on HIVE-12466: -- Yes, [~lirui], the suffix is available in the operator conf. Since I haven't worked on HoS recently, it would take me some time to prepare a test environment; would you mind putting together a quick fix for this issue? I can do the review. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 > Project: Hive > Issue Type: Bug > Components: Spark >Reporter: Rui Li >Assignee: Xuefu Zhang > > During a query, lots of the following error found in executor's log: > {noformat} > 03:47:28.759 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:28.762 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:30.707 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.tmp_tmp] has not initialized before. > 03:47:33.385 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:33.388 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:33.495 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:35.141 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12466) SparkCounter not initialized error
[ https://issues.apache.org/jira/browse/HIVE-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021611#comment-15021611 ] Chengxiang Li commented on HIVE-12466: -- [~xuefuz], due to a limitation of Spark accumulators, `SparkCounter` has to register each counter name before job execution. The error message shows that the specified counter name was not registered beforehand. By default, all the default Spark counters are collected via `SparkTask::getCounterPrefixes()`; `RECORDS_OUT_0`, `RECORDS_OUT_1_default.tmp_tmp` and `RECORDS_OUT_1_default.test_table` are not included. It seems the counter logic changed in `ReduceSinkOperator` and `FileSinkOperator`, so we need to update the logic of `SparkTask::getOperatorCounters`. > SparkCounter not initialized error > -- > > Key: HIVE-12466 > URL: https://issues.apache.org/jira/browse/HIVE-12466 > Project: Hive > Issue Type: Bug > Components: Spark >Reporter: Rui Li >Assignee: Xuefu Zhang > > During a query, lots of the following error found in executor's log: > {noformat} > 03:47:28.759 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:28.762 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, RECORDS_OUT_0] > has not initialized before. > 03:47:30.707 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.tmp_tmp] has not initialized before. > 03:47:33.385 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:33.388 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. 
> 03:47:33.495 [Executor task launch worker-0] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > 03:47:35.141 [Executor task launch worker-1] ERROR > org.apache.hive.spark.counter.SparkCounters - counter[HIVE, > RECORDS_OUT_1_default.test_table] has not initialized before. > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
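The register-before-use contract described in the comment above can be sketched as follows. This is a hypothetical illustration, not Hive's actual SparkCounters code: the names `CounterRegistry`, `register`, and `increment` are invented for this sketch. The point is only that incrementing a counter name that was never registered up front is reported as an error, which is exactly the symptom in the executor log.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the register-before-use contract: counter names
// must be registered before the job runs; incrementing an unregistered
// name fails, mirroring the "has not initialized before" errors above.
public class CounterRegistry {
    private final Map<String, Long> counters = new HashMap<>();

    // Called on the driver side, before job submission.
    public void register(String name) {
        counters.putIfAbsent(name, 0L);
    }

    // Called from tasks; returns false (where Hive would log an error)
    // when the counter was never registered.
    public boolean increment(String name, long delta) {
        Long current = counters.get(name);
        if (current == null) {
            return false; // "counter[...] has not initialized before."
        }
        counters.put(name, current + delta);
        return true;
    }

    public long value(String name) {
        return counters.getOrDefault(name, 0L);
    }
}
```

Under this sketch, the fix discussed in the thread amounts to making sure the name-generation logic on the registration side matches what the operators actually increment.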
[jira] [Commented] (HIVE-11533) Loop optimization for SIMD in integer comparisons
[ https://issues.apache.org/jira/browse/HIVE-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954723#comment-14954723 ] Chengxiang Li commented on HIVE-11533: -- Committed to master branch, thanks for the contribution, [~teddy.choi]. > Loop optimization for SIMD in integer comparisons > - > > Key: HIVE-11533 > URL: https://issues.apache.org/jira/browse/HIVE-11533 > Project: Hive > Issue Type: Sub-task > Components: Vectorization >Reporter: Teddy Choi >Assignee: Teddy Choi >Priority: Minor > Attachments: HIVE-11533.1.patch, HIVE-11533.2.patch, > HIVE-11533.3.patch, HIVE-11533.4.patch, HIVE-11533.5.patch > > > Long*CompareLong* classes can be optimized with subtraction and bitwise > operators for better SIMD optimization. > {code} > for(int i = 0; i != n; i++) { > outputVector[i] = vector1[0] > vector2[i] ? 1 : 0; > } > {code} > This issue will cover following classes; > - LongColEqualLongColumn > - LongColNotEqualLongColumn > - LongColGreaterLongColumn > - LongColGreaterEqualLongColumn > - LongColLessLongColumn > - LongColLessEqualLongColumn > - LongScalarEqualLongColumn > - LongScalarNotEqualLongColumn > - LongScalarGreaterLongColumn > - LongScalarGreaterEqualLongColumn > - LongScalarLessLongColumn > - LongScalarLessEqualLongColumn > - LongColEqualLongScalar > - LongColNotEqualLongScalar > - LongColGreaterLongScalar > - LongColGreaterEqualLongScalar > - LongColLessLongScalar > - LongColLessEqualLongScalar -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11533) Loop optimization for SIMD in integer comparisons
[ https://issues.apache.org/jira/browse/HIVE-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11533: - Fix Version/s: 2.0.0 > Loop optimization for SIMD in integer comparisons > - > > Key: HIVE-11533 > URL: https://issues.apache.org/jira/browse/HIVE-11533 > Project: Hive > Issue Type: Sub-task > Components: Vectorization >Reporter: Teddy Choi >Assignee: Teddy Choi >Priority: Minor > Fix For: 2.0.0 > > Attachments: HIVE-11533.1.patch, HIVE-11533.2.patch, > HIVE-11533.3.patch, HIVE-11533.4.patch, HIVE-11533.5.patch > > > Long*CompareLong* classes can be optimized with subtraction and bitwise > operators for better SIMD optimization. > {code} > for(int i = 0; i != n; i++) { > outputVector[i] = vector1[0] > vector2[i] ? 1 : 0; > } > {code} > This issue will cover following classes; > - LongColEqualLongColumn > - LongColNotEqualLongColumn > - LongColGreaterLongColumn > - LongColGreaterEqualLongColumn > - LongColLessLongColumn > - LongColLessEqualLongColumn > - LongScalarEqualLongColumn > - LongScalarNotEqualLongColumn > - LongScalarGreaterLongColumn > - LongScalarGreaterEqualLongColumn > - LongScalarLessLongColumn > - LongScalarLessEqualLongColumn > - LongColEqualLongScalar > - LongColNotEqualLongScalar > - LongColGreaterLongScalar > - LongColGreaterEqualLongScalar > - LongColLessLongScalar > - LongColLessEqualLongScalar -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11533) Loop optimization for SIMD in integer comparisons
[ https://issues.apache.org/jira/browse/HIVE-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952519#comment-14952519 ] Chengxiang Li commented on HIVE-11533: -- +1 > Loop optimization for SIMD in integer comparisons > - > > Key: HIVE-11533 > URL: https://issues.apache.org/jira/browse/HIVE-11533 > Project: Hive > Issue Type: Sub-task > Components: Vectorization >Reporter: Teddy Choi >Assignee: Teddy Choi >Priority: Minor > Attachments: HIVE-11533.1.patch, HIVE-11533.2.patch, > HIVE-11533.3.patch, HIVE-11533.4.patch, HIVE-11533.5.patch > > > Long*CompareLong* classes can be optimized with subtraction and bitwise > operators for better SIMD optimization. > {code} > for(int i = 0; i != n; i++) { > outputVector[i] = vector1[0] > vector2[i] ? 1 : 0; > } > {code} > This issue will cover following classes; > - LongColEqualLongColumn > - LongColNotEqualLongColumn > - LongColGreaterLongColumn > - LongColGreaterEqualLongColumn > - LongColLessLongColumn > - LongColLessEqualLongColumn > - LongScalarEqualLongColumn > - LongScalarNotEqualLongColumn > - LongScalarGreaterLongColumn > - LongScalarGreaterEqualLongColumn > - LongScalarLessLongColumn > - LongScalarLessEqualLongColumn > - LongColEqualLongScalar > - LongColNotEqualLongScalar > - LongColGreaterLongScalar > - LongColGreaterEqualLongScalar > - LongColLessLongScalar > - LongColLessEqualLongScalar -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11533) Loop optimization for SIMD in integer comparisons
[ https://issues.apache.org/jira/browse/HIVE-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947998#comment-14947998 ] Chengxiang Li commented on HIVE-11533: -- Very nice job; the patch looks good. Just one thing to note: I guess the performance data was measured with "selectedInUse" false. When "selectedInUse" is true, the loop cannot benefit from SIMD instructions, and in my previous experience the optimization can sometimes even degrade performance in that case. Have you verified that? > Loop optimization for SIMD in integer comparisons > - > > Key: HIVE-11533 > URL: https://issues.apache.org/jira/browse/HIVE-11533 > Project: Hive > Issue Type: Sub-task > Components: Vectorization >Reporter: Teddy Choi >Assignee: Teddy Choi >Priority: Minor > Attachments: HIVE-11533.1.patch, HIVE-11533.2.patch, > HIVE-11533.3.patch, HIVE-11533.4.patch > > > Long*CompareLong* classes can be optimized with subtraction and bitwise > operators for better SIMD optimization. > {code} > for(int i = 0; i != n; i++) { > outputVector[i] = vector1[0] > vector2[i] ? 1 : 0; > } > {code} > This issue will cover following classes; > - LongColEqualLongColumn > - LongColNotEqualLongColumn > - LongColGreaterLongColumn > - LongColGreaterEqualLongColumn > - LongColLessLongColumn > - LongColLessEqualLongColumn > - LongScalarEqualLongColumn > - LongScalarNotEqualLongColumn > - LongScalarGreaterLongColumn > - LongScalarGreaterEqualLongColumn > - LongScalarLessLongColumn > - LongScalarLessEqualLongColumn > - LongColEqualLongScalar > - LongColNotEqualLongScalar > - LongColGreaterLongScalar > - LongColGreaterEqualLongScalar > - LongColLessLongScalar > - LongColLessEqualLongScalar -- This message was sent by Atlassian JIRA (v6.3.4#6332)
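The transformation this issue describes, replacing the `? 1 : 0` conditional with subtraction and a shift so the loop has no data-dependent branch and the JIT can auto-vectorize it, can be sketched like this. This is an illustrative sketch, not the committed patch: the class and method names are made up, and it assumes the operands are far enough from Long.MIN_VALUE/MAX_VALUE that the subtraction cannot overflow (the real vectorized expressions carry the same caveat).

```java
// Branch-free long comparisons: the sign bit of (b - a), extracted with the
// unsigned right shift >>> 63, is 1 exactly when a > b (absent overflow).
// Loops written this way contain no branch, which helps SIMD generation.
public class SimdCompareSketch {
    // out[i] = scalar > vec[i] ? 1 : 0, without a conditional.
    public static long[] scalarGreaterCol(long scalar, long[] vec) {
        long[] out = new long[vec.length];
        for (int i = 0; i != vec.length; i++) {
            out[i] = (vec[i] - scalar) >>> 63;
        }
        return out;
    }

    // out[i] = a[i] == b[i] ? 1 : 0:
    // (a-b) | (b-a) has its sign bit set iff a != b, so shift and flip.
    public static long[] colEqualCol(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i != a.length; i++) {
            out[i] = (((a[i] - b[i]) | (b[i] - a[i])) >>> 63) ^ 1;
        }
        return out;
    }
}
```

The other comparison variants listed in the issue description follow the same pattern with the operand order or the final flip changed.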
[jira] [Commented] (HIVE-10238) Loop optimization for SIMD in IfExprColumnColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706287#comment-14706287 ] Chengxiang Li commented on HIVE-10238: -- +1 Loop optimization for SIMD in IfExprColumnColumn.txt Key: HIVE-10238 URL: https://issues.apache.org/jira/browse/HIVE-10238 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Teddy Choi Priority: Minor Attachments: HIVE-10238.2.patch, HIVE-10238.patch The ?: operator as following could not be vectorized in loop, we may transfer it into mathematical expression. {code:java} for(int j = 0; j != n; j++) { int i = sel[j]; outputVector[i] = (vector1[i] == 1 ? vector2[i] : vector3[i]); outputIsNull[i] = (vector1[i] == 1 ? arg2ColVector.isNull[i] : arg3ColVector.isNull[i]); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10179) Optimization for SIMD instructions in Hive
[ https://issues.apache.org/jira/browse/HIVE-10179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681520#comment-14681520 ] Chengxiang Li commented on HIVE-10179: -- Yes, [~teddy.choi], contributions are welcome; feel free to create new subtasks and contribute. Optimization for SIMD instructions in Hive -- Key: HIVE-10179 URL: https://issues.apache.org/jira/browse/HIVE-10179 Project: Hive Issue Type: Improvement Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: optimization [SIMD|http://en.wikipedia.org/wiki/SIMD] instructions could be found in most of current CPUs, such as Intel's SSE2, SSE3, SSE4.x, AVX and AVX2, and it would help Hive to outperform if we can vectorize the mathematical manipulation part of Hive. This umbrella JIRA may contain, but is not limited to, subtasks like: # Code schema adaption, current JVM is quite strictly on the code schema which could be transformed into SIMD instructions during execution. # New implementation of mathematical manipulation part of Hive which designed to be optimized for SIMD instructions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10238) Loop optimization for SIMD in IfExprColumnColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681506#comment-14681506 ] Chengxiang Li commented on HIVE-10238: -- Thanks for looking at this, [~teddy.choi]. I tried using bitwise operators before as well and found they did not perform better, but I'm not sure whether my approach was the same as yours; you could add new benchmark tests to verify that. If you are interested in this issue, please feel free to reassign it to yourself and keep working on it. Loop optimization for SIMD in IfExprColumnColumn.txt Key: HIVE-10238 URL: https://issues.apache.org/jira/browse/HIVE-10238 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor The ?: operator as following could not be vectorized in loop, we may transfer it into mathematical expression. {code:java} for(int j = 0; j != n; j++) { int i = sel[j]; outputVector[i] = (vector1[i] == 1 ? vector2[i] : vector3[i]); outputIsNull[i] = (vector1[i] == 1 ? arg2ColVector.isNull[i] : arg3ColVector.isNull[i]); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
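The rewrite the HIVE-10238 description asks for, turning the `?:` in the inner loop into straight-line bitwise arithmetic, can be sketched as below. This is an illustrative sketch rather than the committed patch; the class and method names are invented, and it assumes the selector column holds only 0/1 values, as the boolean input column does in the original loop.

```java
// Branch-free select: -sel is all-ones when sel == 1 and all-zeros when
// sel == 0, so (thenVal & mask) | (elseVal & ~mask) picks one operand
// without the ?: branch that blocks vectorization.
public class IfExprSketch {
    public static long[] select(long[] selector, long[] thenVec, long[] elseVec) {
        long[] out = new long[selector.length];
        for (int i = 0; i != selector.length; i++) {
            long mask = -selector[i]; // assumes selector[i] is 0 or 1
            out[i] = (thenVec[i] & mask) | (elseVec[i] & ~mask);
        }
        return out;
    }
}
```

The same masking trick applies to the `outputIsNull` assignment in the quoted loop, since null flags can be widened to 0/1 integers first.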
[jira] [Commented] (HIVE-10238) Loop optimization for SIMD in IfExprColumnColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692798#comment-14692798 ] Chengxiang Li commented on HIVE-10238: -- Hi, [~teddy.choi], could you upload the patch to Review Board for better review? Loop optimization for SIMD in IfExprColumnColumn.txt Key: HIVE-10238 URL: https://issues.apache.org/jira/browse/HIVE-10238 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Teddy Choi Priority: Minor Attachments: HIVE-10238.2.patch, HIVE-10238.patch The ?: operator as following could not be vectorized in loop, we may transfer it into mathematical expression. {code:java} for(int j = 0; j != n; j++) { int i = sel[j]; outputVector[i] = (vector1[i] == 1 ? vector2[i] : vector3[i]); outputIsNull[i] = (vector1[i] == 1 ? arg2ColVector.isNull[i] : arg3ColVector.isNull[i]); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11276) Optimization around job submission and adding jars [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630682#comment-14630682 ] Chengxiang Li commented on HIVE-11276: -- [~xuefuz], I reviewed the code in RemoteHiveSparkClient. The reason it needs to invoke refreshLocalResources() for every job submission is that a Hive user may use the ADD \[FILE|JAR|ARCHIVE\] command to add resources at runtime, so the Spark client needs to upload these resources to the Spark cluster before job execution. RemoteHiveSparkClient keeps a list recording all the resources it has already uploaded to the Spark cluster and uses it to filter out already-uploaded jars during refreshLocalResources(); only newly added jars are uploaded, and the list should stay quite small most of the time, so I don't think there is a performance issue here. Optimization around job submission and adding jars [Spark Branch] - Key: HIVE-11276 URL: https://issues.apache.org/jira/browse/HIVE-11276 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Chengxiang Li It seems that Hive on Spark has some room for performance improvement on job submission. Specifically, we are calling refreshLocalResources() for every job submission despite there is are no changes in the jar list. Since Hive on Spark is reusing the containers in the whole user session, we might be able to optimize that. We do need to take into consideration the case of dynamic allocation, in which new executors might be added. This task is some RD in this area. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
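The bookkeeping described in the comment above, a record of already-uploaded resources that filters each refresh down to only the new entries, can be sketched like this. The class and method names (`ResourceTracker`, `filterNew`) are invented for illustration; RemoteHiveSparkClient's actual implementation differs in the details.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: track what has already been shipped to the cluster so
// each refreshLocalResources() call only uploads resources not seen before.
public class ResourceTracker {
    private final Set<String> uploaded = new HashSet<>();

    // Returns the resources that still need uploading and marks them as sent.
    public List<String> filterNew(List<String> requested) {
        List<String> toUpload = new ArrayList<>();
        for (String r : requested) {
            if (uploaded.add(r)) { // Set.add() returns false if already present
                toUpload.add(r);
            }
        }
        return toUpload;
    }
}
```

With a small set like this, the per-submission cost is a handful of hash lookups, which supports the comment's argument that the repeated refresh is cheap when the jar list is unchanged.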
[jira] [Commented] (HIVE-11276) Optimization around job submission and adding jars [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630791#comment-14630791 ] Chengxiang Li commented on HIVE-11276: -- That makes sense to me; launching the Spark cluster during the first query execution would mislead users into thinking Hive on Spark is slower than it actually is. Besides, we could also open the Spark session when the user sets hive.execution.engine to spark. Optimization around job submission and adding jars [Spark Branch] - Key: HIVE-11276 URL: https://issues.apache.org/jira/browse/HIVE-11276 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Chengxiang Li It seems that Hive on Spark has some room for performance improvement on job submission. Specifically, we are calling refreshLocalResources() for every job submission despite there is are no changes in the jar list. Since Hive on Spark is reusing the containers in the whole user session, we might be able to optimize that. We do need to take into consideration the case of dynamic allocation, in which new executors might be added. This task is some RD in this area. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HIVE-11276) Optimization around job submission and adding jars [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li reassigned HIVE-11276: Assignee: Chengxiang Li Optimization around job submission and adding jars [Spark Branch] - Key: HIVE-11276 URL: https://issues.apache.org/jira/browse/HIVE-11276 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Chengxiang Li It seems that Hive on Spark has some room for performance improvement on job submission. Specifically, we are calling refreshLocalResources() for every job submission despite there is are no changes in the jar list. Since Hive on Spark is reusing the containers in the whole user session, we might be able to optimize that. We do need to take into consideration the case of dynamic allocation, in which new executors might be added. This task is some RD in this area. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11276) Optimization around job submission and adding jars [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630689#comment-14630689 ] Chengxiang Li commented on HIVE-11276: -- Besides, for the case of dynamic allocation, I'm not sure whether it would be influenced by this. From my point of view, since we use Spark APIs like SparkContext::addJar()/addFile() to upload resources to the Spark cluster, it should afterwards be Spark's responsibility to make sure its executor JVMs load these resources. In my previous tests of dynamic allocation, everything worked well. Optimization around job submission and adding jars [Spark Branch] - Key: HIVE-11276 URL: https://issues.apache.org/jira/browse/HIVE-11276 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Chengxiang Li It seems that Hive on Spark has some room for performance improvement on job submission. Specifically, we are calling refreshLocalResources() for every job submission despite there is are no changes in the jar list. Since Hive on Spark is reusing the containers in the whole user session, we might be able to optimize that. We do need to take into consideration the case of dynamic allocation, in which new executors might be added. This task is some RD in this area. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11267) Combine equavilent leaf works in SparkWork[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629069#comment-14629069 ] Chengxiang Li commented on HIVE-11267: -- [~xuefuz], I took a look at the FileSinkOperator implementation before; the write logic is quite complicated, and writing multiple times would break several of its design rules. I don't want to change FileSinkOperator a lot for this special-case optimization. Fetching twice would be just a few lines of code change and more efficient (the SparkWork only writes once). Actually, we can check for the existence of a FetchTask; if it does not exist, we can skip this optimization. Combine equavilent leaf works in SparkWork[Spark Branch] Key: HIVE-11267 URL: https://issues.apache.org/jira/browse/HIVE-11267 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor There could be multiple leaf works in a SparkWork, as in a self-union query. If the subqueries are the same, we may combine them, execute only once, and then fetch twice in the FetchTask. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11082: - Attachment: HIVE-11082.3-spark.patch fix nit format issue. Support multi edge between nodes in SparkPlan[Spark Branch] --- Key: HIVE-11082 URL: https://issues.apache.org/jira/browse/HIVE-11082 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-11082.1-spark.patch, HIVE-11082.2-spark.patch, HIVE-11082.3-spark.patch For Dynamic RDD caching optimization, we found SparkPlan::connect throw exception while we try to combine 2 works with same child, support multi edge between nodes in SparkPlan would help to enable dynamic RDD caching in more use cases, like self join and self union. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11082: - Fix Version/s: spark-branch Support multi edge between nodes in SparkPlan[Spark Branch] --- Key: HIVE-11082 URL: https://issues.apache.org/jira/browse/HIVE-11082 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Fix For: spark-branch Attachments: HIVE-11082.1-spark.patch, HIVE-11082.2-spark.patch, HIVE-11082.3-spark.patch For Dynamic RDD caching optimization, we found SparkPlan::connect throw exception while we try to combine 2 works with same child, support multi edge between nodes in SparkPlan would help to enable dynamic RDD caching in more use cases, like self join and self union. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11204) Research on recent failed qtests[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627641#comment-14627641 ] Chengxiang Li commented on HIVE-11204: -- OK, I will leave this issue open until the next merge from master, and verify it after the merge. Research on recent failed qtests[Spark Branch] -- Key: HIVE-11204 URL: https://issues.apache.org/jira/browse/HIVE-11204 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-11204.1-spark.patch, HIVE-11204.1-spark.patch Found some strange failed qtests in HIVE-11053 Hive QA, as it's pretty sure that failed qtests are not related to HIVE-11053 patch, so just reproduce and research it here. Failed tests: org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_bigdata org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_resolution org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_sort_1_23 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join_literals org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_mapreduce1 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_15 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_19 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_4 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_8 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_view -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11082: - Attachment: HIVE-11082.2-spark.patch update related qtest output. Support multi edge between nodes in SparkPlan[Spark Branch] --- Key: HIVE-11082 URL: https://issues.apache.org/jira/browse/HIVE-11082 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-11082.1-spark.patch, HIVE-11082.2-spark.patch For Dynamic RDD caching optimization, we found SparkPlan::connect throw exception while we try to combine 2 works with same child, support multi edge between nodes in SparkPlan would help to enable dynamic RDD caching in more use cases, like self join and self union. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11204) Research on recent failed qtests[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11204: - Attachment: HIVE-11204.1-spark.patch Research on recent failed qtests[Spark Branch] -- Key: HIVE-11204 URL: https://issues.apache.org/jira/browse/HIVE-11204 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-11204.1-spark.patch, HIVE-11204.1-spark.patch Found some strange failed qtests in HIVE-11053 Hive QA, as it's pretty sure that failed qtests are not related to HIVE-11053 patch, so just reproduce and research it here. Failed tests: org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_bigdata org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_resolution org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_sort_1_23 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join_literals org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_mapreduce1 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_15 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_19 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_4 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_8 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_view -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11082: - Attachment: HIVE-11082.1-spark.patch SparkPlan supports multiple edges between nodes by default; just remove the check in SparkPlan::connect. But self join/union does not actually benefit from RDD caching with this patch, as self join/union assigns different alias names to the source table, which makes the ReduceSinkOperators in the different MapWorks not equal to each other. Support multi edge between nodes in SparkPlan[Spark Branch] --- Key: HIVE-11082 URL: https://issues.apache.org/jira/browse/HIVE-11082 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-11082.1-spark.patch For Dynamic RDD caching optimization, we found SparkPlan::connect throw exception while we try to combine 2 works with same child, support multi edge between nodes in SparkPlan would help to enable dynamic RDD caching in more use cases, like self join and self union. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627414#comment-14627414 ] Chengxiang Li commented on HIVE-11082: -- There seem to be some failed tests, [~xuefuz]; I will check what's going on. Support multi edge between nodes in SparkPlan[Spark Branch] --- Key: HIVE-11082 URL: https://issues.apache.org/jira/browse/HIVE-11082 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-11082.1-spark.patch For Dynamic RDD caching optimization, we found SparkPlan::connect throw exception while we try to combine 2 works with same child, support multi edge between nodes in SparkPlan would help to enable dynamic RDD caching in more use cases, like self join and self union. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-11204) Research on recent failed qtests[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627413#comment-14627413 ] Chengxiang Li commented on HIVE-11204: -- [~xuefuz], all of the above initializationErrors are due to some kind of missing-file issue, like: {code:java} java.io.FileNotFoundException: /data/hive-ptest/working/apache-git-source-source/itests/qtest/target/generated-test-sources/java/org/apache/hadoop/hive/cli/TestCliDriverQFileNames.txt {code} These files should be generated during the Maven generate-test-sources phase. I cannot reproduce these issues in my local environment, and it does not look like a Hive logic error; do you have any idea why these issues happen? Research on recent failed qtests[Spark Branch] -- Key: HIVE-11204 URL: https://issues.apache.org/jira/browse/HIVE-11204 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-11204.1-spark.patch, HIVE-11204.1-spark.patch Found some strange failed qtests in HIVE-11053 Hive QA, as it's pretty sure that failed qtests are not related to HIVE-11053 patch, so just reproduce and research it here. 
> Failed tests:
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_bigdata
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_resolution
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_sort_1_23
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join_literals
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_mapreduce1
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt2
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_15
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_19
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_4
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_8
> org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_view
[jira] [Commented] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627405#comment-14627405 ] Chengxiang Li commented on HIVE-11082: -- It would be easy to ignore the alias name during the comparison; what stops me from doing that is the execution logic afterward. The downstream operators distinguish different inputs by alias name, since they are logically different tables, so we would lose the alias information if we combined the MapWorks. One possible optimization is to cut the ReduceSinkOperator into a separate MapWork, so that we could cache the previous MapWork, which includes the operator chain before the ReduceSinkOperator. This optimization requires Hive on Spark to support appendable MapWorks, like MapWork -- MapWork -- ReduceWork, or MapWork -- ReduceWork -- MapWork.
> Support multi edge between nodes in SparkPlan[Spark Branch]
> Key: HIVE-11082
> URL: https://issues.apache.org/jira/browse/HIVE-11082
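To illustrate the SparkPlan::connect failure discussed above, here is a minimal, hypothetical sketch (not Hive's actual SparkPlan API; class and method names are illustrative). It contrasts a connect() that rejects a duplicate (parent, child) pair with one that stores edges in a list and therefore tolerates multiple edges between the same two nodes, which is what a combined work with a shared child needs, e.g. in a self-join:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Hive's real SparkPlan. A strict plan rejects a
// second edge between the same parent and child; a multi-edge plan keeps
// edges in a list, so connecting the same pair twice is allowed.
class MiniPlan {
    static final class Edge {
        final String parent, child;
        Edge(String parent, String child) { this.parent = parent; this.child = child; }
    }

    private final List<Edge> edges = new ArrayList<>();
    private final boolean allowMultiEdge;

    MiniPlan(boolean allowMultiEdge) { this.allowMultiEdge = allowMultiEdge; }

    void connect(String parent, String child) {
        if (!allowMultiEdge) {
            for (Edge e : edges) {
                if (e.parent.equals(parent) && e.child.equals(child)) {
                    // This mirrors the exception seen when two combined
                    // works share the same child.
                    throw new IllegalStateException(
                        "edge already exists: " + parent + " -> " + child);
                }
            }
        }
        edges.add(new Edge(parent, child));
    }

    int edgeCount(String parent, String child) {
        int n = 0;
        for (Edge e : edges) {
            if (e.parent.equals(parent) && e.child.equals(child)) n++;
        }
        return n;
    }
}
```

In a self-join, after the equivalent MapWorks are combined, both join inputs come from the same MapWork, so connect is called twice for the same (map, reducer) pair; the edge-list representation simply records two edges.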
[jira] [Assigned] (HIVE-11204) Research on recent failed qtests[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li reassigned HIVE-11204: Assignee: Chengxiang Li
> Research on recent failed qtests[Spark Branch]
> Key: HIVE-11204
> URL: https://issues.apache.org/jira/browse/HIVE-11204
[jira] [Updated] (HIVE-11204) Research on recent failed qtests[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11204: - Attachment: HIVE-11204.1-spark.patch Cannot reproduce this locally now; uploading an empty patch to re-run verification.
> Research on recent failed qtests[Spark Branch]
> Key: HIVE-11204
> URL: https://issues.apache.org/jira/browse/HIVE-11204
[jira] [Assigned] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li reassigned HIVE-11082: Assignee: Chengxiang Li
> Support multi edge between nodes in SparkPlan[Spark Branch]
> Key: HIVE-11082
> URL: https://issues.apache.org/jira/browse/HIVE-11082
[jira] [Commented] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618401#comment-14618401 ] Chengxiang Li commented on HIVE-11053: -- Committed to the Spark branch; thanks [~gallenvara_bg] for the contribution.
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Affects Versions: spark-branch
> Reporter: Chengxiang Li
> Assignee: GaoLun
> Priority: Minor
> Fix For: spark-branch
> Attachments: HIVE-11053.1-spark.patch, HIVE-11053.2-spark.patch, HIVE-11053.3-spark.patch, HIVE-11053.4-spark.patch, HIVE-11053.5-spark.patch, HIVE-11053.5-spark.patch
>
> Add some test cases for self-union, self-join, CTE, and repeated sub-queries to verify the work of combining equivalent works in HIVE-10844.
[jira] [Updated] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11053: - Fix Version/s: spark-branch
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Commented] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618235#comment-14618235 ] Chengxiang Li commented on HIVE-11053: -- The failed Spark tests should not be related to this patch; I will create another JIRA to track them.
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Commented] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618236#comment-14618236 ] Chengxiang Li commented on HIVE-11053: -- +1
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Updated] (HIVE-10850) Followup for HIVE-10550, check performance w.r.t. persistence level [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10850: - Assignee: GaoLun (was: Chengxiang Li)
> Followup for HIVE-10550, check performance w.r.t. persistence level [Spark Branch]
> Key: HIVE-10850
> URL: https://issues.apache.org/jira/browse/HIVE-10850
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Affects Versions: 1.2.0, 1.1.0
> Reporter: Xuefu Zhang
> Assignee: GaoLun
>
> In HIVE-10550, there was a discussion on the persistence level and whether we need to give users some control over it. This JIRA is to investigate further, especially by measuring performance under different conditions, and to continue the discussion.
[jira] [Updated] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11053: - Attachment: HIVE-11053.5-spark.patch Uploaded the patch again to relaunch the unit tests.
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607606#comment-14607606 ] Chengxiang Li commented on HIVE-11095: -- Hi [~xiaowei], after getting a +1, a patch needs to wait 24 hours before being committed, to make sure others have an opportunity to review as well; that is just the way the community works. The patch looks good.
> SerDeUtils another bug ,when Text is reused
> Key: HIVE-11095
> URL: https://issues.apache.org/jira/browse/HIVE-11095
> Project: Hive
> Issue Type: Bug
> Components: API, CLI
> Affects Versions: 0.14.0, 1.0.0, 1.2.0
> Environment: Hadoop 2.3.0-cdh5.0.0, Hive 0.14
> Reporter: xiaowei wang
> Assignee: xiaowei wang
> Fix For: 2.0.0
> Attachments: HIVE-11095.1.patch.txt, HIVE-11095.2.patch.txt, HIVE-11095.3.patch.txt
>
> {noformat} The method transformTextFromUTF8 invokes a problematic method of Text, getBytes(). Text.getBytes() returns the raw bytes; however, only data up to Text.length is valid. A better way is to use copyBytes() if you need the returned array to be precisely the length of the data, but copyBytes() was only added after hadoop1. {noformat}
> How I found this bug: when I queried data from an LZO table, I found in the results that the length of the current row was always larger than that of the previous row, and sometimes the current row contained the contents of the previous row. For example, I executed the SQL {code:sql} select * from web_searchhub where logdate=2015061003 {code} and the result is below. Notice that the second row's content contains the first row's content. {noformat} INFO [03:00:05.589] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254 2015061003 INFO [03:00:05.594] 18941e66-9962-44ad-81bc-3519f47ba274 session=901,thread=223ession=3151,thread=254 2015061003 {noformat} The content of the original LZO file is below, just 2 rows.
> {noformat} INFO [03:00:05.635] b88e0473-7530-494c-82d8-e2d2ebd2666c_forweb session=3148,thread=285 INFO [03:00:05.635] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285 {noformat} I think this error is caused by the Text reuse, and I found a solution. Additionally, the table create SQL is: {code:sql} CREATE EXTERNAL TABLE `web_searchhub`( `line` string) PARTITIONED BY ( `logdate` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\U' WITH SERDEPROPERTIES ( 'serialization.encoding'='GBK') STORED AS INPUTFORMAT com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat; LOCATION 'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub' {code}
[jira] [Commented] (HIVE-11138) Query fails when there isn't a comparator for an operator [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607574#comment-14607574 ] Chengxiang Li commented on HIVE-11138: -- +1, patch LGTM.
> Query fails when there isn't a comparator for an operator [Spark Branch]
> Key: HIVE-11138
> URL: https://issues.apache.org/jira/browse/HIVE-11138
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Rui Li
> Assignee: Rui Li
> Attachments: HIVE-11138.1-spark.patch
>
> In such cases, OperatorComparatorFactory should default to false instead of throwing an exception.
[jira] [Commented] (HIVE-10983) SerDeUtils bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-10983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602551#comment-14602551 ] Chengxiang Li commented on HIVE-10983: -- Nice find; thanks for working on this issue, [~xiaowei]. For the patch, do you think we can just use {code:java} return new Text(new String(text.getBytes(), 0, text.getLength(), previousCharset)) {code} so that we do not need the extra memory copy introduced in the patch?
> SerDeUtils bug ,when Text is reused
> Key: HIVE-10983
> URL: https://issues.apache.org/jira/browse/HIVE-10983
> Project: Hive
> Issue Type: Bug
> Components: API, CLI
> Affects Versions: 0.14.0, 1.0.0, 1.2.0
> Environment: Hadoop 2.3.0-cdh5.0.0, Hive 0.14
> Reporter: xiaowei wang
> Assignee: xiaowei wang
> Labels: patch
> Fix For: 0.14.1, 1.2.0
> Attachments: HIVE-10983.1.patch.txt, HIVE-10983.2.patch.txt
>
> {noformat} The method transformTextToUTF8 invokes a problematic method of Text, getBytes(). Text.getBytes() returns the raw bytes; however, only data up to Text.length is valid. A better way is to use copyBytes() if you need the returned array to be precisely the length of the data, but copyBytes() was only added after hadoop1. {noformat}
> When I queried data from an LZO table, I found in the results that the length of the current row was always larger than that of the previous row, and sometimes the current row contained the contents of the previous row. For example, I executed the SQL {code:sql} select * from web_searchhub where logdate=2015061003 {code} and the result is below. Notice that the second row's content contains the first row's content. {noformat} INFO [03:00:05.589] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254 2015061003 INFO [03:00:05.594] 18941e66-9962-44ad-81bc-3519f47ba274 session=901,thread=223ession=3151,thread=254 2015061003 {noformat} The content of the original LZO file is below, just 2 rows.
> {noformat} INFO [03:00:05.635] b88e0473-7530-494c-82d8-e2d2ebd2666c_forweb session=3148,thread=285 INFO [03:00:05.635] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285 {noformat} I think this error is caused by the Text reuse, and I found a solution. Additionally, the table create SQL is: {code:sql} CREATE EXTERNAL TABLE `web_searchhub`( `line` string) PARTITIONED BY ( `logdate` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\U' WITH SERDEPROPERTIES ( 'serialization.encoding'='GBK') STORED AS INPUTFORMAT com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat; LOCATION 'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub' {code}
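The getBytes()/getLength() pitfall described above can be reproduced without Hadoop. The sketch below uses a hypothetical stand-in for org.apache.hadoop.io.Text (ReusableText is not the real class, just enough to show the bug): the backing array only grows, so after a short record is read into a reused instance, getBytes() still contains trailing bytes of the previous, longer record, and only the first getLength() bytes are valid.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Hadoop's Text, reproducing the reuse bug:
// the buffer grows but never shrinks, and getBytes() returns the raw
// buffer rather than a copy trimmed to the valid length.
class ReusableText {
    private byte[] buf = new byte[0];
    private int length = 0;

    void set(String s, Charset cs) {
        byte[] b = s.getBytes(cs);
        if (b.length > buf.length) {
            buf = new byte[b.length];            // grow only; never shrink
        }
        System.arraycopy(b, 0, buf, 0, b.length); // stale tail bytes remain
        length = b.length;
    }

    byte[] getBytes() { return buf; }            // raw buffer, may exceed length
    int getLength()   { return length; }

    // Buggy transform: decodes the whole raw buffer, including stale bytes.
    String decodeBuggy(Charset cs) { return new String(getBytes(), cs); }

    // Fixed transform, per the suggestion in the comment above:
    // decode only the valid prefix [0, getLength()).
    String decodeFixed(Charset cs) {
        return new String(getBytes(), 0, getLength(), cs);
    }
}
```

After reusing the instance for a shorter record, decodeBuggy returns the short record with the tail of the previous record appended, exactly the symptom reported ("the current row contains the contents of the previous row"), while decodeFixed returns just the valid data.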
[jira] [Commented] (HIVE-10983) SerDeUtils bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-10983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602637#comment-14602637 ] Chengxiang Li commented on HIVE-10983: -- Great, [~xiaowei], let's wait for the unit test results. Besides, could you also verify it with your own test case?
> SerDeUtils bug ,when Text is reused
> Key: HIVE-10983
> URL: https://issues.apache.org/jira/browse/HIVE-10983
> Attachments: HIVE-10983.1.patch.txt, HIVE-10983.2.patch.txt, HIVE-10983.3.patch.txt, HIVE-10983.4.patch.txt
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602714#comment-14602714 ] Chengxiang Li commented on HIVE-11095: -- [~xiaowei], this should be the same issue as HIVE-10983; normally we prefer to handle it in a single JIRA. Would you like to merge this patch into HIVE-10983?
> SerDeUtils another bug ,when Text is reused
> Key: HIVE-11095
> URL: https://issues.apache.org/jira/browse/HIVE-11095
[jira] [Commented] (HIVE-10999) Upgrade Spark dependency to 1.4 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598935#comment-14598935 ] Chengxiang Li commented on HIVE-10999: -- The classpath update code change looks good to me; I'm +1 on this patch.
> Upgrade Spark dependency to 1.4 [Spark Branch]
> Key: HIVE-10999
> URL: https://issues.apache.org/jira/browse/HIVE-10999
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Xuefu Zhang
> Assignee: Rui Li
> Attachments: HIVE-10999.1-spark.patch, HIVE-10999.2-spark.patch, HIVE-10999.3-spark.patch, HIVE-10999.3-spark.patch
>
> Spark 1.4.0 is released. Let's update the dependency version from 1.3.1 to 1.4.0.
[jira] [Updated] (HIVE-10844) Combine equivalent Works for HoS[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10844: - Attachment: HIVE-10844.3-spark.patch
> Combine equivalent Works for HoS[Spark Branch]
> Key: HIVE-10844
> URL: https://issues.apache.org/jira/browse/HIVE-10844
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Chengxiang Li
> Assignee: Chengxiang Li
> Attachments: HIVE-10844.1-spark.patch, HIVE-10844.2-spark.patch, HIVE-10844.3-spark.patch
>
> Some Hive queries (like [TPCDS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql]) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork; combining these equivalent Works into a single one would help them benefit from the subsequent dynamic RDD caching optimization.
[jira] [Updated] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11053: - Assignee: GAOLUN
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Commented] (HIVE-10999) Upgrade Spark dependency to 1.4 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598755#comment-14598755 ] Chengxiang Li commented on HIVE-10999: -- It seems the latest uploaded patch passes all the tests except org.apache.hadoop.hive.cli.TestCliDriver.initializationError. :)
> Upgrade Spark dependency to 1.4 [Spark Branch]
> Key: HIVE-10999
> URL: https://issues.apache.org/jira/browse/HIVE-10999
[jira] [Updated] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-11053: - Assignee: (was: GAOLUN)
> Add more tests for HIVE-10844[Spark Branch]
> Key: HIVE-11053
> URL: https://issues.apache.org/jira/browse/HIVE-11053
[jira] [Commented] (HIVE-10844) Combine equivalent Works for HoS[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591169#comment-14591169 ] Chengxiang Li commented on HIVE-10844: -- The failed test should be irrelevant, [~xuefuz]; the patch is ready for review now.
> Combine equivalent Works for HoS[Spark Branch]
> Key: HIVE-10844
> URL: https://issues.apache.org/jira/browse/HIVE-10844
[jira] [Updated] (HIVE-10844) Combine equivalent Works for HoS[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10844: - Attachment: HIVE-10844.2-spark.patch Combine equivalent Works for HoS[Spark Branch] -- Key: HIVE-10844 URL: https://issues.apache.org/jira/browse/HIVE-10844 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10844.1-spark.patch, HIVE-10844.2-spark.patch Some Hive queries (like [TPCDS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql]) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork; combining these equivalent Works into a single one would help them benefit from the subsequent dynamic RDD caching optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9370) SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567055#comment-14567055 ] Chengxiang Li commented on HIVE-9370: - Thanks for asking, [~leftylev]. We print the error message to the CLI console, so I don't think we need to call this out specially in the documentation. SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch] -- Key: HIVE-9370 URL: https://issues.apache.org/jira/browse/HIVE-9370 Project: Hive Issue Type: Sub-task Components: Spark Reporter: yuyun.chen Assignee: Chengxiang Li Fix For: 1.1.0 Attachments: HIVE-9370.1-spark.patch Enabled Hive on Spark and ran BigBench Query 8, then got the following exception: 2015-01-14 11:43:46,057 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it. 2015-01-14 11:43:46,061 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it. 
2015-01-14 11:43:46,061 ERROR [main]: status.SparkJobMonitor (SessionState.java:printError(839)) - Status: Failed 2015-01-14 11:43:46,062 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=SparkRunJob start=1421206996052 end=1421207026062 duration=30010 from=org.apache.hadoop.hive.ql.exec.spark.status.SparkJobMonitor 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - 15/01/14 11:43:46 INFO RemoteDriver: Failed to run job 0a9a7782-0e0b-4561-8468-959a6d8df0a3 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - java.lang.InterruptedException 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at java.lang.Object.wait(Native Method) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at java.lang.Object.wait(Object.java:503) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1282) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1300) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1314) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) 
2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.rdd.RDD.collect(RDD.scala:780) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:262) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.RangePartitioner.init(Partitioner.scala:124) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:63) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:894) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:864) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.hadoop.hive.ql.exec.spark.SortByShuffler.shuffle(SortByShuffler.java:48) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at
[jira] [Updated] (HIVE-10844) Combine equivalent Works for HoS[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10844: - Attachment: HIVE-10844.1-spark.patch Combine equivalent Works for HoS[Spark Branch] -- Key: HIVE-10844 URL: https://issues.apache.org/jira/browse/HIVE-10844 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10844.1-spark.patch Some Hive queries (like [TPCDS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql]) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork; combining these equivalent Works into a single one would help them benefit from the subsequent dynamic RDD caching optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562312#comment-14562312 ] Chengxiang Li commented on HIVE-10550: -- Committed to the Spark branch; thanks [~xuefuz] for the review. Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch, HIVE-10550.4-spark.patch, HIVE-10550.5-spark.patch, HIVE-10550.6-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562241#comment-14562241 ] Chengxiang Li commented on HIVE-10550: -- Note: these configurations have been removed in the latest patch. Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch, HIVE-10550.4-spark.patch, HIVE-10550.5-spark.patch, HIVE-10550.6-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.6-spark.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch, HIVE-10550.4-spark.patch, HIVE-10550.5-spark.patch, HIVE-10550.6-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.5-spark.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch, HIVE-10550.4-spark.patch, HIVE-10550.5-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.4-spark.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch, HIVE-10550.4-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.2-spark.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: (was: HIVE-10550.2-spark.patch) Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.2-spark.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547807#comment-14547807 ] Chengxiang Li commented on HIVE-10550: -- I'm not sure why, but I keep failing to upload the patch to the hive-git repo on our RB; I will try again later. [~xuefuz], would you mind reviewing on GitHub (https://github.com/apache/hive/pull/36) first? Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541662#comment-14541662 ] Chengxiang Li commented on HIVE-10550: -- Newly added configurations:
||name||default value||
|hive.spark.dynamic.rdd.caching|true|
|hive.spark.dynamic.rdd.caching.threshold|100 * 1024 * 1024L (100M)|
Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10550: - Attachment: HIVE-10550.1.patch Dynamic RDD caching optimization for HoS.[Spark Branch] --- Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-10550.1.patch A Hive query may scan the same table multiple times, as in a self-join, a self-union, or a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves it from memory on subsequent accesses, which avoids the computation cost of that RDD (and of all its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to identify which parts of the query can be shared, so that the cached RDD can be reused in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
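The dynamic caching idea described in HIVE-10550 — compute a piece of equivalent work once and serve later requests from memory — can be sketched with a plain-Java memoization example. This is a conceptual illustration only; the class and signature names are invented and are not Hive's implementation:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch: the first request for a given "work signature" computes
// and stores the result; later requests with an equivalent signature reuse
// the cached copy instead of recomputing it (trading memory for CPU).
public class WorkResultCache {
    private final Map<String, List<Long>> cache = new HashMap<>();
    private int computations = 0; // counts how often real work was done

    public List<Long> getOrCompute(String signature, long base) {
        return cache.computeIfAbsent(signature, sig -> {
            computations++;
            // stand-in for an expensive table scan / RDD computation
            return Arrays.asList(base, base * 2, base * 3);
        });
    }

    public int computations() { return computations; }

    public static void main(String[] args) {
        WorkResultCache c = new WorkResultCache();
        List<Long> first = c.getOrCompute("scan(tableA)", 7L);
        List<Long> second = c.getOrCompute("scan(tableA)", 7L); // equivalent work
        System.out.println(first.equals(second)); // same data both times
        System.out.println(c.computations());     // computed only once
    }
}
```

The hypothetical `hive.spark.dynamic.rdd.caching.threshold` setting mentioned in the comments would correspond here to a size check before inserting into the map, so only sufficiently large results are cached.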
[jira] [Commented] (HIVE-10671) yarn-cluster mode offers a degraded performance from yarn-client [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543222#comment-14543222 ] Chengxiang Li commented on HIVE-10671: -- LGTM, +1 yarn-cluster mode offers a degraded performance from yarn-client [Spark Branch] --- Key: HIVE-10671 URL: https://issues.apache.org/jira/browse/HIVE-10671 Project: Hive Issue Type: Bug Components: Spark Reporter: Xuefu Zhang Assignee: Rui Li Attachments: HIVE-10671.1-spark.patch With Hive on Spark, users noticed that in certain cases spark.master=yarn-client offers 2x or 3x better performance than spark.master=yarn-cluster. However, yarn-cluster is what we recommend and support, so we should investigate and fix the problem. One such query is TPC-H query 22. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10548) Remove dependency to s3 repository in root pom
[ https://issues.apache.org/jira/browse/HIVE-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541373#comment-14541373 ] Chengxiang Li commented on HIVE-10548: -- Committed to master, thanks Szehon for review. Remove dependency to s3 repository in root pom -- Key: HIVE-10548 URL: https://issues.apache.org/jira/browse/HIVE-10548 Project: Hive Issue Type: Bug Components: Build Infrastructure Reporter: Szehon Ho Assignee: Chengxiang Li Attachments: HIVE-10548.2.patch, HIVE-10548.2.patch, HIVE-10548.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10548) Remove dependency to s3 repository in root pom
[ https://issues.apache.org/jira/browse/HIVE-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10548: - Attachment: HIVE-10548.2.patch Remove dependency to s3 repository in root pom -- Key: HIVE-10548 URL: https://issues.apache.org/jira/browse/HIVE-10548 Project: Hive Issue Type: Bug Components: Build Infrastructure Reporter: Szehon Ho Assignee: Chengxiang Li Attachments: HIVE-10548.2.patch, HIVE-10548.2.patch, HIVE-10548.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10548) Remove dependency to s3 repository in root pom
[ https://issues.apache.org/jira/browse/HIVE-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10548: - Attachment: HIVE-10548.2.patch Remove dependency to s3 repository in root pom -- Key: HIVE-10548 URL: https://issues.apache.org/jira/browse/HIVE-10548 Project: Hive Issue Type: Bug Components: Build Infrastructure Reporter: Szehon Ho Assignee: Szehon Ho Attachments: HIVE-10548.2.patch, HIVE-10548.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10235) Loop optimization for SIMD in ColumnDivideColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495522#comment-14495522 ] Chengxiang Li commented on HIVE-10235: -- Environment:
java version 1.8.0_40
Java(TM) SE Runtime Environment (build 1.8.0_40-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)
Intel(R) Core(TM) i3-2130 CPU @ 3.40GHz
Linux version 2.6.32-279.el6.x86_64
Loop optimization for SIMD in ColumnDivideColumn.txt Key: HIVE-10235 URL: https://issues.apache.org/jira/browse/HIVE-10235 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-10235.1.patch Found two loops that could be optimized for packed instruction sets during execution. 1. hasDivBy0 depends on the result of the previous iteration, which prevents the loop from being vectorized. {code:java} for(int i = 0; i != n; i++) { OperandType2 denom = vector2[i]; outputVector[i] = vector1[0] OperatorSymbol denom; hasDivBy0 = hasDivBy0 || (denom == 0); } {code} 2. As in HIVE-10180, the vector2\[0\] reference prevents the JVM from optimizing the loop into packed instructions. {code:java} for(int i = 0; i != n; i++) { outputVector[i] = vector1[i] OperatorSymbol vector2[0]; } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10235) Loop optimization for SIMD in ColumnDivideColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495524#comment-14495524 ] Chengxiang Li commented on HIVE-10235: -- The failed test is irrelevant. [~gopalv], could you help review this patch? Loop optimization for SIMD in ColumnDivideColumn.txt Key: HIVE-10235 URL: https://issues.apache.org/jira/browse/HIVE-10235 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-10235.1.patch Found two loops that could be optimized for packed instruction sets during execution. 1. hasDivBy0 depends on the result of the previous iteration, which prevents the loop from being vectorized. {code:java} for(int i = 0; i != n; i++) { OperandType2 denom = vector2[i]; outputVector[i] = vector1[0] OperatorSymbol denom; hasDivBy0 = hasDivBy0 || (denom == 0); } {code} 2. As in HIVE-10180, the vector2\[0\] reference prevents the JVM from optimizing the loop into packed instructions. {code:java} for(int i = 0; i != n; i++) { outputVector[i] = vector1[i] OperatorSymbol vector2[0]; } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10180) Loop optimization for SIMD in ColumnArithmeticColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491819#comment-14491819 ] Chengxiang Li commented on HIVE-10180: -- Committed to trunk; thanks Gopal for the review. Loop optimization for SIMD in ColumnArithmeticColumn.txt Key: HIVE-10180 URL: https://issues.apache.org/jira/browse/HIVE-10180 Project: Hive Issue Type: Sub-task Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-10180.1.patch, HIVE-10180.2.patch The JVM is quite strict about the code shape that can be executed with SIMD instructions; take a loop in DoubleColAddDoubleColumn.java as an example: {code:java} for (int i = 0; i != n; i++) { outputVector[i] = vector1[0] + vector2[i]; } {code} The vector1[0] reference prevents the JVM from executing this part of the code with vectorized instructions; we need to assign vector1[0] to a variable outside the loop and use that variable inside the loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
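The hoisting fix described in HIVE-10180 can be sketched in standalone Java. This is a minimal illustration of the pattern only; the method names are invented and this is not the generated Hive operator code:

```java
// Sketch of the scalar-hoisting pattern: reading vector1[0] inside the loop
// defeats HotSpot auto-vectorization, so the value is loaded into a local
// variable before the loop. Both methods compute identical results.
public class HoistScalar {
    // original shape: vector1[0] is dereferenced on every iteration
    static void addScalarSlow(double[] vector1, double[] vector2, double[] out, int n) {
        for (int i = 0; i != n; i++) {
            out[i] = vector1[0] + vector2[i];
        }
    }

    // SIMD-friendly shape: the repeated element is hoisted out of the loop
    static void addScalarFast(double[] vector1, double[] vector2, double[] out, int n) {
        double v = vector1[0];
        for (int i = 0; i != n; i++) {
            out[i] = v + vector2[i];
        }
    }

    public static void main(String[] args) {
        double[] v1 = {2.0};
        double[] v2 = {1.0, 2.0, 3.0};
        double[] a = new double[3], b = new double[3];
        addScalarSlow(v1, v2, a, 3);
        addScalarFast(v1, v2, b, 3);
        System.out.println(java.util.Arrays.equals(a, b)); // identical results
    }
}
```

The transformation changes nothing semantically; it only gives the JIT a loop body free of the repeated array load, which is the shape its auto-vectorizer recognizes.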
[jira] [Updated] (HIVE-10235) Loop optimization for SIMD in ColumnDivideColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10235: - Attachment: HIVE-10235.1.patch Tested with the JMH VectorizationBench via the following command: {code:bash} java -jar hive-jmh/target/benchmarks.jar org.apache.hive.benchmark.vectorization VectorizationBench -wi 3 -i 5 -f 1 -bm avgt -tu ms {code} The performance results look like:
||Expressions||/w patch(ms)||/w/o patch(ms)||
|DoubleColDivideDoubleColumn|4033|6654|
|DoubleColDivideRepeatingDoubleColumn|1563|3048|
|LongColDivideLongColumn|7354|7561|
|LongColDivideRepeatingColumn|3161|3163|
For double-array division in the loop, the packed instruction vdivpd is used instead of vdivsd with the patch applied; there is no such packed instruction for long division, so long-array division in the loop shows no improvement. Loop optimization for SIMD in ColumnDivideColumn.txt Key: HIVE-10235 URL: https://issues.apache.org/jira/browse/HIVE-10235 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Attachments: HIVE-10235.1.patch Found two loops that could be optimized for packed instruction sets during execution. 1. hasDivBy0 depends on the result of the previous iteration, which prevents the loop from being vectorized. {code:java} for(int i = 0; i != n; i++) { OperandType2 denom = vector2[i]; outputVector[i] = vector1[0] OperatorSymbol denom; hasDivBy0 = hasDivBy0 || (denom == 0); } {code} 2. As in HIVE-10180, the vector2\[0\] reference prevents the JVM from optimizing the loop into packed instructions. {code:java} for(int i = 0; i != n; i++) { outputVector[i] = vector1[i] OperatorSymbol vector2[0]; } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
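One way to break the loop-carried dependency on hasDivBy0 described in point 1 of HIVE-10235 can be sketched as follows. This is a hypothetical illustration of the general pattern (accumulate a numeric flag with no short-circuit control flow, derive the boolean after the loop), not the exact shape of the committed patch:

```java
// Sketch: replace the short-circuiting "hasDivBy0 = hasDivBy0 || ..." update,
// which carries a control dependency across iterations, with a plain numeric
// accumulation that keeps the loop body in a vectorizable shape.
public class DivFlag {
    static boolean divideAll(double[] vector1, double[] vector2, double[] out, int n) {
        int zeroCount = 0;
        for (int i = 0; i != n; i++) {
            double denom = vector2[i];
            out[i] = vector1[i] / denom; // double division never throws (yields Infinity/NaN)
            zeroCount += (denom == 0d) ? 1 : 0; // no short-circuit evaluation
        }
        return zeroCount > 0; // hasDivBy0
    }

    public static void main(String[] args) {
        double[] num = {1.0, 2.0, 3.0};
        double[] den = {1.0, 0.0, 3.0};
        double[] out = new double[3];
        System.out.println(divideAll(num, den, out, 3)); // a zero denominator exists
    }
}
```

Whether the JIT actually emits packed instructions for a given shape depends on the JVM version and CPU, as the vdivpd/vdivsd observation above shows; the point of the sketch is only the removal of the iteration-to-iteration boolean dependency.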
[jira] [Commented] (HIVE-10189) Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization
[ https://issues.apache.org/jira/browse/HIVE-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491822#comment-14491822 ] Chengxiang Li commented on HIVE-10189: -- Committed to the trunk, thanks Ferdinand for this contribution. Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization Key: HIVE-10189 URL: https://issues.apache.org/jira/browse/HIVE-10189 Project: Hive Issue Type: Sub-task Reporter: Ferdinand Xu Assignee: Ferdinand Xu Attachments: HIVE-10189.1.patch, HIVE-10189.2.patch, HIVE-10189.patch, avx-64.docx We should show the performance gain from SIMD optimization. The current score is as follows:
||Benchmark||Mode||Samples||Score||Error||Units||
|o.a.h.b.v.VectorizationBench.DoubleAddDoubleExpr.bench|avgt|2|20719.882|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.DoubleAddLongExpr.bench|avgt|2|22216.747|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.DoubleDivideDoubleExpr.bench|avgt|2|54319.682|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.DoubleDivideLongExpr.bench|avgt|2|34774.870|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.LongAddDoubleExpr.bench|avgt|2|47144.954|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.LongAddLongExpr.bench|avgt|2|21483.787|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.LongDivideDoubleExpr.bench|avgt|2|49765.990|± NaN|ns/op|
|o.a.h.b.v.VectorizationBench.LongDivideLongExpr.bench|avgt|2|34117.538|± NaN|ns/op|
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10189) Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization
[ https://issues.apache.org/jira/browse/HIVE-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485243#comment-14485243 ] Chengxiang Li commented on HIVE-10189: -- +1

Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization

Key: HIVE-10189
URL: https://issues.apache.org/jira/browse/HIVE-10189
Project: Hive
Issue Type: Sub-task
Reporter: Ferdinand Xu
Assignee: Ferdinand Xu
Attachments: HIVE-10189.1.patch, HIVE-10189.2.patch, HIVE-10189.patch, avx-64.docx

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10180) Loop optimization for SIMD in ColumnArithmeticColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482759#comment-14482759 ] Chengxiang Li commented on HIVE-10180: -- Does your machine support the AVX2 instruction set? You can verify this at [http://ark.intel.com/]. Besides, the JVM option -XX:UseAVX=<n> controls which AVX instruction set is used during execution.

Loop optimization for SIMD in ColumnArithmeticColumn.txt

Key: HIVE-10180
URL: https://issues.apache.org/jira/browse/HIVE-10180
Project: Hive
Issue Type: Sub-task
Reporter: Chengxiang Li
Assignee: Chengxiang Li
Priority: Minor
Attachments: HIVE-10180.1.patch, HIVE-10180.2.patch

The JVM is quite strict about the code shape that can be executed with SIMD instructions. Take a loop in DoubleColAddDoubleColumn.java for example:
{code:java}
for (int i = 0; i != n; i++) {
  outputVector[i] = vector1[0] + vector2[i];
}
{code}
The vector1[0] reference prevents the JVM from executing this part of the code with vectorized instructions; we need to assign vector1[0] to a variable outside the loop and use that variable inside the loop.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
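A minimal sketch of the transformation the issue describes (the class name here is illustrative; the real change is in Hive's generated column-arithmetic classes). The invariant vector1[0] load is assigned to a final local before the loop, which is also what the later "Set new variables to final" revision refers to:

```java
public class ScalarHoistSketch {
    // Shape described in the issue: the repeated vector1[0] load inside the
    // loop keeps the JVM from emitting packed (SIMD) instructions.
    static void addNaive(double[] vector1, double[] vector2, double[] outputVector, int n) {
        for (int i = 0; i != n; i++) {
            outputVector[i] = vector1[0] + vector2[i];
        }
    }

    // The fix: hoist vector1[0] into a final local outside the loop so the
    // body contains no array load on the invariant operand.
    static void addHoisted(double[] vector1, double[] vector2, double[] outputVector, int n) {
        final double value = vector1[0];
        for (int i = 0; i != n; i++) {
            outputVector[i] = value + vector2[i];
        }
    }

    public static void main(String[] args) {
        double[] v1 = {1.5, 0, 0, 0};
        double[] v2 = {0.5, 1.5, 2.5, 3.5};
        double[] a = new double[4], b = new double[4];
        addNaive(v1, v2, a, 4);
        addHoisted(v1, v2, b, 4);
        for (int i = 0; i < 4; i++) {
            if (a[i] != b[i]) throw new AssertionError("mismatch at " + i);
        }
        System.out.println(java.util.Arrays.toString(b));
    }
}
```

The two methods compute identical results; only the loop body's memory-access pattern changes, which is what lets the JIT emit packed instructions.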
[jira] [Updated] (HIVE-10180) Loop optimization for SIMD in ColumnArithmeticColumn.txt
[ https://issues.apache.org/jira/browse/HIVE-10180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-10180: - Attachment: HIVE-10180.2.patch Set new variables to final.

Loop optimization for SIMD in ColumnArithmeticColumn.txt

Key: HIVE-10180
URL: https://issues.apache.org/jira/browse/HIVE-10180
Project: Hive
Issue Type: Sub-task
Reporter: Chengxiang Li
Assignee: Chengxiang Li
Priority: Minor
Attachments: HIVE-10180.1.patch, HIVE-10180.2.patch

-- This message was sent by Atlassian JIRA (v6.3.4#6332)