[jira] [Commented] (HIVE-11693) CommonMergeJoinOperator throws exception with tez
[ https://issues.apache.org/jira/browse/HIVE-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302701#comment-15302701 ] Selina Zhang commented on HIVE-11693:
-

[~rajesh.balamohan], we hit the same issue recently, but I think the patch you attached does not fix the root problem. The issue is that CommonMergeJoinOperator only sets the big table position when it receives input for the big table:

{code:title=CommonMergeJoinOperator.java}
@Override
public void process(Object row, int tag) throws HiveException {
  posBigTable = (byte) conf.getBigTablePosition();
  ...
{code}

If the input is empty, the method above is never called. The query you listed involves a subquery: the generated table is tagged as 0, while the left table is tagged as 1. GenTezWork.java sets the big table position to 1 for both the reduce work and the CommonJoinOperator. In the reduce phase, when ReduceRecordProcessor executes, it retrieves records from the big table:

{code:title=ReduceRecordProcessor.java}
@Override
void run() throws Exception {
  // run the operator pipeline
  while (sources[bigTablePosition].pushRecord()) {
  }
}
{code}

The big table position here is 1. If the input from the big table is empty, this is the only place pushRecord() is called to read the big table. However, because CommonMergeJoinOperator never set its big table position, closeOp() treats tag 1 as a small table, so another pushRecord() is called to retrieve the table content. That produces the exception listed in this JIRA. Please let me know if my analysis has any problems; if you think it is correct, can you update the patch? Thanks.

> CommonMergeJoinOperator throws exception with tez
> -
>
> Key: HIVE-11693
> URL: https://issues.apache.org/jira/browse/HIVE-11693
> Project: Hive
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Attachments: HIVE-11693.1.patch
>
>
> Got this when executing a simple query with latest hive build + tez latest
> version.
> {noformat}
> Error: Failure while running task: attempt_1439860407967_0291_2_03_45_0:java.lang.RuntimeException: java.lang.RuntimeException: Hive Runtime Error while closing operators: java.lang.RuntimeException: java.io.IOException: Please check if you are invoking moveToNext() even after it returned false.
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
> at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Hive Runtime Error while closing operators: java.lang.RuntimeException: java.io.IOException: Please check if you are invoking moveToNext() even after it returned false.
> at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:316)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
> ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.io.IOException: Please check if you are invoking moveToNext() even after it returned false.
> at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:412)
> at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:375)
> at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.doFirstFetchIfNeeded(CommonMergeJoinOperator.java:482)
> at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinFinalLeftData(CommonMergeJoinOperator.java:434)
> at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.closeOp(CommonMergeJoinOperator.java:384)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
> at
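The interaction described in the comment above can be sketched in a few lines of standalone Java. All names here are hypothetical; this is an illustration of the failure mode (a position field that is only assigned inside process(), so it keeps its default when the big-table input is empty), not the actual Hive operator code.

```java
// Minimal standalone illustration of the failure mode described above
// (hypothetical names; a sketch, not the actual Hive operator code).
public class MergeJoinSketch {
    static final byte UNSET = 0;      // field default before process() ever runs
    byte posBigTable = UNSET;         // in the real operator, assigned inside process()
    final byte configuredBigTablePos; // what the planner configured (tag 1 here)

    MergeJoinSketch(int configuredBigTablePos) {
        this.configuredBigTablePos = (byte) configuredBigTablePos;
    }

    // Called once per input row -- never called when the big-table input is empty.
    void process(Object row) {
        posBigTable = configuredBigTablePos;
    }

    // closeOp()-style logic: any tag other than posBigTable is treated as a
    // small table and drained with an extra pushRecord()-like read.
    boolean wouldReFetchTag(int tag) {
        return tag != posBigTable;
    }

    public static void main(String[] args) {
        MergeJoinSketch op = new MergeJoinSketch(1);
        // Empty big-table input: process() never runs, posBigTable stays 0,
        // so close-time logic wrongly re-reads tag 1 as if it were small.
        System.out.println("empty input, re-fetch tag 1? " + op.wouldReFetchTag(1));
        op.process(new Object()); // any big-table row repairs the position
        System.out.println("after a row, re-fetch tag 1? " + op.wouldReFetchTag(1));
    }
}
```

Initializing the position from the configuration at operator setup time, rather than in process(), would avoid the stale default on empty inputs, which is the direction the comment suggests.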
[jira] [Commented] (HIVE-10841) [WHERE col is not null] does not work sometimes for queries with many JOIN statements
[ https://issues.apache.org/jira/browse/HIVE-10841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563668#comment-14563668 ] Selina Zhang commented on HIVE-10841:
-

This seems to have been introduced in Hive 0.13; testing with Hive 0.12 MR returned the correct result.

[WHERE col is not null] does not work sometimes for queries with many JOIN statements
-

Key: HIVE-10841
URL: https://issues.apache.org/jira/browse/HIVE-10841
Project: Hive
Issue Type: Bug
Components: Query Processor
Reporter: Alexander Pivovarov

The following SELECT query returns 3 rows but it should return 1 row. I checked it in MySQL - it returned 1 row.
To reproduce the issue in Hive:
1. prepare tables
{code}
drop table if exists L;
drop table if exists LA;
drop table if exists FR;
drop table if exists A;
drop table if exists PI;
drop table if exists acct;
create table L as select 4436 id;
create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
create table FR as select 4436 loan_id;
create table A as select 4748 id;
create table PI as select 4415 id;
create table acct as select 4748 aid, 10 acc_n, 122 brn;
insert into table acct values(4748, null, null);
insert into table acct values(4748, null, null);
{code}
2. run SELECT query
{code}
select acct.ACC_N, acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN PI ON PI.id = LA.pi_id
JOIN acct ON A.id = acct.aid
WHERE L.id = 4436 and acct.brn is not null;
{code}
the result is 3 rows
{code}
10	122
NULL	NULL
NULL	NULL
{code}
but it should be 1 row
{code}
10	122
{code}
2.1 explain select ...
output for hive-1.3.0 MR
{code}
STAGE DEPENDENCIES:
  Stage-12 is a root stage
  Stage-9 depends on stages: Stage-12
  Stage-0 depends on stages: Stage-9

STAGE PLANS:
  Stage: Stage-12
    Map Reduce Local Work
      Alias -> Map Local Tables:
        a
          Fetch Operator
            limit: -1
        acct
          Fetch Operator
            limit: -1
        fr
          Fetch Operator
            limit: -1
        l
          Fetch Operator
            limit: -1
        pi
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        a
          TableScan
            alias: a
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col5 (type: int)
                  1 id (type: int)
                  2 aid (type: int)
        acct
          TableScan
            alias: acct
            Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: aid is not null (type: boolean)
              Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col5 (type: int)
                  1 id (type: int)
                  2 aid (type: int)
        fr
          TableScan
            alias: fr
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (loan_id = 4436) (type: boolean)
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 4436 (type: int)
                  1 4436 (type: int)
                  2 4436 (type: int)
        l
          TableScan
            alias: l
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id = 4436) (type: boolean)
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 4436 (type: int)
                  1 4436 (type: int)
                  2 4436 (type: int)
        pi
          TableScan
            alias: pi
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col6 (type: int)
                  1 id (type: int)
[jira] [Commented] (HIVE-10809) HCat FileOutputCommitterContainer leaves behind empty _SCRATCH directories
[ https://issues.apache.org/jira/browse/HIVE-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561450#comment-14561450 ] Selina Zhang commented on HIVE-10809:
-

Thanks, [~swarnim]! I have added a new patch to address concern 1. For concern 2, because we always drop the table at the beginning of each test case, I think it is not necessary to verify that the table directory exists. I also modified the test testMultiPartColsInData() to make sure the multi-level partition keys case works.

HCat FileOutputCommitterContainer leaves behind empty _SCRATCH directories
--

Key: HIVE-10809
URL: https://issues.apache.org/jira/browse/HIVE-10809
Project: Hive
Issue Type: Bug
Components: HCatalog
Affects Versions: 1.2.0
Reporter: Selina Zhang
Assignee: Selina Zhang
Attachments: HIVE-10809.1.patch, HIVE-10809.2.patch

When a static partition is added through HCatStorer or HCatWriter:
{code}
JoinedData = LOAD '/user/selinaz/data/part-r-0' USING JsonLoader();
STORE JoinedData INTO 'selina.joined_events_e' USING org.apache.hive.hcatalog.pig.HCatStorer('author=selina');
{code}
The table directory looks like
{noformat}
drwx--  - selinaz users  0 2015-05-22 21:19 /user/selinaz/joined_events_e/_SCRATCH0.9157208938193798
drwx--  - selinaz users  0 2015-05-22 21:19 /user/selinaz/joined_events_e/author=selina
{noformat}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
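The fix being discussed amounts to removing the leftover empty _SCRATCH* directories once output is committed. A minimal sketch of that idea, using plain java.nio on a local directory rather than the Hadoop FileSystem API that FileOutputCommitterContainer actually uses (class and method names here are illustrative, not from the patch):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Sketch of the cleanup idea: after committing, delete any empty _SCRATCH*
// directories left behind in the table directory. Illustrative only.
public class ScratchCleanup {
    static void removeEmptyScratchDirs(Path tableDir) throws IOException {
        try (Stream<Path> entries = Files.list(tableDir)) {
            for (Path p : (Iterable<Path>) entries::iterator) {
                if (Files.isDirectory(p)
                        && p.getFileName().toString().startsWith("_SCRATCH")
                        && isEmpty(p)) {
                    Files.delete(p); // Files.delete only removes empty directories
                }
            }
        }
    }

    static boolean isEmpty(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return !s.iterator().hasNext();
        }
    }

    public static void main(String[] args) throws IOException {
        Path table = Files.createTempDirectory("joined_events_e");
        Files.createDirectory(table.resolve("_SCRATCH0.9157208938193798"));
        Path part = Files.createDirectory(table.resolve("author=selina"));
        removeEmptyScratchDirs(table);
        // The scratch dir is gone; the partition directory survives.
        System.out.println(Files.exists(part));
    }
}
```

Only empty scratch directories are touched, so a concurrently written scratch directory that still holds data would be left alone under this sketch.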
[jira] [Updated] (HIVE-10809) HCat FileOutputCommitterContainer leaves behind empty _SCRATCH directories
[ https://issues.apache.org/jira/browse/HIVE-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10809:
Attachment: HIVE-10809.3.patch
[jira] [Updated] (HIVE-10809) HCat FileOutputCommitterContainer leaves behind empty _SCRATCH directories
[ https://issues.apache.org/jira/browse/HIVE-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10809:
Attachment: HIVE-10809.2.patch

The unit test failures above do not appear related to this patch. Uploaded a new patch that adds verification in TestHCatStorer that the scratch directories are removed.
[jira] [Commented] (HIVE-10809) HCat FileOutputCommitterContainer leaves behind empty _SCRATCH directories
[ https://issues.apache.org/jira/browse/HIVE-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557498#comment-14557498 ] Selina Zhang commented on HIVE-10809:
-

Thanks! Will do!
[jira] [Updated] (HIVE-10809) HCat FileOutputCommitterContainer leaves behind empty _SCRATCH directories
[ https://issues.apache.org/jira/browse/HIVE-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10809:
Attachment: HIVE-10809.1.patch
[jira] [Commented] (HIVE-10729) Query failed when select complex columns from joinned table (tez map join only)
[ https://issues.apache.org/jira/browse/HIVE-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555069#comment-14555069 ] Selina Zhang commented on HIVE-10729:
-

The unit test failure above does not appear related to this patch.

Query failed when select complex columns from joinned table (tez map join only)
---

Key: HIVE-10729
URL: https://issues.apache.org/jira/browse/HIVE-10729
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 1.2.0
Reporter: Selina Zhang
Assignee: Selina Zhang
Attachments: HIVE-10729.1.patch, HIVE-10729.2.patch

When a map join happens and the projection columns include complex data types, the query fails. Steps to reproduce:
{code:sql}
hive> set hive.auto.convert.join;
hive.auto.convert.join=true
hive> desc foo;
a	array<int>
hive> select * from foo;
[1,2]
hive> desc src_int;
key	int
value	string
hive> select * from src_int where key=2;
2	val_2
hive> select * from foo join src_int src on src.key = foo.a[1];
{code}
The query fails with stack trace
{noformat}
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray cannot be cast to [Ljava.lang.Object;
at org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getList(StandardListObjectInspector.java:111)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:314)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:692)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:644)
at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:676)
at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:754)
at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:386)
... 23 more
{noformat}
A similar error occurs when the projection columns include a map:
{code:sql}
hive> CREATE TABLE test (a INT, b MAP<INT, STRING>) STORED AS ORC;
hive> INSERT OVERWRITE TABLE test SELECT 1, MAP(1, "val_1", 2, "val_2") FROM src LIMIT 1;
hive> select * from src join test where src.key=test.a;
{code}
[jira] [Updated] (HIVE-10729) Query failed when select complex columns from joinned table (tez map join only)
[ https://issues.apache.org/jira/browse/HIVE-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10729:
Attachment: HIVE-10729.2.patch

Added: pre-size the array list to the fields size.
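For context on the "pre-size the array list" note above, here is a generic illustration of the idea (illustrative only, not the patch itself): when the element count is known up front, passing it to the ArrayList constructor avoids repeated internal array growth while the fields are copied.

```java
import java.util.ArrayList;
import java.util.List;

// Generic illustration of pre-sizing a list to a known field count.
public class PreSize {
    static List<Object> copyFields(Object[] fields) {
        // Pre-sized to the fields size, so no intermediate resizing occurs.
        List<Object> out = new ArrayList<>(fields.length);
        for (Object f : fields) {
            out.add(f);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(copyFields(new Object[] {1, 2}).size());
    }
}
```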
[jira] [Commented] (HIVE-10729) Query failed when select complex columns from joinned table (tez map join only)
[ https://issues.apache.org/jira/browse/HIVE-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14552853#comment-14552853 ] Selina Zhang commented on HIVE-10729:
-

Yes. Will do.
[jira] [Updated] (HIVE-10729) Query failed when select complex columns from joinned table (tez map join only)
[ https://issues.apache.org/jira/browse/HIVE-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10729:
Summary: Query failed when select complex columns from joinned table (tez map join only) (was: Query failed when join table with complex types (tez map join only))
[jira] [Updated] (HIVE-10729) Query failed when join table with complex types (tez map join only)
[ https://issues.apache.org/jira/browse/HIVE-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10729:
Summary: Query failed when join table with complex types (tez map join only) (was: Query failed when join on an element in complex type (tez map join only))
[jira] [Updated] (HIVE-10729) Query failed when select complex columns from joinned table (tez map join only)
[ https://issues.apache.org/jira/browse/HIVE-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10729: Description: When map join happens, if projection columns are complex data types, query will fail. Steps to reproduce: {code:sql} hive set hive.auto.convert.join; hive.auto.convert.join=true hive desc foo; a arrayint hive select * from foo; [1,2] hive desc src_int; key int value string hive select * from src_int where key=2; 2 val_2 hive select * from foo join src on src.key = foo.a[1]; {code} Query will fail with stack trace {noformat} Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray cannot be cast to [Ljava.lang.Object; at org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getList(StandardListObjectInspector.java:111) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:314) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246) at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:692) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837) at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837) at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:644) at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:676) at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:754) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:386) ... 
23 more
{noformat}
A similar error occurs when joining on a map key:
{code:sql}
hive> CREATE TABLE test (a INT, b MAP<INT, STRING>) STORED AS ORC;
hive> INSERT OVERWRITE TABLE test SELECT 1, MAP(1, 'val_1', 2, 'val_2') FROM src LIMIT 1;
hive> select * from src join test where src.key=test.a;
{code}
was: When a map join happens and the projected columns from the small table are complex data types, the query fails. Steps to reproduce:
{code:sql}
hive> set hive.auto.convert.join;
hive.auto.convert.join=true
hive> desc foo;
a	array<int>
hive> select * from foo;
[1,2]
hive> desc src_int;
key	int
value	string
hive> select * from src_int where key=2;
2	val_2
hive> select * from foo join src on src.key = foo.a[1];
{code}
The query fails with this stack trace:
{noformat}
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray cannot be cast to [Ljava.lang.Object;
	at org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getList(StandardListObjectInspector.java:111)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:314)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
	at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:692)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:644)
	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:676)
	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:754)
	at
org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:386)
	... 23 more
{noformat}
A similar error occurs when joining on a map key:
{code:sql}
hive> CREATE TABLE test (a INT, b MAP<INT, STRING>) STORED AS ORC;
hive> INSERT OVERWRITE TABLE test SELECT 1, MAP(1, 'val_1', 2, 'val_2') FROM src LIMIT 1;
hive> select * from src join test where src.value=test.b[2];
{code}
Query failed when selecting complex columns from a joined table (tez map join only) --- Key: HIVE-10729 URL:
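The ClassCastException above happens because StandardListObjectInspector.getList casts its input straight to Object[], while the tez map-join path hands it a LazyBinaryArray, which is a list-like object but not a Java array. The underlying JVM failure mode can be reproduced in isolation with plain JDK classes (CastDemo and its getList are illustrative stand-ins, not Hive code):

```java
import java.util.ArrayList;
import java.util.List;

public class CastDemo {
    // Stand-in for an inspector that blindly casts its input to Object[]:
    // works for genuine arrays, throws for any other list-like object.
    static Object[] getList(Object data) {
        return (Object[]) data;
    }

    public static void main(String[] args) {
        // A genuine Object[] passes through the cast fine.
        Object[] ok = getList(new Object[] {1, 2});
        System.out.println(ok.length); // 2

        // A List (standing in for LazyBinaryArray) holds the same elements
        // but is not an Object[], so the cast throws ClassCastException.
        List<Integer> lazy = new ArrayList<>();
        lazy.add(1);
        lazy.add(2);
        try {
            getList(lazy);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException");
        }
    }
}
```

This is why the fix has to go through the object's own inspector rather than a raw cast: the runtime representation of a complex column differs between the lazy-binary and standard paths.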
[jira] [Updated] (HIVE-10729) Query failed when joining on an element of a complex type (tez map join only)
[ https://issues.apache.org/jira/browse/HIVE-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10729: Description: Steps to reproduce:
{code:sql}
hive> set hive.auto.convert.join;
hive.auto.convert.join=true
hive> desc foo;
a	array<int>
hive> select * from foo;
[1,2]
hive> desc src_int;
key	int
value	string
hive> select * from src_int where key=2;
2	val_2
hive> select * from foo join src on src.key = foo.a[1];
{code}
The query fails with this stack trace:
{noformat}
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray cannot be cast to [Ljava.lang.Object;
	at org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getList(StandardListObjectInspector.java:111)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:314)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
	at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:692)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:644)
	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:676)
	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:754)
	at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:386)
	...
23 more
{noformat}
A similar error occurs when joining on a map key:
{code:sql}
hive> CREATE TABLE test (a INT, b MAP<INT, STRING>) STORED AS ORC;
hive> INSERT OVERWRITE TABLE test SELECT 1, MAP(1, 'val_1', 2, 'val_2') FROM src LIMIT 1;
hive> select * from src join test where src.value=test.b[2];
{code}
Query failed when joining on an element of a complex type (tez map join only) Key: HIVE-10729 URL: https://issues.apache.org/jira/browse/HIVE-10729 Project: Hive Issue Type: Bug Components: Query Processor Affects Versions: 1.2.0 Reporter: Selina Zhang Assignee: Selina Zhang Steps to reproduce:
{code:sql}
hive> set hive.auto.convert.join;
hive.auto.convert.join=true
hive> desc foo;
a	array<int>
hive> select * from foo;
[1,2]
hive> desc src_int;
key	int
value	string
hive> select * from src_int where key=2;
2	val_2
hive> select * from foo join src on src.key = foo.a[1];
{code}
The query fails with this stack trace:
{noformat}
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray cannot be cast to [Ljava.lang.Object;
	at org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getList(StandardListObjectInspector.java:111)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:314)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
	at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:692)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:644)
	at
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:676)
	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:754)
	at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:386)
	... 23 more
{noformat}
A similar error occurs when joining on a map key:
{code:sql}
hive> CREATE TABLE test (a INT, b MAP<INT, STRING>) STORED AS ORC;
hive> INSERT OVERWRITE TABLE test SELECT 1, MAP(1, 'val_1', 2, 'val_2') FROM src LIMIT 1;
hive> select * from src join test where
[jira] [Commented] (HIVE-10308) Vectorization execution throws java.lang.IllegalArgumentException: Unsupported complex type: MAP
[ https://issues.apache.org/jira/browse/HIVE-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544611#comment-14544611 ] Selina Zhang commented on HIVE-10308: - [~mmccline] Thanks for reviewing this! Which patch fixed this problem? Vectorization execution throws java.lang.IllegalArgumentException: Unsupported complex type: MAP Key: HIVE-10308 URL: https://issues.apache.org/jira/browse/HIVE-10308 Project: Hive Issue Type: Bug Components: Vectorization Affects Versions: 0.14.0, 0.13.1, 1.2.0, 1.1.0 Reporter: Selina Zhang Assignee: Matt McCline Attachments: HIVE-10308.1.patch Steps to reproduce:
{code:sql}
CREATE TABLE test_orc (a INT, b MAP<INT, STRING>) STORED AS ORC;
INSERT OVERWRITE TABLE test_orc SELECT 1, MAP(1, 'one', 2, 'two') FROM src LIMIT 1;
CREATE TABLE test (key INT);
INSERT OVERWRITE TABLE test SELECT 1 FROM src LIMIT 1;
set hive.vectorized.execution.enabled=true;
set hive.auto.convert.join=false;
select l.key from test l left outer join test_orc r on (l.key = r.a) where r.a is not null;
{code}
Stack trace:
{noformat}
Caused by: java.lang.IllegalArgumentException: Unsupported complex type: MAP
	at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpressionWriterFactory.genVectorExpressionWritable(VectorExpressionWriterFactory.java:456)
	at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpressionWriterFactory.processVectorInspector(VectorExpressionWriterFactory.java:1191)
	at org.apache.hadoop.hive.ql.exec.vector.VectorReduceSinkOperator.initializeOp(VectorReduceSinkOperator.java:58)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:362)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:481)
	at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:438)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
	at org.apache.hadoop.hive.ql.exec.MapOperator.initializeMapOperator(MapOperator.java:442)
	at
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:198) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537540#comment-14537540 ] Selina Zhang commented on HIVE-10036: - The above two unit test failures seem unrelated to this patch. Writing ORC format big table causes OOM - too many fixed sized stream buffers Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Labels: orcfile Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch, HIVE-10036.8.patch, HIVE-10036.9.patch The ORC writer keeps multiple output streams for each column, and each output stream is allocated a fixed-size ByteBuffer (configurable, defaulting to 256K). For a big table, the memory cost is unbearable, especially when HCatalog dynamic partitioning is involved and several hundred files may be open for writing at the same time (the same problem exists for FileSinkOperator). The global ORC memory manager controls the buffer size, but it only kicks in at 5000-row intervals. That could be improved, but the problem is that reducing the buffer size leads to worse compression and more IO on the read path, and sacrificing read performance is never a good choice. I changed the fixed-size ByteBuffer to a dynamically growing buffer whose upper bound is the existing configurable buffer size. Most streams do not need a large buffer, so performance improved significantly. Comparing to Facebook's hive-dwrf, I measured a 2x performance gain with this fix. Solving OOM for ORC completely may need a lot of effort, but this is definitely low-hanging fruit.
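The idea described above, a buffer that starts small and doubles on demand up to the configured per-stream limit, can be sketched roughly as follows. This is an illustrative standalone sketch, not Hive's actual patch; GrowableStreamBuffer and its methods are invented names, and real ORC streams would also need flush/compression hooks:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: a stream buffer that grows on demand instead of
// pre-allocating the full configured size (e.g. 256K) for every stream.
public class GrowableStreamBuffer {
    private final int maxCapacity;   // the existing configurable upper bound
    private ByteBuffer buf;

    public GrowableStreamBuffer(int initialCapacity, int maxCapacity) {
        this.maxCapacity = maxCapacity;
        this.buf = ByteBuffer.allocate(initialCapacity);
    }

    public void write(byte[] data) {
        if (buf.remaining() < data.length) {
            // Double the capacity until the data fits, never past the cap.
            int newCap = buf.capacity();
            while (newCap - buf.position() < data.length && newCap < maxCapacity) {
                newCap = Math.min(newCap * 2, maxCapacity);
            }
            if (newCap - buf.position() < data.length) {
                // At the cap a real stream would spill/flush; here we just signal it.
                throw new IllegalStateException("buffer full, flush required");
            }
            ByteBuffer bigger = ByteBuffer.allocate(newCap);
            buf.flip();          // prepare existing bytes for copying
            bigger.put(buf);     // copy them into the larger buffer
            buf = bigger;
        }
        buf.put(data);
    }

    public int capacity() { return buf.capacity(); }
    public int size() { return buf.position(); }
}
```

Small streams never pay for the full 256K allocation, which is where the memory savings for wide tables and many-open-file dynamic-partition writes come from.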
[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10036: Attachment: HIVE-10036.9.patch Fixed the unit tests. Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Labels: orcfile Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch, HIVE-10036.8.patch, HIVE-10036.9.patch
[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10036: Attachment: HIVE-10036.8.patch Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Labels: orcfile Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch, HIVE-10036.8.patch
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526023#comment-14526023 ] Selina Zhang commented on HIVE-10036: - [~gopalv] Thank you! I added the io.netty dependency to ql/pom.xml and uploaded a new patch. Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Labels: orcfile Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch
[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10036: Attachment: HIVE-10036.7.patch Fixed ql/pom.xml Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Labels: orcfile Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch
[jira] [Commented] (HIVE-10308) Vectorization execution throws java.lang.IllegalArgumentException: Unsupported complex type: MAP
[ https://issues.apache.org/jira/browse/HIVE-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493272#comment-14493272 ] Selina Zhang commented on HIVE-10308: - Actually, the test results show only one failure, and that failure is 33 days old, so it should be unrelated. Key: HIVE-10308 URL: https://issues.apache.org/jira/browse/HIVE-10308 Project: Hive Issue Type: Bug Components: Vectorization Affects Versions: 0.14.0, 0.13.1, 1.2.0, 1.1.0 Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10308.1.patch
[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10036: Attachment: HIVE-10036.5.patch Thanks Mithun and Prasanth! Uploaded modified patch. Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch
[jira] [Updated] (HIVE-10308) Vectorization execution throws java.lang.IllegalArgumentException: Unsupported complex type: MAP
[ https://issues.apache.org/jira/browse/HIVE-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10308: Attachment: HIVE-10308.1.patch Key: HIVE-10308 URL: https://issues.apache.org/jira/browse/HIVE-10308 Project: Hive Issue Type: Bug Components: Vectorization Affects Versions: 0.14.0, 0.13.1, 1.2.0, 1.1.0 Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10308.1.patch
[jira] [Updated] (HIVE-10089) RCFile: lateral view explode caused ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HIVE-10089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10089: Attachment: HIVE-10089.1.patch RCFile: lateral view explode caused ConcurrentModificationException Key: HIVE-10089 URL: https://issues.apache.org/jira/browse/HIVE-10089 Project: Hive Issue Type: Bug Affects Versions: 1.2.0 Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10089.1.patch
{code:sql}
CREATE TABLE test_table123 (a INT, b MAP<STRING, STRING>) STORED AS RCFILE;
INSERT OVERWRITE TABLE test_table123 SELECT 1, MAP('a1', 'b1', 'c1', 'd1') FROM src LIMIT 1;
{code}
The following query leads to a ConcurrentModificationException:
{code:sql}
SELECT * FROM (SELECT b FROM test_table123) t1 LATERAL VIEW explode(b) x AS b,c LIMIT 1;
{code}
{noformat}
Failed with exception java.io.IOException:java.util.ConcurrentModificationException
{noformat}
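A ConcurrentModificationException of this kind generally means one code path is still iterating a collection (here, the exploded map column) while another path mutates it, for example a reused lazy map being re-populated for the next row mid-iteration. The JDK's fail-fast behavior can be reproduced in isolation with plain JDK classes (CmeDemo is illustrative, not Hive code):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class CmeDemo {
    // Returns true if structurally modifying the map while iterating
    // its entry set throws ConcurrentModificationException.
    static boolean mutateWhileIterating() {
        Map<String, String> m = new HashMap<>();
        m.put("a1", "b1");
        m.put("c1", "d1");
        try {
            for (Map.Entry<String, String> e : m.entrySet()) {
                // Adding a new key during iteration bumps the map's
                // modCount; the fail-fast iterator throws on its next step.
                m.put("e1", "f1");
            }
        } catch (ConcurrentModificationException ex) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(mutateWhileIterating()); // true
    }
}
```

The usual fixes are to iterate over a copy of the collection, or to make sure the producer does not reuse and refill the object the consumer is still walking.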
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378863#comment-14378863 ] Selina Zhang commented on HIVE-10036: - [~gopalv] I checked my Maven repository; it seems avro 1.7.5 pulled in netty-3.4.0.Final, which overrode the version defined in the main pom.xml. Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378176#comment-14378176 ] Selina Zhang commented on HIVE-10036: - The review request: https://reviews.apache.org/r/32445/ Thank you! Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch
[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10036: Attachment: HIVE-10036.2.patch Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch
[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Selina Zhang updated HIVE-10036: Attachment: HIVE-10036.1.patch Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch