[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files
[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604902#comment-14604902 ]

Gopal V commented on HIVE-11043:
--------------------------------

[~leftylev]: yes, it needs doc - I will write up a "decision" tree of the hybrid strategy for the docs.

> ORC split strategies should adapt based on number of files
> ----------------------------------------------------------
>
>                 Key: HIVE-11043
>                 URL: https://issues.apache.org/jira/browse/HIVE-11043
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Gopal V
>             Fix For: 2.0.0
>
>         Attachments: HIVE-11043.1.patch, HIVE-11043.2.patch, HIVE-11043.3.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average file size. It would be beneficial to choose a different strategy based on number of files as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files
[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604841#comment-14604841 ]

Lefty Leverenz commented on HIVE-11043:
---------------------------------------

Does this need documentation?

Also, shouldn't Fix Version include 1.3.0 (commit 64f8e0f069f71f82518a9280d199f790174bee33 to branch-1)?
[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files
[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600875#comment-14600875 ]

Hive QA commented on HIVE-11043:
--------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12741727/HIVE-11043.3.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 9025 tests executed
*Failed tests:*
{noformat}
org.apache.hive.hcatalog.streaming.TestStreaming.testAddPartition
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4377/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4377/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4377/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12741727 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files
[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600320#comment-14600320 ]

Gopal V commented on HIVE-11043:
--------------------------------

Filed HIVE-11102 to track the issue - this error is not caused by this patch, but has been exposed by it (i.e. by picking the ETL strategy instead of BI).
[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files
[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598893#comment-14598893 ]

Gopal V commented on HIVE-11043:
--------------------------------

[~prasanth_j]: sure, looks like errors when reading footers for the 1 file/1 split case.

The error is actually

{code}
Caused by: java.lang.IndexOutOfBoundsException: Index: 0
	at java.util.Collections$EmptyList.get(Collections.java:3212)
	at org.apache.hadoop.hive.ql.io.orc.OrcProto$Type.getSubtypes(OrcProto.java:12240)
	at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getColumnIndicesFromNames(ReaderImpl.java:651)
	at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getRawDataSizeOfColumns(ReaderImpl.java:634)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:938)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:847)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:713)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
{code}
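The top frames of the trace above show `get(0)` being called on an empty list (`Collections$EmptyList`). The following is a minimal standalone sketch of that failure mode, with a defensive check next to it. It does not touch the ORC classes and is not the fix for this issue; `EmptyListGet` and `firstOrDefault` are hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of the failure mode in the trace; not Hive code. */
final class EmptyListGet {

    /** Returns the first element, or the fallback when the list is empty. */
    static int firstOrDefault(List<Integer> subtypes, int fallback) {
        return subtypes.isEmpty() ? fallback : subtypes.get(0);
    }

    public static void main(String[] args) {
        List<Integer> none = Collections.emptyList();
        try {
            // Same call shape as the top frame: get(0) on an empty list.
            none.get(0);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("caught: " + e.getClass().getSimpleName());
        }
        // The guarded variant returns the fallback instead of throwing.
        System.out.println(firstOrDefault(none, -1));
    }
}
```

The guarded helper only shows one way to avoid the exception; whether an empty subtype list should be treated as valid input at all is a separate question.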
[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files
[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598799#comment-14598799 ]

Pengcheng Xiong commented on HIVE-11043:
----------------------------------------

[~prasanth_j] and [~gopalv], [~jpullokkaran] asked me to track the test cases that have recently been failing consistently on master, and that led me here. It seems that this patch causes the problem. At first sight, authorization_delete.q looks unrelated. However, it creates a table stored as ORC. If I revert this patch, the test cases pass. Could you guys take a look? Thanks.
[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files
[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597295#comment-14597295 ]

Prasanth Jayachandran commented on HIVE-11043:
----------------------------------------------

LGTM, +1. I don't think the test failures are related.
[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files
[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597289#comment-14597289 ]

Hive QA commented on HIVE-11043:
--------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12741166/HIVE-11043.2.patch

{color:red}ERROR:{color} -1 due to 7 failed/errored test(s), 9014 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete_own_table
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update_own_table
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_outer_join1
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_outer_join4
org.apache.hive.hcatalog.pig.TestHCatStorer.testEmptyStore[3]
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4345/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4345/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4345/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 7 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12741166 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files
[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596843#comment-14596843 ]

Gopal V commented on HIVE-11043:
--------------------------------

bq. 3) ... In which case we will end up using BI as default even though there are only small number of files.
bq. 5) Should we make this independently configurable? Instead of using the cache max size.

The max cache size is a safety limit for huge clusters, it is not a configuration requirement.

If you need to change the behaviour explicitly, the right config to change is the strategy used (between ETL/BI) to select whichever one's the preferred one.
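The distinction drawn in this reply, a safety cap versus an explicit strategy choice, can be sketched roughly as below. This is an illustration under assumptions, not the patch: `StrategyOverride`, `Mode`, `effectiveMinSplits` and `resolve` are hypothetical names, the cap is assumed to be a simple minimum of the two values, and the average-file-size test from HIVE-10114 is left out to keep the example short. In Hive itself the explicit choice is believed to be the `hive.exec.orc.split.strategy` setting introduced with HIVE-10114.

```java
/** Hypothetical sketch: explicit strategy choice vs. a capped file-count threshold. */
final class StrategyOverride {
    enum Mode { HYBRID, ETL, BI }   // what the user configured
    enum Strategy { ETL, BI }       // what actually gets used

    /** Assumption: the cache size only caps the threshold, it is not a tuning knob. */
    static int effectiveMinSplits(int requestedMinSplits, int cacheMaxSize) {
        return Math.min(cacheMaxSize, requestedMinSplits);
    }

    static Strategy resolve(Mode mode, int numFiles, int requestedMinSplits, int cacheMaxSize) {
        switch (mode) {
            case ETL:
                return Strategy.ETL;   // explicit choice bypasses the heuristic
            case BI:
                return Strategy.BI;    // explicit choice bypasses the heuristic
            default:
                // HYBRID: few files relative to the (capped) threshold favours ETL.
                int threshold = effectiveMinSplits(requestedMinSplits, cacheMaxSize);
                return numFiles <= threshold ? Strategy.ETL : Strategy.BI;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(Mode.HYBRID, 5, 10000, 1000));    // few files
        System.out.println(resolve(Mode.HYBRID, 5000, 10000, 1000)); // above the capped threshold
        System.out.println(resolve(Mode.ETL, 5000, 10000, 1000));    // explicit override
    }
}
```

With these made-up numbers the three calls resolve to ETL, BI and ETL respectively; the last call shows that an explicit mode wins regardless of the file count.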
[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files
[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593575#comment-14593575 ]

Prasanth Jayachandran commented on HIVE-11043:
----------------------------------------------

Mostly looks good. A few questions/comments:

1) Can we use the same default for numSplits as MR? 1 instead of -1. This will make ETL strategy the default even in the presence of a single small file.
{code}
return generateSplitsInfo(conf, -1);
{code}
2) The condition should be numFiles <= context.minSplits, right? This will avoid choosing BI in the case of 1 small file.
3) I tried some queries and the numSplits arg in getSplits() can become 0. In which case we will end up using BI as default even though there are only a small number of files.
4) Some more tests for these corner cases will be helpful.
5) Should we make this independently configurable? Instead of using the cache max size.
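The corner cases raised in points 1) to 3) can be illustrated with a small sketch. This is not the code in the patch: `SplitHeuristic`, `chooseStrategy` and the 256 MB size threshold are hypothetical, chosen only to contrast `<` with `<=` and to show what a numSplits hint of 1, 0 or -1 does to a single small file.

```java
/** Hypothetical sketch of a file-count-aware strategy choice; not the HIVE-11043 patch. */
final class SplitHeuristic {
    enum Strategy { ETL, BI }

    /**
     * Picks ETL when files are large on average or when there are few files.
     *
     * @param inclusive true compares with numFiles <= minSplits,
     *                  false compares with numFiles < minSplits
     */
    static Strategy chooseStrategy(long avgFileSize, long maxSplitSize,
                                   int numFiles, int minSplits, boolean inclusive) {
        boolean fewFiles = inclusive ? numFiles <= minSplits : numFiles < minSplits;
        return (avgFileSize > maxSplitSize || fewFiles) ? Strategy.ETL : Strategy.BI;
    }

    public static void main(String[] args) {
        long maxSplitSize = 256L * 1024 * 1024; // assumed threshold, for illustration only
        long smallFile = 1024;                   // one small file

        // Hint of 1 (the MR-style default from point 1): only "<=" selects ETL.
        System.out.println(chooseStrategy(smallFile, maxSplitSize, 1, 1, true));
        System.out.println(chooseStrategy(smallFile, maxSplitSize, 1, 1, false));

        // Hint of 0 or -1 (points 1 and 3): a single small file falls through to BI.
        System.out.println(chooseStrategy(smallFile, maxSplitSize, 1, 0, true));
        System.out.println(chooseStrategy(smallFile, maxSplitSize, 1, -1, true));
    }
}
```

Under these assumptions only the first call selects ETL, which is the behaviour points 1) and 2) ask for; the other three select BI.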
[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files
[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593104#comment-14593104 ]

Hive QA commented on HIVE-11043:
--------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12740561/HIVE-11043.1.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 9010 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_outer_join4
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join28
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4317/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4317/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4317/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12740561 - PreCommit-HIVE-TRUNK-Build