[jira] [Updated] (HIVE-17404) Orc split generation cache does not handle files without file tail
[ https://issues.apache.org/jira/browse/HIVE-17404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth Jayachandran updated HIVE-17404:
-----------------------------------------
    Attachment: HIVE-17404.2.patch

> Orc split generation cache does not handle files without file tail
> ------------------------------------------------------------------
>
>                 Key: HIVE-17404
>                 URL: https://issues.apache.org/jira/browse/HIVE-17404
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.4.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Aditya Shah
>            Priority: Critical
>         Attachments: HIVE-17404.2.patch, HIVE-17404.patch
>
>
> Some old files do not have Orc FileTail. If file tail does not exist, split generation should fallback to old way of storing footers.
> This can result in exceptions like below
> {code}
> ORC split generation failed with exception: Malformed ORC file. Invalid postscript length 9
>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1735)
>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1822)
>   at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:450)
>   at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:569)
>   at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
>   at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>   at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>   at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
>   at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.orc.FileFormatException: Malformed ORC file. Invalid postscript length 9
>   at org.apache.orc.impl.ReaderImpl.ensureOrcFooter(ReaderImpl.java:297)
>   at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:470)
>   at org.apache.hadoop.hive.ql.io.orc.LocalCache.getAndValidate(LocalCache.java:103)
>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.getSplits(OrcInputFormat.java:804)
>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.runGetSplitsSync(OrcInputFormat.java:922)
>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.generateSplitWork(OrcInputFormat.java:891)
>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.scheduleSplits(OrcInputFormat.java:1763)
>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1707)
>   ... 15 more
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (HIVE-17404) Orc split generation cache does not handle files without file tail
[ https://issues.apache.org/jira/browse/HIVE-17404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Shah updated HIVE-17404:
-------------------------------
          Attachment: HIVE-17404.patch
    Target Version/s:   (was: 3.0.0, 2.4.0)
              Status: Patch Available  (was: Open)

I have submitted a patch that adds a check for the ORC bytes in the OrcTail before putting it in the local cache.

This issue arose because HIVE-16133 minimized the tail data stored in the cache. As a result, using a cached entry calls extractTails, which rebuilds the OrcTail. The rebuild runs the footer check, and that check throws an error. For old ORC files, when the tail is not present, we check the head of the file for the “ORC” text; but when only a tail is available, as in this call, the check throws an exception.

cc [~prasanth_j] [~rajesh.balamohan] [~andrewom]

> Orc split generation cache does not handle files without file tail
> ------------------------------------------------------------------
>
>                 Key: HIVE-17404
>                 URL: https://issues.apache.org/jira/browse/HIVE-17404
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.4.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Aditya Shah
>            Priority: Critical
>         Attachments: HIVE-17404.patch
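The check described in the comment above (validate the ORC bytes before caching a tail) could look roughly like the following. This is a minimal sketch, not the code from HIVE-17404.patch: the class TailCacheSketch and its methods are hypothetical, it uses only the JDK, and it assumes the serialized tail ends with the postscript followed by one length byte, with the "ORC" magic as the last bytes of the postscript in newer files.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Hive's LocalCache or ORC's ReaderImpl.
class TailCacheSketch {
    static final byte[] MAGIC = "ORC".getBytes(StandardCharsets.US_ASCII);

    private final Map<String, ByteBuffer> cache = new HashMap<>();

    // Returns true when the buffer ends with: "ORC" + one postscript-length byte.
    // Assumption: this is where newer files carry the magic; older files carry
    // it only at the head of the file, which a tail-only buffer cannot show.
    static boolean tailHasMagic(ByteBuffer tail) {
        int fullLength = MAGIC.length + 1;
        if (tail.remaining() < fullLength) {
            return false;
        }
        int start = tail.limit() - fullLength;
        for (int i = 0; i < MAGIC.length; i++) {
            if (tail.get(start + i) != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Cache a tail only if it can be re-validated later without the file head.
    boolean putIfSelfContained(String fileKey, ByteBuffer tail) {
        if (!tailHasMagic(tail)) {
            return false; // caller falls back to reading the footer from the file
        }
        cache.put(fileKey, tail.duplicate());
        return true;
    }

    // Returns the cached tail, or null if none was stored for this key.
    ByteBuffer get(String fileKey) {
        return cache.get(fileKey);
    }

    public static void main(String[] args) {
        TailCacheSketch cache = new TailCacheSketch();
        ByteBuffer withMagic = ByteBuffer.wrap(new byte[] {10, 20, 'O', 'R', 'C', 5});
        ByteBuffer withoutMagic = ByteBuffer.wrap(new byte[] {10, 20, 30, 40, 50, 5});
        System.out.println(cache.putIfSelfContained("new.orc", withMagic));    // true
        System.out.println(cache.putIfSelfContained("old.orc", withoutMagic)); // false
    }
}
```

Skipping the cache for tails that lack the magic keeps the slower path, which has access to the whole file, responsible for old files, so a tail-only rebuild never has to look at the file head.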
[jira] [Updated] (HIVE-17404) Orc split generation cache does not handle files without file tail
[ https://issues.apache.org/jira/browse/HIVE-17404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth Jayachandran updated HIVE-17404:
-----------------------------------------
    Priority: Critical  (was: Blocker)

> Orc split generation cache does not handle files without file tail
> ------------------------------------------------------------------
>
>                 Key: HIVE-17404
>                 URL: https://issues.apache.org/jira/browse/HIVE-17404
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.4.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Critical
[jira] [Updated] (HIVE-17404) Orc split generation cache does not handle files without file tail
[ https://issues.apache.org/jira/browse/HIVE-17404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth Jayachandran updated HIVE-17404:
-----------------------------------------
    Summary: Orc split generation cache does not handle files without file tail  (was: Orc split generation cache does not handle files with file tail)

> Orc split generation cache does not handle files without file tail
> ------------------------------------------------------------------
>
>                 Key: HIVE-17404
>                 URL: https://issues.apache.org/jira/browse/HIVE-17404
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.4.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Blocker