[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371926#comment-16371926 ]

Sean Mackrory commented on HADOOP-6852:
---------------------------------------

Committed - thanks for getting to the bottom of this long-standing issue!

> apparent bug in concatenated-bzip2 support (decoding)
> -----------------------------------------------------
>
>                 Key: HADOOP-6852
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6852
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.22.0
>         Environment: Linux x86_64 running 32-bit Hadoop, JDK 1.6.0_15
>            Reporter: Greg Roelofs
>            Assignee: Zsolt Venczel
>            Priority: Major
>             Fix For: 3.2.0
>
>         Attachments: HADOOP-6852.01.patch, HADOOP-6852.02.patch, HADOOP-6852.03.patch, HADOOP-6852.04.patch
>
>
> The following simplified code (manually picked out of testMoreBzip2() in
> https://issues.apache.org/jira/secure/attachment/12448272/HADOOP-6835.v4.trunk-hadoop-mapreduce.patch)
> triggers a "java.io.IOException: bad block header" in
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:527):
> {noformat}
> JobConf jobConf = new JobConf(defaultConf);
> CompressionCodec bzip2 = new BZip2Codec();
> ReflectionUtils.setConf(bzip2, jobConf);
> localFs.delete(workDir, true);
>
> // copy multiple-member test file to HDFS
> String fn2 = "testCompressThenConcat.txt" + bzip2.getDefaultExtension();
> Path fnLocal2 = new Path(System.getProperty("test.concat.data", "/tmp"), fn2);
> Path fnHDFS2 = new Path(workDir, fn2);
> localFs.copyFromLocalFile(fnLocal2, fnHDFS2);
> FileInputFormat.setInputPaths(jobConf, workDir);
>
> final FileInputStream in2 = new FileInputStream(fnLocal2.toString());
> CompressionInputStream cin2 = bzip2.createInputStream(in2);
> LineReader in = new LineReader(cin2);
> Text out = new Text();
> int numBytes, totalBytes = 0, lineNum = 0;
> while ((numBytes = in.readLine(out)) > 0) {
>   ++lineNum;
>   totalBytes += numBytes;
> }
> in.close();
> {noformat}
> The specified file is also included in the H-6835 patch linked above, and
> some additional debug output is included in the commented-out test loop
> above. (Only in the linked, "v4" version of the patch, however--I'm about to
> remove the debug stuff for checkin.)
> It's possible I've done something completely boneheaded here, but the file,
> at least, checks out in a subsequent set of subtests and with stock bzip2
> itself. Only the code above is problematic; it reads through the first
> concatenated chunk (17 lines of text) just fine but chokes on the header of
> the second one. Altogether, the test file contains 84 lines of text and 4
> concatenated bzip2 files.
> (It's possible this is a mapreduce issue rather than common, but note that
> the identical gzip test works fine. Possibly it's related to the
> stream-vs-decompressor dichotomy, though; intentionally not supported?)

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
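The multi-member behavior the description expects from the bzip2 codec is the same one the JDK's gzip decoder already exhibits (which is why the "identical gzip test works fine"). The following self-contained sketch uses java.util.zip rather than the Hadoop codec; the class name `ConcatDemo` and the member strings are illustrative, not from the issue:

```java
import java.io.*;
import java.util.zip.*;

// Sketch (java.util.zip, not the Hadoop codec): a reader over concatenated
// compressed members should keep decoding past each member's trailer,
// as GZIPInputStream does for multi-member gzip input.
public class ConcatDemo {
    // compress one string into a single gzip member
    public static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes("UTF-8"));
        }
        return bos.toByteArray();
    }

    // count lines readable from a (possibly multi-member) gzip byte stream
    public static int countLines(byte[] data) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data)), "UTF-8"));
        int lines = 0;
        while (r.readLine() != null) {
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream concat = new ByteArrayOutputStream();
        concat.write(gzip("first member\n"));   // member 1
        concat.write(gzip("second member\n"));  // member 2
        // both members decode; a decoder that stops at the first member's
        // end-of-stream marker (the reported bzip2 symptom) would see only one line
        System.out.println("lines = " + countLines(concat.toByteArray()));
    }
}
```

The reported bug is exactly the asymmetry this illustrates: the stream returned by `BZip2Codec.createInputStream` stopped at the first member's end-of-stream block instead of probing for another member header.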
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371951#comment-16371951 ]

Hudson commented on HADOOP-6852:
--------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13698 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/13698/])
HADOOP-6852. apparent bug in concatenated-bzip2 support (decoding). (mackrorysd: rev 2bc3351eaf240ea685bcf5042d79f1554bf89e00)
* (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/BZip2Codec.java
* (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestConcatenatedCompressedInput.java
* (add) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/resources/testdata/testConcatThenCompress.txt.gz
* (add) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/resources/testdata/concat.bz2
* (add) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/resources/testdata/testCompressThenConcat.txt.gz
* (add) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/resources/testdata/concat.gz
* (add) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/resources/testdata/testConcatThenCompress.txt.bz2
* (add) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/resources/testdata/testCompressThenConcat.txt.bz2
* (edit) hadoop-client-modules/hadoop-client-minicluster/pom.xml
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371917#comment-16371917 ]

Zsolt Venczel commented on HADOOP-6852:
---------------------------------------

Thanks for checking [~mackrorysd]! The binary files were first added into git by commit a196766ea07775f18ded69bd9e8d239f8cfd3ccc, as part of the restructuring described by HADOOP-7106. I assume they were present in the MR SVN repository before that.
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371694#comment-16371694 ]

Sean Mackrory commented on HADOOP-6852:
---------------------------------------

Actually one last thing: I'm not big on the idea of binary files checked into source control. Where did these come from? Since these are tests that were commented out at some point, I suspect you just restored them from before, but we should make sure that's documented. Ideally we'd document the source and how they were generated. I wonder whether generating them at run-time would let problems go unnoticed, because we'd be testing freshly compressed files rather than files compressed with an old implementation. Both should work. So overall I'm okay committing this as-is, but I'd like to document where the binaries came from.
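A hedged sketch of the kind of generation recipe the comment asks to have documented. The file names below are hypothetical illustrations, not the actual provenance of the checked-in fixtures; the point is that stock bzip2 produces a multi-member file by plain concatenation, and decodes all members of such a file:

```shell
# Hypothetical regeneration recipe (NOT the documented origin of the
# checked-in test binaries). Each bzip2 invocation emits one complete
# member; appending members yields a multi-member .bz2 file.
printf 'first member\n'  > part1.txt
printf 'second member\n' > part2.txt

bzip2 -c part1.txt  > concat.bz2
bzip2 -c part2.txt >> concat.bz2

# stock bzip2 reads through every member, which is the reference
# behavior the Hadoop codec is expected to match
bzip2 -dc concat.bz2
```

This also illustrates the reviewer's concern: fixtures generated at run-time would always come from the current compressor, so a regression against files produced by an older implementation (or by stock bzip2 itself) could go unnoticed.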
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371680#comment-16371680 ]

Sean Mackrory commented on HADOOP-6852:
---------------------------------------

Thanks for fixing the checkstyle issue. The test failure appears unrelated to me - I ran the tests with and without your patch locally and saw no difference, except that tests that used to be commented out now work. Will commit today.
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371568#comment-16371568 ]

genericqa commented on HADOOP-6852:
-----------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|  0 | reexec | 0m 22s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 7 new or modified test files. |
|| || || || trunk Compile Tests ||
|  0 | mvndep | 1m 24s | Maven dependency ordering for branch |
| +1 | mvninstall | 18m 53s | trunk passed |
| +1 | compile | 22m 39s | trunk passed |
| +1 | checkstyle | 2m 35s | trunk passed |
| +1 | mvnsite | 3m 17s | trunk passed |
| +1 | shadedclient | 17m 49s | branch has no errors when building and testing our client artifacts. |
|  0 | findbugs | 0m 0s | Skipped patched modules with no Java source: hadoop-client-modules/hadoop-client-minicluster |
| +1 | findbugs | 2m 40s | trunk passed |
| +1 | javadoc | 1m 58s | trunk passed |
|| || || || Patch Compile Tests ||
|  0 | mvndep | 0m 22s | Maven dependency ordering for patch |
| +1 | mvninstall | 5m 8s | the patch passed |
| +1 | compile | 18m 20s | the patch passed |
| +1 | javac | 18m 20s | the patch passed |
| +1 | checkstyle | 2m 36s | root: The patch generated 0 new + 69 unchanged - 11 fixed = 69 total (was 80) |
| +1 | mvnsite | 3m 4s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 2s | The patch has no ill-formed XML file. |
| +1 | shadedclient | 11m 44s | patch has no errors when building and testing our client artifacts. |
|  0 | findbugs | 0m 0s | Skipped patched modules with no Java source: hadoop-client-modules/hadoop-client-minicluster |
| +1 | findbugs | 3m 7s | the patch passed |
| +1 | javadoc | 2m 8s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 10m 55s | hadoop-common in the patch passed. |
| -1 | unit | 139m 40s | hadoop-mapreduce-client-jobclient in the patch failed. |
| +1 | unit | 0m 39s | hadoop-client-minicluster in the patch passed. |
| +1 | asflicense | 0m 45s | The patch does not generate ASF License warnings. |
| | | 267m 8s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365109#comment-16365109 ]

Sean Mackrory commented on HADOOP-6852:
---------------------------------------

So after much review and reading the code, I'm a +1. I'm not an expert on the compression code, though, so I'm gonna hold off on committing for 1 more day in case anyone else watching the issue wants to chime in. I'll re-run tests tomorrow on latest trunk as well.
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354517#comment-16354517 ]

Zsolt Venczel commented on HADOOP-6852:
---------------------------------------

The above two unit test failures seem to be unrelated to the provided patch.
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354289#comment-16354289 ]

genericqa commented on HADOOP-6852:
-----------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|  0 | reexec | 14m 59s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 7 new or modified test files. |
|| || || || trunk Compile Tests ||
|  0 | mvndep | 0m 19s | Maven dependency ordering for branch |
| +1 | mvninstall | 16m 37s | trunk passed |
| +1 | compile | 13m 31s | trunk passed |
| +1 | checkstyle | 2m 7s | trunk passed |
| +1 | mvnsite | 2m 11s | trunk passed |
| +1 | shadedclient | 15m 0s | branch has no errors when building and testing our client artifacts. |
|  0 | findbugs | 0m 0s | Skipped patched modules with no Java source: hadoop-client-modules/hadoop-client-minicluster |
| +1 | findbugs | 1m 59s | trunk passed |
| +1 | javadoc | 1m 40s | trunk passed |
|| || || || Patch Compile Tests ||
|  0 | mvndep | 0m 15s | Maven dependency ordering for patch |
| +1 | mvninstall | 3m 55s | the patch passed |
| +1 | compile | 12m 16s | the patch passed |
| +1 | javac | 12m 16s | the patch passed |
| -0 | checkstyle | 2m 6s | root: The patch generated 1 new + 69 unchanged - 11 fixed = 70 total (was 80) |
| +1 | mvnsite | 2m 7s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 2s | The patch has no ill-formed XML file. |
| +1 | shadedclient | 10m 24s | patch has no errors when building and testing our client artifacts. |
|  0 | findbugs | 0m 0s | Skipped patched modules with no Java source: hadoop-client-modules/hadoop-client-minicluster |
| +1 | findbugs | 2m 19s | the patch passed |
| +1 | javadoc | 2m 16s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 9m 22s | hadoop-common in the patch passed. |
| -1 | unit | 127m 5s | hadoop-mapreduce-client-jobclient in the patch failed. |
| +1 | unit | 0m 31s | hadoop-client-minicluster in the patch passed. |
| +1 | asflicense | 0m 42s | The patch does not generate ASF License warnings. |
| | | 239m 46s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.mapreduce.v2.TestUberAM | |
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16353881#comment-16353881 ] genericqa commented on HADOOP-6852: --- (x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 24s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 7 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 1m 27s | Maven dependency ordering for branch |
| +1 | mvninstall | 18m 31s | trunk passed |
| -1 | compile | 18m 23s | root in trunk failed. |
| +1 | checkstyle | 2m 16s | trunk passed |
| +1 | mvnsite | 2m 13s | trunk passed |
| +1 | shadedclient | 15m 26s | branch has no errors when building and testing our client artifacts. |
| 0 | findbugs | 0m 0s | Skipped patched modules with no Java source: hadoop-client-modules/hadoop-client-minicluster |
| +1 | findbugs | 1m 52s | trunk passed |
| +1 | javadoc | 1m 22s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 16s | Maven dependency ordering for patch |
| +1 | mvninstall | 4m 29s | the patch passed |
| +1 | compile | 17m 57s | the patch passed |
| -1 | javac | 17m 57s | root generated 183 new + 1054 unchanged - 0 fixed = 1237 total (was 1054) |
| -0 | checkstyle | 2m 24s | root: The patch generated 16 new + 71 unchanged - 9 fixed = 87 total (was 80) |
| +1 | mvnsite | 2m 34s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | shadedclient | 10m 25s | patch has no errors when building and testing our client artifacts. |
| 0 | findbugs | 0m 0s | Skipped patched modules with no Java source: hadoop-client-modules/hadoop-client-minicluster |
| +1 | findbugs | 2m 34s | the patch passed |
| +1 | javadoc | 1m 50s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 9m 18s | hadoop-common in the patch passed. |
| -1 | unit | 130m 15s | hadoop-mapreduce-client-jobclient in the patch failed. |
| +1 | unit | 1m 3s | hadoop-client-minicluster in the patch passed. |
| +1 | asflicense | 0m 56s | The patch does not generate ASF License warnings. |
| | | 243m 49s | |

|| Reason || Tests ||
| Failed junit tests | … |
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352392#comment-16352392 ] Zsolt Venczel commented on HADOOP-6852: --- In the attached patch I have restored the previously available test files, re-enabled the test cases, and added a proposed fix: in BZip2Codec.createInputStream(InputStream in, Decompressor decompressor) I changed the default read mode from CONTINUOUS to BYBLOCK. BYBLOCK read mode handles concatenated bzip2 correctly, and it is also consistent with the decompressor-creation logic in mapred/LineRecordReader and input/LineRecordReader. With this change the concatenated-bzip2 issue is fixed. > apparent bug in concatenated-bzip2 support (decoding) > - > > Key: HADOOP-6852 > URL: https://issues.apache.org/jira/browse/HADOOP-6852 > Project: Hadoop Common > Issue Type: Bug > Components: io >Affects Versions: 0.22.0 > Environment: Linux x86_64 running 32-bit Hadoop, JDK 1.6.0_15 >Reporter: Greg Roelofs >Assignee: Zsolt Venczel >Priority: Major > Attachments: HADOOP-6852.01.patch > > > The following simplified code (manually picked out of testMoreBzip2() in > https://issues.apache.org/jira/secure/attachment/12448272/HADOOP-6835.v4.trunk-hadoop-mapreduce.patch) > triggers a "java.io.IOException: bad block header" in > org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock( > CBZip2InputStream.java:527): > {noformat} > JobConf jobConf = new JobConf(defaultConf); > CompressionCodec bzip2 = new BZip2Codec(); > ReflectionUtils.setConf(bzip2, jobConf); > localFs.delete(workDir, true); > // copy multiple-member test file to HDFS > String fn2 = "testCompressThenConcat.txt" + bzip2.getDefaultExtension(); > Path fnLocal2 = new > Path(System.getProperty("test.concat.data","/tmp"),fn2); > Path fnHDFS2 = new Path(workDir, fn2); > localFs.copyFromLocalFile(fnLocal2, fnHDFS2); > FileInputFormat.setInputPaths(jobConf, workDir); > final
FileInputStream in2 = new FileInputStream(fnLocal2.toString()); > CompressionInputStream cin2 = bzip2.createInputStream(in2); > LineReader in = new LineReader(cin2); > Text out = new Text(); > int numBytes, totalBytes=0, lineNum=0; > while ((numBytes = in.readLine(out)) > 0) { > ++lineNum; > totalBytes += numBytes; > } > in.close(); > {noformat} > The specified file is also included in the H-6835 patch linked above, and > some additional debug output is included in the commented-out test loop > above. (Only in the linked, "v4" version of the patch, however--I'm about to > remove the debug stuff for checkin.) > It's possible I've done something completely boneheaded here, but the file, > at least, checks out in a subsequent set of subtests and with stock bzip2 > itself. Only the code above is problematic; it reads through the first > concatenated chunk (17 lines of text) just fine but chokes on the header of > the second one. Altogether, the test file contains 84 lines of text and 4 > concatenated bzip2 files. > (It's possible this is a mapreduce issue rather than common, but note that > the identical gzip test works fine. Possibly it's related to the > stream-vs-decompressor dichotomy, though; intentionally not supported?) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
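The behavior the proposed BYBLOCK fix restores, namely decoding every member of a multi-member bzip2 file instead of stopping after the first, matches what stock bzip2 tools already do (the reporter notes the test file "checks out ... with stock bzip2 itself"). As an illustrative sketch outside Hadoop, Python's bz2 module handles concatenated members transparently:

```python
import bz2

# Build a multi-member ("concatenated") bzip2 payload, like the output of
# `bzip2 a.txt b.txt; cat a.txt.bz2 b.txt.bz2` or of pbzip2.
member1 = bz2.compress(b"first member\n")
member2 = bz2.compress(b"second member\n")
concatenated = member1 + member2

# A conforming decoder keeps reading past the end-of-stream marker of the
# first member rather than reporting a bad block header on the second.
assert bz2.decompress(concatenated) == b"first member\nsecond member\n"
```

This is the same contract the identical gzip test in the issue description relies on: end of member does not mean end of file.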
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352394#comment-16352394 ] genericqa commented on HADOOP-6852: --- (x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | docker | 3m 54s | Docker failed to build yetus/hadoop:5b98639. |

|| Subsystem || Report/Notes ||
| JIRA Issue | HADOOP-6852 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12909222/HADOOP-6852.01.patch |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/14070/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938152#comment-13938152 ] Bob Tiernay commented on HADOOP-6852: - Just hit this bug. Are there any plans to fix it, or does the community need to add something to a project like Elephant Bird to work around the issue? Thanks.
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405813#comment-13405813 ] Iker Jimenez commented on HADOOP-6852: -- This is happening to us too when we compress files with pbzip2. Any ETA for a fix?
[jira] [Commented] (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137479#comment-13137479 ] Len Trigg commented on HADOOP-6852: --- We have been using the ant-based bzip2 library for our project and needed to be able to decompress concatenated bzip2 files. After poking around we came across the hadoop extensions and immediately found that they did not function correctly due to this bug. Essentially, when crossing block boundaries the skipToNextMarker method leaves the stream position at the end of the block delimiter, but initBlock expects to be at the beginning of the block delimiter. After looking at the poor structure of the initBlock method, and the thread-unsafety that has been introduced into this class with the numberOfBytesTillNextMarker() method, we decided to avoid the hadoop version of this class altogether.
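Len's observation is about decoder alignment: every bzip2 member (and, within it, every block) is located by a magic bit pattern, and a reader positioned past the delimiter instead of at its start sees garbage where it expects a header, which is exactly the "bad block header" IOException. As an illustrative sketch (Python, not Hadoop code): each member begins with the magic "BZh" plus a block-size digit, so a concatenated file carries one such header per member.

```python
import bz2

# Two complete bzip2 streams concatenated at a byte boundary.
# bz2.compress defaults to compression level 9, so each member header
# is the four bytes b"BZh9".
data = bz2.compress(b"x" * 100) + bz2.compress(b"y" * 100)

assert data.startswith(b"BZh9")
# The second member's header sits immediately after the first member's
# end-of-stream marker; a decoder that is misaligned here, even by a few
# bits, cannot recognize it and reports a bad header.
# (Counting b"BZh9" is only illustrative: the pattern could in principle
# also occur by coincidence inside compressed payload bytes.)
assert data.count(b"BZh9") >= 2
```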
[jira] Commented: (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986044#action_12986044 ] Todd Lipcon commented on HADOOP-6852: - Wouter: do you have HADOOP-6925 in your build? Worth trying that to see if your problem is the same as Greg's or the one we fixed in the other JIRA.
[jira] Commented: (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903063#action_12903063 ] Wouter de Bie commented on HADOOP-6852: --- I'm having the same problem. We write files from our logging application using the BZip2Codec, but our application rotates files on an hourly basis. In addition, each time we flush log lines we compress them, which in turn produces different bzip2 block sizes. In any case, we are also getting the 'bad block header' exception. Looking through some code and trying different things, we got things to work by changing the following in org.apache.hadoop.io.compress.bzip2.CBZip2InputStream:

{noformat}
public CBZip2InputStream(final InputStream in) throws IOException {
  this(in, READ_MODE.CONTINUOUS);
}
{noformat}

to

{noformat}
public CBZip2InputStream(final InputStream in) throws IOException {
  this(in, READ_MODE.BYBLOCK);
}
{noformat}

This causes CBZip2InputStream to use the block size in initBlock() instead of looking for the magic numbers. I'll be digging into the code more tomorrow, but to move forward quickly on this issue I would like to know: why is CONTINUOUS the default mode, and is there a place where the read mode is determined? BZip2 is block based, so why not default to that? I'll also try the above piece of code and see if it works with BYBLOCK.
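The idea behind Wouter's CONTINUOUS-to-BYBLOCK change is that the decoder must reset its state at each member boundary rather than assume one continuous stream. The same reset-at-boundary loop can be sketched with Python's incremental decompressor (an illustrative sketch, not Hadoop's implementation; the function name is mine): when one member reaches end-of-stream, start a fresh decompressor on the leftover bytes.

```python
import bz2

def decompress_concatenated(data: bytes) -> bytes:
    """Decode a multi-member bzip2 payload member by member."""
    out = bytearray()
    decomp = bz2.BZ2Decompressor()
    while data:
        out += decomp.decompress(data)
        if not decomp.eof:
            break  # all input consumed, current member still open
        # Member finished: reset the decoder state and continue with
        # the bytes belonging to the next member.
        data = decomp.unused_data
        decomp = bz2.BZ2Decompressor()
    return bytes(out)

payload = (bz2.compress(b"hourly log rotation\n")
           + bz2.compress(b"flush-compressed lines\n"))
assert decompress_concatenated(payload) == (
    b"hourly log rotation\nflush-compressed lines\n")
```

A decoder that never performs the reset step is the CONTINUOUS failure mode: it stops (or errors) at the first member's end-of-stream marker.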
[jira] Commented: (HADOOP-6852) apparent bug in concatenated-bzip2 support (decoding)
[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902268#action_12902268 ] Greg Roelofs commented on HADOOP-6852: -- Alas, it doesn't. :-(