[ https://issues.apache.org/jira/browse/HADOOP-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qiming He reopened HADOOP-9442:
-------------------------------

See my comments.

> Splitting issue when using NLineInputFormat with compression
> ------------------------------------------------------------
>
>                 Key: HADOOP-9442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9442
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 1.1.2
>         Environment: Tried in Apache Hadoop 1.1.1, CDH4, and Amazon EMR. Same result.
>            Reporter: Qiming He
>            Priority: Minor
>
> # Make a long text line. It seems that only long-line text causes the issue.
> $ cat abook.txt | base64 -w 0 > onelinetext.b64   # 200KB+ long
> $ hadoop fs -put onelinetext.b64 /input/onelinetext.b64
> $ hadoop jar hadoop-streaming.jar \
>     -input /input/onelinetext.b64 \
>     -output /output \
>     -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
>     -mapper wc
>
> Num task: 1, and the output has one line:
> Line 1: 1 2 202699
> which makes sense, because one line per mapper is intended.
>
> Then, using compression with NLineInputFormat:
> $ gzip onelinetext.b64
> $ hadoop fs -put onelinetext.b64.gz /input/onelinetext.b64.gz
> $ hadoop jar hadoop-streaming.jar \
>     -Dmapred.input.compress=true \
>     -Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
>     -input /input/onelinetext.b64.gz \
>     -output /output \
>     -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
>     -mapper wc
>
> I am expecting the same result as above, because decompression should occur
> before the one-line text is processed (i.e. by wc). However, I am getting:
> Num task: 397 (or another large number, depending on the environment), and the
> output has 397 lines:
> Line 1-396: 0 0 0
> Line 397: 1 2 202699
>
> Any idea why mapred.map.tasks is so much greater than 1? Is the splitting
> incorrect? I purposely chose gzip because I believe it is NOT splittable. I got
> similar results when using the bzip2 and lzop codecs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
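
One way to probe the question in the report is to compare how many newline bytes a one-line text contains before and after gzip compression. The sketch below is plain Python and does not touch Hadoop; it rests on an unverified assumption that the split count might come from counting line terminators in the raw compressed bytes rather than in the decompressed text. The helper names (`make_one_line_text`, `count_newline_bytes`) and the 150,000-byte payload are made up for illustration and merely stand in for `abook.txt`.

```python
import base64
import gzip
import random


def make_one_line_text(n_bytes: int, seed: int = 0) -> bytes:
    """Return one long base64 line (similar to `base64 -w 0`) plus a final newline.

    Hypothetical helper: seeded pseudo-random bytes stand in for abook.txt.
    """
    rng = random.Random(seed)
    raw = bytes(rng.getrandbits(8) for _ in range(n_bytes))
    # b64encode never inserts line breaks, so the result is a single line.
    return base64.b64encode(raw) + b"\n"


def count_newline_bytes(data: bytes) -> int:
    """Count 0x0A bytes, i.e. what a naive line counter would call line ends."""
    return data.count(b"\n")


text = make_one_line_text(150_000)   # roughly 200 KB of base64 text
compressed = gzip.compress(text)     # in-memory equivalent of `gzip file`

plain_lines = count_newline_bytes(text)
raw_compressed_lines = count_newline_bytes(compressed)
decompressed_lines = count_newline_bytes(gzip.decompress(compressed))

print("newlines in plain text:        ", plain_lines)
print("newlines in raw gzip bytes:    ", raw_compressed_lines)
print("newlines after decompression:  ", decompressed_lines)
```

If the assumption holds, the count for the raw gzip bytes would play the role of the unexpectedly large task number, while the plain and decompressed counts stay at one. The exact figure depends on the input and the codec, so it is not expected to match 397.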