The following might answer your question partially: the input key is not read from HDFS; it is auto-generated as the offset of the input value in the input file. I think that is (partially) why the HDFS bytes read are smaller than the HDFS bytes written.

On Mar 27, 2014 1:34 PM, "Kim Chew" <[email protected]> wrote:
> I am also wondering if, say, I have two identical timestamps, so they are
> going to be written to the same file. Does MultipleOutputs handle appending?
>
> Thanks.
>
> Kim
>
>
> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <[email protected]> wrote:
>
>> Have you checked the content of the files you write?
>>
>>
>> /th
>>
>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>> > I have a simple M/R job using a Mapper only, thus no reducer. The mapper
>> > reads a timestamp from the value, generates a path to the output file,
>> > and writes the key and value to the output file.
>> >
>> > The input file is a sequence file, not compressed, stored in HDFS;
>> > it has a size of 162.68 MB.
>> >
>> > The output is also written as a sequence file.
>> >
>> > However, after I ran my job, I have two output part files from the
>> > mapper. One has a size of 835.12 MB and the other has a size of
>> > 224.77 MB. So why is the total output size so much larger? Shouldn't
>> > it be more or less equal to the input's size of 162.68 MB, since I
>> > just write the key and value passed to the mapper to the output?
>> >
>> > Here is the mapper code snippet:
>> >
>> >     public void map(BytesWritable key, BytesWritable value, Context context)
>> >             throws IOException, InterruptedException {
>> >         long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
>> >         String tsStr = sdf.format(new Date(timestamp * 1000L));
>> >         // mos is a MultipleOutputs object.
>> >         mos.write(key, value, generateFileName(tsStr));
>> >     }
>> >
>> >     private String generateFileName(String key) {
>> >         return outputDir + "/" + key + "/raw-vectors";
>> >     }
>> >
>> > And here are the job outputs:
>> >
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
>> >     14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
>> >     14/03/27 11:00:56 INFO mapred.JobClient: File Output Format Counters
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Bytes Written=0
>> >     14/03/27 11:00:56 INFO mapred.JobClient: FileSystemCounters
>> >     14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_READ=171086386
>> >     14/03/27 11:00:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54272
>> >     14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1111374798
>> >     14/03/27 11:00:56 INFO mapred.JobClient: File Input Format Counters
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Bytes Read=170782415
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Map-Reduce Framework
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Map input records=547
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Physical memory (bytes) snapshot=166428672
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Spilled Records=0
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Total committed heap usage (bytes)=38351872
>> >     14/03/27 11:00:56 INFO mapred.JobClient: CPU time spent (ms)=20080
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1240104960
>> >     14/03/27 11:00:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=286
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Map output records=0
>> >
>> > TIA,
>> >
>> > Kim
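To illustrate the point in the reply at the top of the thread: with some input formats (TextInputFormat, for instance) the key handed to the mapper is not stored in the input file at all; the record reader synthesizes it as the byte offset of the value. Here is a toy sketch of that idea in plain Java — a hypothetical reader, not Hadoop's actual LineRecordReader:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OffsetKeyReader {
    /** Pairs a synthesized byte-offset key with the value read from the data. */
    public static final class Record {
        public final long key;     // not stored in the input: computed while reading
        public final String value;
        Record(long key, String value) { this.key = key; this.value = value; }
    }

    /** Reads newline-terminated values; each key is the value's byte offset. */
    public static List<Record> read(String data) {
        List<Record> records = new ArrayList<>();
        long offset = 0;
        for (String line : data.split("\n")) {
            records.add(new Record(offset, line));
            // advance past the value plus the '\n' that terminated it
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        for (Record r : read("abc\nde\nf\n")) {
            System.out.println(r.key + "\t" + r.value);
        }
    }
}
```

Since the keys never exist on disk on the read side but do get serialized on the write side, they count toward HDFS_BYTES_WRITTEN without adding anything to HDFS_BYTES_READ.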
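For anyone who wants to sanity-check the mapper's file-name logic outside Hadoop, here is a minimal standalone sketch. The bytesToInt helper and the sdf date pattern are assumptions (the original post does not show them); this version reads a 4-byte big-endian int and formats the date as yyyy-MM-dd in UTC:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PathGen {
    // Assumption: the original post's sdf pattern is not shown; yyyy-MM-dd in UTC is a guess.
    static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
    static { sdf.setTimeZone(TimeZone.getTimeZone("UTC")); }

    // Assumption: bytesToInt reads a 4-byte big-endian int starting at offset.
    static long bytesToInt(byte[] b, int offset) {
        return ((b[offset] & 0xFFL) << 24)
             | ((b[offset + 1] & 0xFFL) << 16)
             | ((b[offset + 2] & 0xFFL) << 8)
             |  (b[offset + 3] & 0xFFL);
    }

    // Mirrors the mapper's generateFileName: one subdirectory per timestamp string.
    static String generateFileName(String outputDir, String key) {
        return outputDir + "/" + key + "/raw-vectors";
    }

    public static void main(String[] args) {
        // 1395878400 seconds since the epoch = 2014-03-27 00:00:00 UTC
        byte[] value = {0x53, 0x33, 0x6A, 0x00};
        long ts = bytesToInt(value, 0);
        String tsStr = sdf.format(new Date(ts * 1000L));
        System.out.println(generateFileName("/out", tsStr)); // prints /out/2014-03-27/raw-vectors
    }
}
```

Note that every record whose decoded timestamp falls on the same day resolves to the same path, which is exactly why the appending question above matters.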
