The following might answer your question partially: the input key is not read from HDFS; it is auto-generated as the offset of the input value in the input file. I think that is (partially) why the HDFS bytes read are smaller than the HDFS bytes written.

On Mar 27, 2014 1:34 PM, "Kim Chew" <[email protected]> wrote:
> I am also wondering if, say, I have two identical timestamps, so they are
> going to be written to the same file. Does MultipleOutputs handle appending?
>
> Thanks.
>
> Kim
>
>
> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <[email protected]> wrote:
>
>> Have you checked the content of the files you write?
>>
>>
>> /th
>>
>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>> > I have a simple M/R job using a Mapper only, thus no reducer. The mapper
>> > reads a timestamp from the value, generates a path to the output file,
>> > and writes the key and value to the output file.
>> >
>> > The input file is a sequence file, not compressed, stored in HDFS;
>> > it has a size of 162.68 MB.
>> >
>> > The output is also written as a sequence file.
>> >
>> > However, after I ran my job, I have two output part files from the
>> > mapper. One has a size of 835.12 MB and the other has a size of
>> > 224.77 MB. So why is the total output size so much larger? Shouldn't
>> > it be more or less equal to the input's size of 162.68 MB, since I
>> > just write the key and value passed to the mapper to the output?
>> >
>> > Here is the mapper code snippet:
>> >
>> >     public void map(BytesWritable key, BytesWritable value, Context context)
>> >             throws IOException, InterruptedException {
>> >         long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
>> >         String tsStr = sdf.format(new Date(timestamp * 1000L));
>> >         // mos is a MultipleOutputs object.
>> >         mos.write(key, value, generateFileName(tsStr));
>> >     }
>> >
>> >     private String generateFileName(String key) {
>> >         return outputDir + "/" + key + "/raw-vectors";
>> >     }
>> >
>> > And here are the job outputs:
>> >
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
>> >     14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
>> >     14/03/27 11:00:56 INFO mapred.JobClient: File Output Format Counters
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Bytes Written=0
>> >     14/03/27 11:00:56 INFO mapred.JobClient: FileSystemCounters
>> >     14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_READ=171086386
>> >     14/03/27 11:00:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54272
>> >     14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1111374798
>> >     14/03/27 11:00:56 INFO mapred.JobClient: File Input Format Counters
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Bytes Read=170782415
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Map-Reduce Framework
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Map input records=547
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Physical memory (bytes) snapshot=166428672
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Spilled Records=0
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Total committed heap usage (bytes)=38351872
>> >     14/03/27 11:00:56 INFO mapred.JobClient: CPU time spent (ms)=20080
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1240104960
>> >     14/03/27 11:00:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=286
>> >     14/03/27 11:00:56 INFO mapred.JobClient: Map output records=0
>> >
>> > TIA,
>> >
>> > Kim
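To illustrate the point in the reply at the top of the thread: with some input formats (TextInputFormat, for instance) the key handed to the mapper is not stored in the input file at all; the record reader synthesizes it as the byte offset of the value. Here is a toy sketch of that idea in plain Java — a hypothetical reader, not Hadoop's actual LineRecordReader:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OffsetKeyReader {
    /** Pairs a synthesized byte-offset key with the value read from the data. */
    public static final class Record {
        public final long key;     // not stored in the input: computed while reading
        public final String value;
        Record(long key, String value) { this.key = key; this.value = value; }
    }

    /** Reads newline-terminated values; each key is the value's byte offset. */
    public static List<Record> read(String data) {
        List<Record> records = new ArrayList<>();
        long offset = 0;
        for (String line : data.split("\n")) {
            records.add(new Record(offset, line));
            // advance past the value plus the '\n' that terminated it
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        for (Record r : read("abc\nde\nf\n")) {
            System.out.println(r.key + "\t" + r.value);
        }
    }
}
```

Since the keys never exist on disk on the read side but do get serialized on the write side, they count toward HDFS_BYTES_WRITTEN without adding anything to HDFS_BYTES_READ.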
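For anyone who wants to sanity-check the mapper's file-name logic outside Hadoop, here is a minimal standalone sketch. The bytesToInt helper and the sdf date pattern are assumptions (the original post does not show them); this version reads a 4-byte big-endian int and formats the date as yyyy-MM-dd in UTC:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PathGen {
    // Assumption: the original post's sdf pattern is not shown; yyyy-MM-dd in UTC is a guess.
    static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
    static { sdf.setTimeZone(TimeZone.getTimeZone("UTC")); }

    // Assumption: bytesToInt reads a 4-byte big-endian int starting at offset.
    static long bytesToInt(byte[] b, int offset) {
        return ((b[offset] & 0xFFL) << 24)
             | ((b[offset + 1] & 0xFFL) << 16)
             | ((b[offset + 2] & 0xFFL) << 8)
             |  (b[offset + 3] & 0xFFL);
    }

    // Mirrors the mapper's generateFileName: one subdirectory per timestamp string.
    static String generateFileName(String outputDir, String key) {
        return outputDir + "/" + key + "/raw-vectors";
    }

    public static void main(String[] args) {
        // 1395878400 seconds since the epoch = 2014-03-27 00:00:00 UTC
        byte[] value = {0x53, 0x33, 0x6A, 0x00};
        long ts = bytesToInt(value, 0);
        String tsStr = sdf.format(new Date(ts * 1000L));
        System.out.println(generateFileName("/out", tsStr)); // prints /out/2014-03-27/raw-vectors
    }
}
```

Note that every record whose decoded timestamp falls on the same day resolves to the same path, which is exactly why the appending question above matters.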
