Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

Kim Chew Fri, 28 Mar 2014 11:39:48 -0700

None of that.

I checked the input file's SequenceFile header and it says
"org.apache.hadoop.io.compress.zlib.BuiltInZlibDeflater"


Kim
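
P.S. In case it helps anyone else check their own file without the Hadoop
jars: below is a minimal sketch, assuming the SequenceFile header layout of
version 5 or later (magic "SEQ", a version byte, the key and value class
names, two compression flags, then the codec class name when compression is
on). SeqHeaderPeek and its fields are names made up for illustration, not a
Hadoop API, and the sketch only handles class names shorter than 128 bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper: decodes the leading fields of a SequenceFile header. */
final class SeqHeaderPeek {
    final int version;
    final String keyClass;
    final String valueClass;
    final boolean compressed;
    final boolean blockCompressed;
    final String codecClass; // null when the header says "not compressed"

    private SeqHeaderPeek(int version, String keyClass, String valueClass,
                          boolean compressed, boolean blockCompressed, String codecClass) {
        this.version = version;
        this.keyClass = keyClass;
        this.valueClass = valueClass;
        this.compressed = compressed;
        this.blockCompressed = blockCompressed;
        this.codecClass = codecClass;
    }

    /** Parses the first bytes of a SequenceFile; throws if they do not look like one. */
    static SeqHeaderPeek parse(byte[] head) {
        ByteBuffer buf = ByteBuffer.wrap(head);
        if (buf.get() != 'S' || buf.get() != 'E' || buf.get() != 'Q') {
            throw new IllegalArgumentException("not a SequenceFile");
        }
        int version = buf.get();
        if (version < 5) {
            // Assumption: older headers have no codec class name, so skip them here.
            throw new IllegalArgumentException("header version " + version + " not handled");
        }
        String keyClass = readShortString(buf);
        String valueClass = readShortString(buf);
        boolean compressed = buf.get() != 0;
        boolean blockCompressed = buf.get() != 0;
        String codecClass = compressed ? readShortString(buf) : null;
        return new SeqHeaderPeek(version, keyClass, valueClass,
                compressed, blockCompressed, codecClass);
    }

    // Strings are length-prefixed UTF-8; a length under 128 fits in one byte.
    private static String readShortString(ByteBuffer buf) {
        int len = buf.get();
        if (len < 0) {
            throw new IllegalArgumentException("multi-byte length not handled by this sketch");
        }
        byte[] bytes = new byte[len];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: a local copy of the SequenceFile (its first few KB are enough).
        byte[] head = new byte[4096];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n = Math.max(in.read(head), 0);
            SeqHeaderPeek h = parse(Arrays.copyOf(head, n));
            System.out.println("compressed=" + h.compressed
                    + " blockCompressed=" + h.blockCompressed
                    + " codec=" + h.codecClass);
        }
    }
}
```

If "compressed" comes back true, the file size on HDFS is the compressed
size, so comparing it with uncompressed output will be misleading.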


On Fri, Mar 28, 2014 at 10:34 AM, Hardik Pandya <[email protected]> wrote:

> What is your compression format: gzip, LZO, or Snappy?
>
> For LZO-compressed final output:
>
> FileOutputFormat.setCompressOutput(conf, true);
> FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);
>
> In addition, to make LZO splittable, you need to make an LZO index file.
>
>
> On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew <[email protected]> wrote:
>
>> Thanks folks.
>>
>> I was not aware that my input data file had been compressed.
>> FileOutputFormat.setCompressOutput() was set to true when the file was
>> written. 8-(
>>
>> Kim
>>
>>
>> On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead <[email protected]> wrote:
>>
>>> The following might answer you partially:
>>>
>>> Input key is not read from HDFS, it is auto generated as the offset of
>>> the input value in the input file. I think that is (partially) why read
>>> hdfs bytes is smaller than written hdfs bytes.
>>>  On Mar 27, 2014 1:34 PM, "Kim Chew" <[email protected]> wrote:
>>>
>>>> I am also wondering: if, say, I have two identical timestamps, they are
>>>> going to be written to the same file. Does MultipleOutputs handle
>>>> appending?
>>>>
>>>> Thanks.
>>>>
>>>> Kim
>>>>
>>>>
>>>> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <[email protected]> wrote:
>>>>
>>>>> Have you checked the content of the files you write?
>>>>>
>>>>>
>>>>> /th
>>>>>
>>>>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>>>>> > I have a simple M/R job that uses a Mapper only, so there is no
>>>>> > reducer. The mapper reads a timestamp from the value, generates a path
>>>>> > to the output file, and writes the key and value to that file.
>>>>> >
>>>>> >
>>>>> > The input file is a sequence file, not compressed and stored in
>>>>> > HDFS. It has a size of 162.68 MB.
>>>>> >
>>>>> >
>>>>> > The output is also written as a sequence file.
>>>>> >
>>>>> >
>>>>> >
>>>>> > However, after I ran my job, I have two output part files from the
>>>>> > mapper. One has a size of 835.12 MB and the other has a size of
>>>>> > 224.77 MB. So why is the total output size so much larger? Shouldn't
>>>>> > it be more or less equal to the input's size of 162.68 MB, since I
>>>>> > just write the key and value passed to the mapper to the output?
>>>>> >
>>>>> >
>>>>> > Here is the mapper code snippet,
>>>>> >
>>>>> > public void map(BytesWritable key, BytesWritable value, Context context)
>>>>> >         throws IOException, InterruptedException {
>>>>> >
>>>>> >     // The timestamp appears to be in seconds, hence the * 1000L below.
>>>>> >     long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
>>>>> >     String tsStr = sdf.format(new Date(timestamp * 1000L));
>>>>> >
>>>>> >     // mos is a MultipleOutputs object.
>>>>> >     mos.write(key, value, generateFileName(tsStr));
>>>>> > }
>>>>> >
>>>>> > private String generateFileName(String key) {
>>>>> >     return outputDir + "/" + key + "/raw-vectors";
>>>>> > }
>>>>> >
>>>>> >
>>>>> > And here are the job outputs,
>>>>> >
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format Counters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1111374798
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes) snapshot=166428672
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap usage (bytes)=38351872
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1240104960
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
>>>>> >
>>>>> >
>>>>> > TIA,
>>>>> >
>>>>> >
>>>>> > Kim
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>
>>
>
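
For reference, the two HDFS counters in the original post quoted above can be
turned into sizes and an expansion factor with a few lines of arithmetic.
This is only a sketch: CounterRatio and its method names are made up for
illustration, and "MB" is assumed to mean 1024 * 1024 bytes, as in the file
sizes reported in the post.

```java
/** Hypothetical helper: relates HDFS_BYTES_READ to HDFS_BYTES_WRITTEN. */
final class CounterRatio {
    // Values copied from the job counters in the original post.
    static final long HDFS_BYTES_READ = 171_086_386L;
    static final long HDFS_BYTES_WRITTEN = 1_111_374_798L;

    /** Converts a byte count to megabytes of 1024 * 1024 bytes. */
    static double toMegabytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    /** How many bytes were written for every byte read. */
    static double expansion(long bytesRead, long bytesWritten) {
        return (double) bytesWritten / bytesRead;
    }

    public static void main(String[] args) {
        System.out.printf("read:      %.2f MB%n", toMegabytes(HDFS_BYTES_READ));
        System.out.printf("written:   %.2f MB%n", toMegabytes(HDFS_BYTES_WRITTEN));
        System.out.printf("expansion: %.2fx%n",
                expansion(HDFS_BYTES_READ, HDFS_BYTES_WRITTEN));
    }
}
```

The written total comes to about 1059.89 MB, which matches the two part files
(835.12 MB + 224.77 MB), and the expansion is roughly 6.5x. That is the kind
of ratio one might expect if deflate-compressed input is rewritten without
compression, though the counters alone do not prove it.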
