Have you checked the content of the files you write?
/th
On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
> I have a simple map-only M/R job (no reducer). The mapper reads a
> timestamp from the value, generates a path to the output file,
> and writes the key and value to that file.
>
>
> The input is an uncompressed sequence file stored in HDFS; its size
> is 162.68 MB.
>
>
> The output is also written as a sequence file.
>
>
>
> However, after I ran my job, I have two output part files from the
> mapper. One has a size of 835.12 MB and the other has a size of 224.77
> MB. So why is the total output size so much larger? Shouldn't it
> be more or less equal to the input size of 162.68 MB, since I just
> write the key and value passed to the mapper to the output?
>
>
> Here is the mapper code snippet,
>
> public void map(BytesWritable key, BytesWritable value, Context context)
>         throws IOException, InterruptedException {
>
>     long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
>     String tsStr = sdf.format(new Date(timestamp * 1000L));
>
>     // mos is a MultipleOutputs object.
>     mos.write(key, value, generateFileName(tsStr));
> }
>
> private String generateFileName(String key) {
>     return outputDir + "/" + key + "/raw-vectors";
> }
>
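For reference, the path each record is routed to can be reproduced outside Hadoop. The sketch below only illustrates the logic of the quoted mapper. The big-endian bytesToInt, the yyyy-MM-dd date pattern, and the UTC time zone are assumptions, since the real bytesToInt, sdf, and TIMESTAMP_INDEX definitions are not shown in the message.

```java
import java.nio.ByteBuffer;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

class OutputPathSketch {
    // Hypothetical stand-in for the mapper's bytesToInt: reads a 4-byte
    // big-endian int starting at the given index.
    static int bytesToInt(byte[] bytes, int index) {
        return ByteBuffer.wrap(bytes, index, 4).getInt();
    }

    // Mirrors generateFileName from the quoted mapper.
    static String generateFileName(String outputDir, String key) {
        return outputDir + "/" + key + "/raw-vectors";
    }

    // Timestamp (seconds) -> formatted date -> output path.
    static String pathFor(String outputDir, byte[] value, int timestampIndex) {
        // Assumed pattern and time zone; the original sdf is not shown.
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
        long timestamp = bytesToInt(value, timestampIndex);
        return generateFileName(outputDir, sdf.format(new Date(timestamp * 1000L)));
    }

    public static void main(String[] args) {
        // Made-up 8-byte value with a timestamp at offset 4.
        byte[] value = ByteBuffer.allocate(8).putInt(4, 1395882000).array();
        System.out.println(pathFor("/out", value, 4));
    }
}
```

Every distinct formatted date yields a distinct output path, so the number of files written depends on how many different dates appear in the input values.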
>
> And here is the job output:
>
> 14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
> 14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
> 14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
> 14/03/27 11:00:56 INFO mapred.JobClient: File Output Format Counters
> 14/03/27 11:00:56 INFO mapred.JobClient: Bytes Written=0
> 14/03/27 11:00:56 INFO mapred.JobClient: FileSystemCounters
> 14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_READ=171086386
> 14/03/27 11:00:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54272
> 14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1111374798
> 14/03/27 11:00:56 INFO mapred.JobClient: File Input Format Counters
> 14/03/27 11:00:56 INFO mapred.JobClient: Bytes Read=170782415
> 14/03/27 11:00:56 INFO mapred.JobClient: Map-Reduce Framework
> 14/03/27 11:00:56 INFO mapred.JobClient: Map input records=547
> 14/03/27 11:00:56 INFO mapred.JobClient: Physical memory (bytes) snapshot=166428672
> 14/03/27 11:00:56 INFO mapred.JobClient: Spilled Records=0
> 14/03/27 11:00:56 INFO mapred.JobClient: Total committed heap usage (bytes)=38351872
> 14/03/27 11:00:56 INFO mapred.JobClient: CPU time spent (ms)=20080
> 14/03/27 11:00:56 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1240104960
> 14/03/27 11:00:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=286
> 14/03/27 11:00:56 INFO mapred.JobClient: Map output records=0
>
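As a sanity check on the counters quoted above, the arithmetic can be spelled out. This is a minimal sketch that uses only the numbers from the job output; the class and method names are made up for illustration. It shows that the job wrote roughly 6.5 times as many bytes to HDFS as it read, and that the 547 input records average a little over 312,000 bytes each.

```java
class CounterCheck {
    // Ratio of bytes written to bytes read.
    static double expansionFactor(long bytesWritten, long bytesRead) {
        return (double) bytesWritten / bytesRead;
    }

    // Average record size in whole bytes (integer division).
    static long averageRecordBytes(long bytesRead, long records) {
        return bytesRead / records;
    }

    public static void main(String[] args) {
        // Values copied from the quoted job counters.
        long hdfsBytesRead = 171086386L;
        long hdfsBytesWritten = 1111374798L;
        long inputFormatBytesRead = 170782415L;
        long mapInputRecords = 547L;

        System.out.println("expansion factor: "
                + expansionFactor(hdfsBytesWritten, hdfsBytesRead));
        System.out.println("average input record size (bytes): "
                + averageRecordBytes(inputFormatBytesRead, mapInputRecords));
    }
}
```

The counters alone do not say where the extra bytes come from, which is why looking at the actual records in the output files is the next step.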
>
> TIA,
>
>
> Kim
>