I have a simple map-only M/R job (no reducer). The mapper reads a
timestamp from the value, generates an output file path from it, and
writes the key and value to that file.
The input is an uncompressed sequence file stored in HDFS, 162.68 MB in
size. The output is also written as a sequence file.
However, after running the job I have two output part files from the
mapper: one is 835.12 MB and the other is 224.77 MB. Why is the total
output size so much larger? Shouldn't it be more or less equal to the
input size of 162.68 MB, since I just write the key and value passed to
the mapper to the output?
Here is the mapper code snippet:

  public void map(BytesWritable key, BytesWritable value, Context context)
      throws IOException, InterruptedException {
    // The timestamp read from the value is multiplied by 1000 to get ms.
    long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
    String tsStr = sdf.format(new Date(timestamp * 1000L));
    // mos is a MultipleOutputs object.
    mos.write(key, value, generateFileName(tsStr));
  }

  private String generateFileName(String key) {
    return outputDir + "/" + key + "/raw-vectors";
  }
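The helpers bytesToInt and sdf are not shown above. As a rough
illustration of how the output path is derived, here is a stand-alone
sketch without any Hadoop dependency. It assumes a 4-byte big-endian
timestamp in seconds at offset 0, a "yyyy-MM-dd" date pattern, and UTC;
all three are assumptions made for the example, as is the class name
OutputPathSketch, and the real job may differ.

```java
import java.nio.ByteBuffer;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class OutputPathSketch {
    // Assumed offset of the timestamp inside the value bytes.
    static final int TIMESTAMP_INDEX = 0;

    // Assumed decoding: 4 bytes, big-endian, starting at the given offset.
    static int bytesToInt(byte[] buf, int offset) {
        return ByteBuffer.wrap(buf, offset, 4).getInt();
    }

    // Mirrors generateFileName(tsStr) from the mapper, with the timestamp
    // decoding and formatting folded in so the sketch is self-contained.
    static String generateFileName(String outputDir, byte[] value) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd"); // assumed pattern
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));              // assumed time zone
        long timestamp = bytesToInt(value, TIMESTAMP_INDEX);
        String tsStr = sdf.format(new Date(timestamp * 1000L));    // seconds to ms
        return outputDir + "/" + tsStr + "/raw-vectors";
    }

    public static void main(String[] args) {
        // Example value whose first four bytes hold a timestamp in seconds.
        byte[] value = ByteBuffer.allocate(8).putInt(1395918056).putInt(0).array();
        System.out.println(generateFileName("/out", value));
    }
}
```

With these assumptions, every record whose timestamp falls on the same
day is written under the same directory, so the number of distinct
output paths depends on how many days the input covers.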
And here is the job output:
14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/03/27 11:00:56 INFO mapred.JobClient: File Output Format Counters
14/03/27 11:00:56 INFO mapred.JobClient: Bytes Written=0
14/03/27 11:00:56 INFO mapred.JobClient: FileSystemCounters
14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_READ=171086386
14/03/27 11:00:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54272
14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1111374798
14/03/27 11:00:56 INFO mapred.JobClient: File Input Format Counters
14/03/27 11:00:56 INFO mapred.JobClient: Bytes Read=170782415
14/03/27 11:00:56 INFO mapred.JobClient: Map-Reduce Framework
14/03/27 11:00:56 INFO mapred.JobClient: Map input records=547
14/03/27 11:00:56 INFO mapred.JobClient: Physical memory (bytes) snapshot=166428672
14/03/27 11:00:56 INFO mapred.JobClient: Spilled Records=0
14/03/27 11:00:56 INFO mapred.JobClient: Total committed heap usage (bytes)=38351872
14/03/27 11:00:56 INFO mapred.JobClient: CPU time spent (ms)=20080
14/03/27 11:00:56 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1240104960
14/03/27 11:00:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=286
14/03/27 11:00:56 INFO mapred.JobClient: Map output records=0
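To compare the counters with the file sizes quoted earlier, the byte
counts can be converted to MB (taken here as 2^20 bytes, which is an
assumption about how the file sizes were reported). The class and method
names below are made up for this sketch and are not part of the job.

```java
public class CounterSizes {
    // Counter values copied from the job output above.
    static final long HDFS_BYTES_WRITTEN = 1111374798L;
    static final long INPUT_BYTES_READ = 170782415L; // "Bytes Read"

    // Converts a byte count to MB, assuming 1 MB = 1024 * 1024 bytes.
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    // How many times larger the written data is than the data read.
    static double ratio(long written, long read) {
        return (double) written / read;
    }

    public static void main(String[] args) {
        System.out.printf("written: %.2f MB%n", toMiB(HDFS_BYTES_WRITTEN));
        System.out.printf("read:    %.2f MB%n", toMiB(INPUT_BYTES_READ));
        System.out.printf("ratio:   %.2f%n", ratio(HDFS_BYTES_WRITTEN, INPUT_BYTES_READ));
    }
}
```

HDFS_BYTES_WRITTEN comes to about 1059.89 MB, which matches the sum of
the two part files (835.12 MB + 224.77 MB), so the output is roughly 6.5
times the number of bytes read.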
TIA,
Kim