Hello Harsh,

Thanks for the useful feedback. You were right: my map tasks do open additional files from HDFS. The catch was that thousands of map tasks were being created, and each of them was repeatedly reading the same files from HDFS, which ultimately dominated the job execution time. I increased the minimum split size for the job, which reduced the number of map tasks it spawns.
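For the archives, the driver-side change boils down to something like the following. This is only a minimal sketch against the new mapreduce API; the class name, the 512 MB value and the argument handling are placeholders, not my actual job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "iteration"); // Job.getInstance(conf, ...) on newer releases

        // Raise the minimum split size (in bytes) so that fewer, larger input
        // splits are created, which in turn means fewer map tasks.
        // The 512 MB figure is only illustrative.
        FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... remaining setup (mapper, reducer, key/value classes, formats) ...

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same effect can be had by setting the mapred.min.split.size property (mapreduce.input.fileinputformat.split.minsize on newer releases) in the job configuration.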
Cheers,
Jim

On Thu, May 16, 2013 at 2:56 AM, Harsh J <[email protected]> wrote:
> Hi Jim,
>
> The counters you're looking at are counted at the FileSystem interface
> level, not at the more specific Task level (which have "map input
> bytes" and such).
>
> This means that if your map or reduce code is opening side-files/using
> a FileSystem object to read extra things, the count will go up as
> expected.
>
> For simple input and output size validation of a job, minus anything
> the code does on top, it's better to look at the "map/reduce
> input/output bytes" form of counters instead.
>
> On Tue, May 14, 2013 at 10:41 PM, Jim Twensky <[email protected]> wrote:
> > I have an iterative MapReduce job that I run over 35 GB of data
> > repeatedly. The output of the first job is the input to the second
> > one and it goes on like that until convergence.
> >
> > I am seeing a strange behavior with the program run time. The first
> > iteration takes 4 minutes to run and here is how the counters look:
> >
> > HDFS_BYTES_READ     34,860,867,377
> > HDFS_BYTES_WRITTEN  45,573,255,806
> >
> > The second iteration takes 15 minutes and here is how the counters
> > look in this case:
> >
> > HDFS_BYTES_READ     144,563,459,448
> > HDFS_BYTES_WRITTEN   49,779,966,388
> >
> > I cannot explain these numbers because the first iteration - to begin
> > with - should only generate approximately 35 GB of output. When I
> > check the output size using
> >
> >   hadoop fs -dus
> >
> > I can confirm that it is indeed 35 GB. But for some reason
> > HDFS_BYTES_WRITTEN shows 45 GB. Then the input to the second iteration
> > should be 35 GB (or even 45 GB considering the counter value), but
> > HDFS_BYTES_READ shows 144 GB.
> >
> > All following iterations produce counter values similar to the second
> > one and they take roughly 15 min each. My dfs replication factor is
> > set to 1 and there is no compression turned on. All inputs and outputs
> > are in SequenceFile format. The initial input is a sequence file that
> > I generated locally using SequenceFile.Writer, but I use the default
> > values and, as far as I know, compression should be turned off. Am I
> > wrong?
> >
> > Thanks in advance.
>
> --
> Harsh J
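P.S. For anyone who finds this thread later: the pattern Harsh describes looked roughly like the sketch below in my job. It is simplified, and the class name, the path and the idea of a lookup file are all made up for illustration. Every map task re-opens the same side file from HDFS in setup(), so all of those bytes are charged to the FileSystem-level HDFS_BYTES_READ counter, while only the task's own input split counts toward "map input bytes".

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideFileMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Runs once per map task; with thousands of map tasks the same file is
        // pulled from HDFS thousands of times, all of it counted under the
        // FileSystem-level HDFS_BYTES_READ counter.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataInputStream in = fs.open(new Path("/data/lookup/side-file")); // made-up path
        try {
            // ... load the lookup data into memory ...
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... per-record work; only the task's assigned input split shows up
        // under the task-level "map input bytes" counter ...
        context.write(value, new LongWritable(1));
    }
}

Cutting the number of map tasks (as described above) or shipping such files through the DistributedCache keeps these repeated reads from dominating the run time.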
