Just thought I'd share some insight into our problem. The slowdown appears to have been caused by our use of multipleOutputs.write(output, key, keyValue, path) (going from memory here). Looking at the implementation of that write method in MultipleOutputs.java, it appears that every call to write(output, key, keyValue, path) creates a context, fetches a conf, and fetches a new RecordWriter.
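To make the suspected cost pattern concrete, here is a minimal, self-contained sketch in plain Java. It is not Hadoop's code and does not reproduce what MultipleOutputs actually does internally; FakeWriter, WriterReuseSketch, and both write methods are hypothetical names invented for illustration. It only counts how often an assumed expensive setup step runs when it is repeated on every write versus reused per output path, for records split across three outputs.

```java
import java.util.HashMap;
import java.util.Map;

public class WriterReuseSketch {
    /** Hypothetical stand-in for a writer that is expensive to set up. */
    static final class FakeWriter {
        private final StringBuilder sink = new StringBuilder();

        void write(String key, String value) {
            sink.append(key).append('\t').append(value).append('\n');
        }
    }

    private final Map<String, FakeWriter> writersByPath = new HashMap<>();
    private int writersCreated = 0;

    // Counts every setup so the two strategies can be compared.
    private FakeWriter newWriter() {
        writersCreated++;
        return new FakeWriter();
    }

    /** Assumed slow pattern: the setup is repeated on every call. */
    void writeWithSetupPerCall(String path, String key, String value) {
        newWriter().write(key, value);
    }

    /** Alternative: one writer per output path, created on first use. */
    void writeWithReuse(String path, String key, String value) {
        writersByPath.computeIfAbsent(path, p -> newWriter()).write(key, value);
    }

    int writersCreated() {
        return writersCreated;
    }

    public static void main(String[] args) {
        String[] paths = {"a", "b", "c"}; // three outputs, as in the job described
        WriterReuseSketch perCall = new WriterReuseSketch();
        WriterReuseSketch reused = new WriterReuseSketch();
        for (int i = 0; i < 9000; i++) {
            String path = paths[i % paths.length];
            perCall.writeWithSetupPerCall(path, "k" + i, "v" + i);
            reused.writeWithReuse(path, "k" + i, "v" + i);
        }
        System.out.println("setups, per call: " + perCall.writersCreated()); // 9000
        System.out.println("setups, reused:   " + reused.writersCreated());  // 3
    }
}
```

If the per-call setup is what dominates, the number of setups grows with the number of records instead of the number of outputs, which would be consistent with the slowdown described here.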
We have changed all of those calls to write(output, key, keyValue), which doesn't do any of that extra work, and it seems to help. Does anyone else have any tips for using MultipleOutputs? We are taking our input and splitting it into 3 files, so MultipleOutputs seems like a natural choice. Performance is a bit slow, though.

Cheers!
David

________________________________________
From: David Poisson [[email protected]]
Sent: Thursday, June 27, 2013 4:22 PM
To: [email protected]
Subject: Profiling map reduce jobs?

Howdy,

I want to take a look at an MR job which seems to be slower than I had hoped. Mind you, this MR job is only running on a pseudo-distributed VM (Cloudera CDH4). I have modified my mapred-site.xml with the following (the last property is commented out because it crashes my MR job):

<property>
  <name>mapred.task.profile</name>
  <value>true</value>
</property>
<property>
  <name>mapred.task.profile.maps</name>
  <value>0-2</value>
</property>
<property>
  <name>mapred.task.profile.reduces</name>
  <value>0-2</value>
</property>
<!--property>
  <name>mapred.task.profile.params</name>
  <value>agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
</property-->

Are there any resources that explain how to interpret the results? Or maybe an open-source app that could help display the results in a more intuitive manner? Ideally, we'd want to know where we are spending most of our time.

Cheers,
David
