Just thought I'd provide some insight into our problem. 

It appears that the slowdown was caused by our use of 
multipleOutputs.write(output, key, keyValue, path) (going from memory here). 
After looking at the implementation of that write function in 
MultipleOutputs.java, it appears that every call to write(output, key, 
keyValue, path) creates a context, fetches a conf, and obtains a new 
RecordWriter.

We have changed all of those calls to write(output, key, keyValue), which 
doesn't do any of that extra work, and it seems to help.
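
To make the suspected cost concrete, here is a minimal, Hadoop-free sketch of 
the difference between setting up a writer on every call and reusing one per 
output path. Every name in it (WriterCacheSketch, FakeWriter, writeUncached, 
writeCached) is hypothetical and invented for illustration; it is not the 
MultipleOutputs implementation, just the general caching idea under the 
assumption that per-call setup is what hurts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, Hadoop-free illustration: every name here is made up.
final class WriterCacheSketch {

    /** Stand-in for a record writer that is expensive to set up. */
    static final class FakeWriter {
        final String path;
        int records = 0;

        FakeWriter(String path) {
            this.path = path;
        }

        void write(String key, String value) {
            records++; // a real writer would serialize the pair to 'path'
        }
    }

    private final Map<String, FakeWriter> cache = new HashMap<>();
    int writersCreated = 0;

    /** Simulates the expensive setup step and counts how often it runs. */
    private FakeWriter open(String path) {
        writersCreated++;
        return new FakeWriter(path);
    }

    /** Per-call setup: a fresh writer is built for every record. */
    void writeUncached(String path, String key, String value) {
        open(path).write(key, value);
    }

    /** Cached setup: one writer is built per distinct path and then reused. */
    void writeCached(String path, String key, String value) {
        cache.computeIfAbsent(path, this::open).write(key, value);
    }

    public static void main(String[] args) {
        String[] paths = {"a", "b", "c"}; // three outputs, as in our job
        WriterCacheSketch uncached = new WriterCacheSketch();
        WriterCacheSketch cached = new WriterCacheSketch();
        for (int i = 0; i < 1000; i++) {
            String path = paths[i % paths.length];
            uncached.writeUncached(path, "k" + i, "v" + i);
            cached.writeCached(path, "k" + i, "v" + i);
        }
        System.out.println("uncached writers created: " + uncached.writersCreated);
        System.out.println("cached writers created:   " + cached.writersCreated);
    }
}
```

With three output paths the cached variant runs the setup step three times in 
total, while the uncached one runs it once per record.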

Does anyone else have any tips for using MultipleOutputs?

We are taking our input and splitting it into 3 files, so MultipleOutputs 
seems to be a natural choice. Performance is a bit slow, though.

Cheers!

David
________________________________________
From: David Poisson [[email protected]]
Sent: Thursday, June 27, 2013 4:22 PM
To: [email protected]
Subject: Profiling map reduce jobs?

Howdy,
I want to take a look at an MR job which seems to be slower than I had 
hoped. Mind you, this MR job is only running on a pseudo-distributed VM 
(Cloudera CDH4).

I have modified my mapred-site.xml with the following (the last property is 
commented out because it crashes my MR job):

  <property>
    <name>mapred.task.profile</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.task.profile.maps</name>
    <value>0-2</value>
  </property>
  <property>
    <name>mapred.task.profile.reduces</name>
    <value>0-2</value>
  </property>
  <!--property>
    <name>mapred.task.profile.params</name>
    <value>agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
  </property-->

Are there any resources that explain how to interpret the results?
Or maybe an open-source app that could help display the results in a more 
intuitive manner?

Ideally, we'd want to know where we are spending most of our time.
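
For the "where is the time going" question, a rough sketch of pulling the 
ranked method list out of an hprof text profile might look like the 
following. It assumes the profile was written in hprof's plain-text format 
with cpu=samples, so that it contains a section delimited by 
"CPU SAMPLES BEGIN" and "CPU SAMPLES END" whose rows carry rank, self %, 
accum %, count, trace id and method name. The class and field names 
(HprofCpuSamples, Sample) are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the rows of the CPU SAMPLES section of an hprof text file.
final class HprofCpuSamples {

    /** One row of the CPU SAMPLES table. */
    static final class Sample {
        final int rank;
        final double selfPercent;
        final int count;
        final String method;

        Sample(int rank, double selfPercent, int count, String method) {
            this.rank = rank;
            this.selfPercent = selfPercent;
            this.count = count;
            this.method = method;
        }
    }

    /** Parses the rows between "CPU SAMPLES BEGIN" and "CPU SAMPLES END". */
    static List<Sample> parse(List<String> lines) {
        List<Sample> samples = new ArrayList<>();
        boolean inSection = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("CPU SAMPLES BEGIN")) {
                inSection = true;
                continue;
            }
            if (line.startsWith("CPU SAMPLES END")) {
                break;
            }
            // Skip everything outside the section, blank lines and the column header.
            if (!inSection || line.isEmpty() || line.startsWith("rank")) {
                continue;
            }
            // Assumed columns: rank, self %, accum %, count, trace id, method.
            String[] f = line.split("\\s+");
            if (f.length < 6) {
                continue; // not a data row in the assumed layout
            }
            samples.add(new Sample(
                    Integer.parseInt(f[0]),
                    Double.parseDouble(f[1].replace("%", "")),
                    Integer.parseInt(f[3]),
                    f[5]));
        }
        return samples;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HprofCpuSamples <profile-file>");
            return;
        }
        List<Sample> samples = parse(Files.readAllLines(Paths.get(args[0])));
        // Rows are already ordered by rank in the file; print the first ten.
        for (Sample s : samples.subList(0, Math.min(10, samples.size()))) {
            System.out.printf("%3d %6.2f%% %8d %s%n", s.rank, s.selfPercent, s.count, s.method);
        }
    }
}
```

The methods at the top of that list are the ones most often found on top of 
the stack when a sample was taken, which is usually the first place to look.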

Cheers,

David
