Hi Sean,

I will try increasing the "io.sort.factor" and "io.sort.mb" properties in
core-site.xml and see what happens. BTW, I see you use String javaOpts =
conf.get("mapred.child.java.opts"); to get the heap size for each map/reduce
task. That's fine for Hadoop 0.20.2 and earlier, but since 0.20.3 it has been
replaced by "mapred.map.child.java.opts" and "mapred.reduce.child.java.opts",
so maybe you should fall back to a default or make it an argument for the
user to supply, something like the sketch below.
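A minimal sketch of that fallback, assuming a Hadoop 0.20-style
Configuration; the helper name and the 512m default here are illustrative,
not existing Mahout code:

    import org.apache.hadoop.conf.Configuration;

    public final class TaskHeapOpts {
      // Prefer the newer per-task properties; fall back to the old single
      // property when they are absent (Hadoop 0.20.2 and earlier).
      static String taskJavaOpts(Configuration conf, boolean isMapTask) {
        String key = isMapTask
            ? "mapred.map.child.java.opts"
            : "mapred.reduce.child.java.opts";
        String opts = conf.get(key);
        if (opts == null) {
          opts = conf.get("mapred.child.java.opts", "-Xmx512m");  // assumed default
        }
        return opts;
      }
    }

Cheers,
Ramon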
> Date: Tue, 18 Oct 2011 09:58:30 +0100
> Subject: Re: Any general performance tips for job
> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?
> From: [email protected]
> To: [email protected]
>
> If the merge phase is what's taking a while, I can suggest two
> parameter changes to help speed that up. (This is in addition to what
> Sebastian said.)
>
> First, I think it's useful to let it do a 100-way segment merge
> instead of 10-way. (Or more.) This is controlled by "io.sort.factor"
> in Hadoop.
>
> Second, you probably want to let the combiner do more combining, to
> reduce the number of records spilled and merged. For this, you increase
> "io.sort.mb" (the in-memory sort buffer). This job has a Combiner, so
> that's valid. You could set it up to half of your worker memory or so.
>
> Here's a section of code in RecommenderJob that configures all of this
> automatically on a JobContext; if it works for you, we could include it
> in this job too:
>
> // Uses java.util.regex.Matcher/Pattern, org.apache.hadoop.conf.Configuration
> // and org.apache.hadoop.mapreduce.JobContext.
> private static void setIOSort(JobContext job) {
>   Configuration conf = job.getConfiguration();
>   conf.setInt("io.sort.factor", 100);
>   int assumedHeapSize = 512;
>   String javaOpts = conf.get("mapred.child.java.opts");
>   if (javaOpts != null) {
>     Matcher m = Pattern.compile("-Xmx([0-9]+)([mMgG])").matcher(javaOpts);
>     if (m.find()) {
>       assumedHeapSize = Integer.parseInt(m.group(1));
>       String megabyteOrGigabyte = m.group(2);
>       if ("g".equalsIgnoreCase(megabyteOrGigabyte)) {
>         assumedHeapSize *= 1024;
>       }
>     }
>   }
>   // Cap this at 1024MB now; see
>   // https://issues.apache.org/jira/browse/MAPREDUCE-2308
>   conf.setInt("io.sort.mb", Math.min(assumedHeapSize / 2, 1024));
>   // For some reason the Merger doesn't report status for a long time;
>   // increase timeout when running these jobs
>   conf.setInt("mapred.task.timeout", 60 * 60 * 1000);
> }
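A minimal usage sketch of the setIOSort helper above, assuming the Hadoop
0.20 "new" API (where Job extends JobContext) and that the call sits in the
same driver class as setIOSort; the job name and setup steps are
illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Job job = new Job(new Configuration(), "RowSimilarityJob");  // illustrative name
    setIOSort(job);  // Job extends JobContext, so the helper applies directly
    // ... configure mapper, reducer, input and output paths as usual ...
    job.waitForCompletion(true);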
>
>
> 2011/10/18 WangRamon <[email protected]>:
> >
> > Hi All,
> >
> > I'm running a recommender job on a Hadoop environment with about 600,000
> > users and 2,000,000 items; there are about 66,260,000 user-preference
> > records in total, and the data file is about 1GB. I found the
> > RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow,
> > and I get a lot of log lines like these in the mapper task output:
> >
> > 2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10
> > intermediate segments out of a total of 73
> > 2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10
> > intermediate segments out of a total of 64
> > 2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10
> > intermediate segments out of a total of 55
> > 2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10
> > intermediate segments out of a total of 46
> >
> > I did find a similar question on the mailing list, e.g.
> > http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%[email protected]%3E
> > , where Sebastian said something about using Mahout 0.5 in that thread,
> > and yes, I am using Mahout 0.5; however, there was no further discussion.
> > It would be great if you could share some ideas/suggestions here; that
> > would be a big help to me. Thanks in advance.
> >
> > BTW, I have the following parameters already set in Hadoop:
> >
> > mapred.child.java.opts -> 2048M
> > fs.inmemory.size.mb -> 200
> > io.file.buffer.size -> 131072
> >
> > I have two servers, each with 32GB RAM. Thanks!
> >
> > Cheers,
> > Ramon
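For reference, a minimal sketch of those three settings applied
programmatically to a Hadoop Configuration; the -Xmx form is an assumption
about what "2048M" maps to:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    conf.set("mapred.child.java.opts", "-Xmx2048m");  // assumed form of "2048M"
    conf.setInt("fs.inmemory.size.mb", 200);          // in-memory merge size (MB)
    conf.setInt("io.file.buffer.size", 131072);       // 128KB I/O buffers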