Hi Sean,

I will try increasing the "io.sort.factor" and "io.sort.mb" properties in
core-site.xml and see what happens. BTW, I see you use String javaOpts =
conf.get("mapred.child.java.opts"); to get the heap size for each map/reduce
task. That's fine for Hadoop 0.20.2 and earlier, but since 0.20.3 it has been
replaced by "mapred.map.child.java.opts" and "mapred.reduce.child.java.opts",
so maybe you should fall back to a default or make it an argument for the
user to supply, something like the sketch below.
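A minimal sketch of that fallback, assuming a Hadoop 0.20-style
Configuration; the helper name and the 512m default here are illustrative,
not existing Mahout code:

    import org.apache.hadoop.conf.Configuration;

    public final class TaskHeapOpts {
      // Prefer the newer per-task properties; fall back to the old single
      // property when they are absent (Hadoop 0.20.2 and earlier).
      static String taskJavaOpts(Configuration conf, boolean isMapTask) {
        String key = isMapTask
            ? "mapred.map.child.java.opts"
            : "mapred.reduce.child.java.opts";
        String opts = conf.get(key);
        if (opts == null) {
          opts = conf.get("mapred.child.java.opts", "-Xmx512m");  // assumed default
        }
        return opts;
      }
    }

Cheers,
Ramon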
> Date: Tue, 18 Oct 2011 09:58:30 +0100
> Subject: Re: Any general performance tips for job
> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?
> From: [email protected]
> To: [email protected]
>
> If the merge phase is what's taking a while, I can suggest two
> parameter changes to help speed that up. (This is in addition to what
> Sebastian said.)
>
> First, I think it's useful to let it do a 100-way segment merge
> instead of 10-way. (Or more.) This is controlled by "io.sort.factor"
> in Hadoop.
>
> Second, you probably want to let the combiner do more combining, to
> reduce the number of records spilled and merged. For this, you increase
> "io.sort.mb" (the in-memory sort buffer). This job has a Combiner, so
> that's valid. You could set it up to half of your worker memory or so.
>
> Here's a section of code in RecommenderJob that configures all of this
> automatically on a JobContext; if it works for you, we could include it
> in this job too:
>
> // Uses java.util.regex.Matcher/Pattern, org.apache.hadoop.conf.Configuration
> // and org.apache.hadoop.mapreduce.JobContext.
> private static void setIOSort(JobContext job) {
>   Configuration conf = job.getConfiguration();
>   conf.setInt("io.sort.factor", 100);
>   int assumedHeapSize = 512;
>   String javaOpts = conf.get("mapred.child.java.opts");
>   if (javaOpts != null) {
>     Matcher m = Pattern.compile("-Xmx([0-9]+)([mMgG])").matcher(javaOpts);
>     if (m.find()) {
>       assumedHeapSize = Integer.parseInt(m.group(1));
>       String megabyteOrGigabyte = m.group(2);
>       if ("g".equalsIgnoreCase(megabyteOrGigabyte)) {
>         assumedHeapSize *= 1024;
>       }
>     }
>   }
>   // Cap this at 1024MB now; see
>   // https://issues.apache.org/jira/browse/MAPREDUCE-2308
>   conf.setInt("io.sort.mb", Math.min(assumedHeapSize / 2, 1024));
>   // For some reason the Merger doesn't report status for a long time;
>   // increase timeout when running these jobs
>   conf.setInt("mapred.task.timeout", 60 * 60 * 1000);
> }
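A minimal usage sketch of the setIOSort helper above, assuming the Hadoop
0.20 "new" API (where Job extends JobContext) and that the call sits in the
same driver class as setIOSort; the job name and setup steps are
illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Job job = new Job(new Configuration(), "RowSimilarityJob");  // illustrative name
    setIOSort(job);  // Job extends JobContext, so the helper applies directly
    // ... configure mapper, reducer, input and output paths as usual ...
    job.waitForCompletion(true);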
>
>
> 2011/10/18 WangRamon <[email protected]>:
> >
> > Hi All,
> >
> > I'm running a recommender job on a Hadoop environment with about 600,000
> > users and 2,000,000 items; there are about 66,260,000 user-preference
> > records in total, and the data file is about 1GB. I found the
> > RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow,
> > and I get a lot of log lines like these in the mapper task output:
> >
> > 2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10
> > intermediate segments out of a total of 73
> > 2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10
> > intermediate segments out of a total of 64
> > 2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10
> > intermediate segments out of a total of 55
> > 2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10
> > intermediate segments out of a total of 46
> >
> > I did find a similar question on the mailing list, e.g.
> > http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%[email protected]%3E
> > , where Sebastian said something about using Mahout 0.5 in that thread,
> > and yes, I am using Mahout 0.5; however, there was no further discussion.
> > It would be great if you could share some ideas/suggestions here; that
> > would be a big help to me. Thanks in advance.
> >
> > BTW, I have the following parameters already set in Hadoop:
> >
> > mapred.child.java.opts -> 2048M
> > fs.inmemory.size.mb -> 200
> > io.file.buffer.size -> 131072
> >
> > I have two servers, each with 32GB RAM. Thanks!
> >
> > Cheers,
> > Ramon
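For reference, a minimal sketch of those three settings applied
programmatically to a Hadoop Configuration; the -Xmx form is an assumption
about what "2048M" maps to:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    conf.set("mapred.child.java.opts", "-Xmx2048m");  // assumed form of "2048M"
    conf.setInt("fs.inmemory.size.mb", 200);          // in-memory merge size (MB)
    conf.setInt("io.file.buffer.size", 131072);       // 128KB I/O buffers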