Hi,
It sounds to me like this could be related to one of the questions I posted
several days ago (is it?):
My mahout clustering processes seem to be running very slowly (several hours
on just ~1M items), and I'm wondering if there is anything that needs to be
changed in my settings/configuration (and how?).
I'm running on a large cluster and could potentially use thousands of nodes
(mappers/reducers). However, my mahout processes (kmeans/canopy) are only
using at most 5 mappers (I tried this on several data sets).
I tried to set the number of mappers with something like
-Dmapred.map.tasks=100, but this didn't seem to have any effect; it still
uses <=5 mappers.
Is there a different way to set the number of mappers/reducers for a
mahout process?
Or is there another configuration issue I need to consider?
I'd definitely be happy to use such a parameter; does it not exist yet?
(I'm running mahout as installed on the cluster.)
Is there currently a workaround, besides running a mahout jar as a hadoop job?
When I originally tried to run a mahout jar that uses KMeansDriver (and that
runs great on my local machine), it did not even initiate a job on the hadoop
cluster. It seemed to be running in parallel, but in fact it was running only
on the local node. Is this a known issue? Is there a fix for it? (I ended up
dropping that approach and calling mahout step by step from the command line,
but I'd be happy to know if there is a fix.)
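My guess is that the Configuration my jar was building never picked up the
cluster settings, so Hadoop quietly fell back to the local job runner. A tiny
check along these lines would show it (just a sketch; the class name is made
up):

  import org.apache.hadoop.conf.Configuration;

  public class WhereWillItRun {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Without the cluster's config files on the classpath these print the
      // local defaults ("local" and "file:///"), and any job submitted with
      // such a conf runs with the local job runner on the submitting node.
      System.out.println("mapred.job.tracker = "
          + conf.get("mapred.job.tracker", "local"));
      System.out.println("fs.default.name = "
          + conf.get("fs.default.name", "file:///"));
    }
  }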
Thanks,
Galit.
-----Original Message-----
From: Ryan Josal [mailto:[email protected]]
Sent: Monday, July 29, 2013 9:33 PM
To: Adam Baron
Cc: Ryan Josal; [email protected]
Subject: Re: Run more than one mapper for TestForest?
If you're running mahout from the CLI, you'll have to modify the Hadoop config
file or your env manually for each job. This is code I put into my custom job
executions so I didn't have to calculate and set that up every time. Maybe
that's your best route in your situation. You could just provide your own
mahout jar and run it as you would any other Hadoop job and ignore the
installed Mahout. I do think this could be a useful parameter for a number of
standard mahout jobs though; I know I would use it. Does anyone in the mahout
community see this as a generally useful feature for a Mahout job?
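To spell out the config-file route a bit (just a sketch; 16 MB is an arbitrary
example value): take a copy of the cluster's Hadoop conf directory, add
something like

  <property>
    <name>mapred.max.split.size</name>
    <value>16777216</value>
  </property>

to mapred-site.xml in that copy, and point HADOOP_CONF_DIR at it before
invoking the mahout script. Splits are computed on the client side, so a
smaller max split size means more mappers for the same input.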
Ryan
On Jul 29, 2013, at 10:25, Adam Baron <[email protected]> wrote:
> Ryan,
>
> Thanks for the fix, the code looks reasonable to me. Which version of Mahout
> will this be in? 0.9?
>
> Unfortunately, I'm using a large shared Hadoop cluster which is not
> administered by my team. So I'm not in a position to push the latest from the
> Mahout dev trunk into our environment; the admins will only install official
> releases.
>
> Regards,
> Adam
>
> On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal <[email protected]> wrote:
>> Late reply, but for what it's still worth, since I've seen a couple other
>> threads here on the topic of too few mappers, I added a parameter to set a
>> minimum number of mappers. Some of my mahout jobs needed more mappers, but
>> were not given many because of the small input file size.
>>
>> addOption("minMapTasks", "m", "Minimum number of map tasks to run",
>>     String.valueOf(1));
>>
>> int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
>> int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
>> log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
>> if (minMapTasks > mapTasksThatWouldRun) {
>>   String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
>>   log.info("Forcing mapred.max.split.size to " + splitSizeBytes
>>       + " to ensure minimum map tasks = " + minMapTasks);
>>   hadoopConf.set("mapred.max.split.size", splitSizeBytes);
>> }
>>
>> // there is actually a private method in hadoop to calculate this
>> private long getSplitSize() {
>>   long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
>>   long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
>>   int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
>>   long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
>>   log.info(String.format("min: %,d block: %,d max: %,d split: %,d",
>>       minSize, blockSize, maxSize, splitSize));
>>   return splitSize;
>> }
>>
>> It seems like there should be a more straightforward way to do this, but it
>> works for me and I've used it on a lot of jobs to set a minimum number of
>> mappers.
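>>
>> For example, in one of my own jobs built on AbstractJob (the jar and class
>> names below are made up), forcing at least 100 mappers looks like:
>>
>>   hadoop jar my-mahout-jobs.jar com.example.MyClusteringJob \
>>     --minMapTasks 100 -i /path/to/vectors -o /path/to/output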
>>
>> Ryan
>>
>> On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:
>>
>> > I'm attempting to run
>> > org.apache.mahout.classifier.df.mapreduce.TestForest
>> > on a CSV with 200,000 rows that have 500,000 features per row.
>> > However, TestForest is running extremely slowly, likely because only
>> > 1 mapper was assigned to the job. This seems strange because the
>> > org.apache.mahout.classifier.df.mapreduce.BuildForest step on the
>> > same data used 1772 mappers and took about 6 minutes. (BTW: I know I
>> > *shouldn't* use the same data set for the training and the testing
>> > steps; this is purely a technical experiment to see whether Mahout's
>> > Random Forest can handle the data sizes we typically deal with.)
>> >
>> > Any idea on how to get
>> > org.apache.mahout.classifier.df.mapreduce.TestForest
>> > to use more mappers? Glancing at the code (and thinking about what
>> > is happening intuitively), it should be ripe for parallelization.
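>> >
>> > For reference, I'm invoking it roughly like this (paths are placeholders):
>> >
>> >   mahout org.apache.mahout.classifier.df.mapreduce.TestForest \
>> >     -i /path/to/test.csv -ds /path/to/dataset.info \
>> >     -m /path/to/forest -a -mr -o /path/to/predictions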
>> >
>> > Thanks,
>> > Adam
>