Hi,
It sounds to me like this could be related to one of the questions I posted
several days ago (is it?):
My mahout clustering processes seem to be running very slowly (several hours
on just ~1M items), and I'm wondering if there is anything that needs to be
changed in my settings/configuration (and how?).
I'm running on a large cluster and could potentially use thousands of nodes
(mappers/reducers). However, my mahout processes (kmeans/canopy) are only
using at most 5 mappers (I tried this on several data sets).
I tried to set the number of mappers with something like
-Dmapred.map.tasks=100, but this didn't seem to have any effect; it still
uses <=5 mappers.
Is there a different way to set the number of mappers/reducers for a
mahout process?
Or is there another configuration issue I need to consider?
I'd definitely be happy to use such a parameter; does it not exist yet?
(I'm running mahout as installed on the cluster.)
Is there currently a workaround, besides running a mahout jar as a hadoop job?
When I originally tried to run a mahout jar that uses KMeansDriver (and that
runs great on my local machine), it did not even initiate a job on the hadoop
cluster. It seemed to be running in parallel, but in fact it was running only
on the local node. Is this a known issue? Is there a fix for it? (I ended up
dropping that approach and calling mahout step by step from the command line,
but I'd be happy to know if there is a fix.)
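My guess is that the Configuration my jar was building never picked up the
cluster settings, so Hadoop quietly fell back to the local job runner. A tiny
check along these lines would show it (just a sketch; the class name is made
up):

  import org.apache.hadoop.conf.Configuration;

  public class WhereWillItRun {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Without the cluster's config files on the classpath these print the
      // local defaults ("local" and "file:///"), and any job submitted with
      // such a conf runs with the local job runner on the submitting node.
      System.out.println("mapred.job.tracker = "
          + conf.get("mapred.job.tracker", "local"));
      System.out.println("fs.default.name = "
          + conf.get("fs.default.name", "file:///"));
    }
  }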
Thanks,
Galit.
-----Original Message-----
From: Ryan Josal [mailto:[email protected]]
Sent: Monday, July 29, 2013 9:33 PM
To: Adam Baron
Cc: Ryan Josal; [email protected]
Subject: Re: Run more than one mapper for TestForest?
If you're running mahout from the CLI, you'll have to modify the Hadoop config
file or your env manually for each job. This is code I put into my custom job
executions so I didn't have to calculate and set that up every time. Maybe
that's your best route in your situation. You could just provide your own
mahout jar and run it as you would any other Hadoop job and ignore the
installed Mahout. I do think this could be a useful parameter for a number of
standard mahout jobs though; I know I would use it. Does anyone in the mahout
community see this as a generally useful feature for a Mahout job?
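To spell out the config-file route a bit (just a sketch; 16 MB is an arbitrary
example value): take a copy of the cluster's Hadoop conf directory, add
something like

  <property>
    <name>mapred.max.split.size</name>
    <value>16777216</value>
  </property>

to mapred-site.xml in that copy, and point HADOOP_CONF_DIR at it before
invoking the mahout script. Splits are computed on the client side, so a
smaller max split size means more mappers for the same input.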
Ryan
On Jul 29, 2013, at 10:25, Adam Baron <[email protected]> wrote:
> Ryan,
>
> Thanks for the fix, the code looks reasonable to me. Which version of Mahout
> will this be in? 0.9?
>
> Unfortunately, I'm using a large shared Hadoop cluster which is not
> administered by my team. So I'm not in a position to push the latest from the
> Mahout dev trunk into our environment; the admins will only install official
> releases.
>
> Regards,
> Adam
>
> On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal <[email protected]> wrote:
>> Late reply, but for what it's still worth, since I've seen a couple other
>> threads here on the topic of too few mappers, I added a parameter to set a
>> minimum number of mappers. Some of my mahout jobs needed more mappers, but
>> were not given many because of the small input file size.
>>
>> addOption("minMapTasks", "m", "Minimum number of map tasks to run",
>>     String.valueOf(1));
>>
>> int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
>> int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
>> log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
>> if (minMapTasks > mapTasksThatWouldRun) {
>>   String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
>>   log.info("Forcing mapred.max.split.size to " + splitSizeBytes
>>       + " to ensure minimum map tasks = " + minMapTasks);
>>   hadoopConf.set("mapred.max.split.size", splitSizeBytes);
>> }
>>
>> // there is actually a private method in hadoop to calculate this
>> private long getSplitSize() {
>>   long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
>>   long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
>>   int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
>>   long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
>>   log.info(String.format("min: %,d block: %,d max: %,d split: %,d",
>>       minSize, blockSize, maxSize, splitSize));
>>   return splitSize;
>> }
>>
>> It seems like there should be a more straightforward way to do this, but it
>> works for me and I've used it on a lot of jobs to set a minimum number of
>> mappers.
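>>
>> For example, in one of my own jobs built on AbstractJob (the jar and class
>> names below are made up), forcing at least 100 mappers looks like:
>>
>>   hadoop jar my-mahout-jobs.jar com.example.MyClusteringJob \
>>     --minMapTasks 100 -i /path/to/vectors -o /path/to/output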
>>
>> Ryan
>>
>> On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:
>>
>> > I'm attempting to run
>> > org.apache.mahout.classifier.df.mapreduce.TestForest
>> > on a CSV with 200,000 rows that have 500,000 features per row.
>> > However, TestForest is running extremely slowly, likely because only
>> > 1 mapper was assigned to the job. This seems strange because the
>> > org.apache.mahout.classifier.df.mapreduce.BuildForest step on the
>> > same data used 1772 mappers and took about 6 minutes. (BTW: I know I
>> > *shouldn't* use the same data set for the training and the testing
>> > steps; this is purely a technical experiment to see whether Mahout's
>> > Random Forest can handle the data sizes we typically deal with.)
>> >
>> > Any idea on how to get
>> > org.apache.mahout.classifier.df.mapreduce.TestForest
>> > to use more mappers? Glancing at the code (and thinking about what
>> > is happening intuitively), it should be ripe for parallelization.
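>> >
>> > For reference, I'm invoking it roughly like this (paths are placeholders):
>> >
>> >   mahout org.apache.mahout.classifier.df.mapreduce.TestForest \
>> >     -i /path/to/test.csv -ds /path/to/dataset.info \
>> >     -m /path/to/forest -a -mr -o /path/to/predictions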
>> >
>> > Thanks,
>> > Adam
>