Galit, yes, this does sound related, and as Matt said, you can test this by 
setting the max split size on the CLI.  I didn't personally find that to be a 
reliable or efficient approach, so I added the -m parameter to my job to set 
it correctly every time.  It seems that this would be useful as a general 
parameter for Mahout jobs; is there agreement on this, and if so, can I get 
some guidance on how to contribute?
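
If the option were added to the standard drivers, an invocation might look
something like this (a sketch only: -m/--minMapTasks is the custom option
from my job quoted below, not an existing flag on the stock drivers, and the
paths and values are placeholders):

    mahout kmeans -i /user/me/vectors -c /user/me/initial-clusters \
        -o /user/me/kmeans-output -k 50 -x 10 \
        -m 100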

Ryan

On Aug 1, 2013, at 8:00, Matt Molek <[email protected]> wrote:

> One trick to getting more mappers on a job when running from the command
> line is to pass a '-Dmapred.max.split.size=xxxx' argument. The xxxx is a
> size in bytes. So if you have some hypothetical 10MB input set, but you
> want to force ~100 mappers, use '-Dmapred.max.split.size=100000' (about
> 100KB per split, so ~100 splits).
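> 
> For example, something like this (illustrative only; the job, paths, and
> values are placeholders):
> 
>     mahout kmeans -Dmapred.max.split.size=100000 \
>         -i /user/me/vectors -c /user/me/initial-clusters \
>         -o /user/me/kmeans-output -k 50 -x 10
> 
> Note that the -D arguments generally need to come before the job-specific
> options so the Hadoop generic options parser picks them up.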
> 
> 
> On Wed, Jul 31, 2013 at 4:57 AM, Fuhrmann Alpert, Galit 
> <[email protected]>wrote:
> 
>> 
>> Hi,
>> 
>> It sounds to me like this could be related to one of the questions I posted
>> several days ago (is it?):
>> My mahout clustering processes seem to be running very slowly (several
>> hours on just ~1M items), and I'm wondering if there's anything that needs
>> to be changed in the settings/configuration (and how?).
>>        I'm running on a large cluster and could potentially use thousands
>> of nodes (mappers/reducers). However, my mahout processes (kmeans/canopy)
>> are only using at most 5 mappers (I tried it on several data sets).
>>        I've tried to set the number of mappers with something like
>> -Dmapred.map.tasks=100, but this didn't seem to have any effect; it still
>> uses <=5 mappers.
>>        Is there a different way to set the number of mappers/reducers for
>> a mahout process?
>>        Or is there another configuration issue I need to consider?
>> 
>> I'd definitely be happy to use such a parameter; does it not already exist?
>> (I'm running mahout as installed on the cluster)
>> 
>> Is there currently a workaround, besides running a mahout jar as a hadoop
>> job?
>> When I originally tried to run a mahout jar that uses KMeansDriver (one
>> that runs great on my local machine), it did not even initiate a job on
>> the hadoop cluster. It seemed to be running in parallel, but in fact it
>> was running only on the local node.  Is this a known issue? Is there a fix
>> for it? (I ended up dropping that approach and calling mahout step by step
>> from the command line, but I'd be happy to know if there is a fix.)
>> 
>> Thanks,
>> 
>> Galit.
>> 
>> -----Original Message-----
>> From: Ryan Josal [mailto:[email protected]]
>> Sent: Monday, July 29, 2013 9:33 PM
>> To: Adam Baron
>> Cc: Ryan Josal; [email protected]
>> Subject: Re: Run more than one mapper for TestForest?
>> 
>> If you're running mahout from the CLI, you'll have to modify the Hadoop
>> config file or your env manually for each job.  This is code I put into my
>> custom job executions so I didn't have to calculate and set that up every
>> time.  Maybe that's your best route in your position.  You could just
>> provide your own mahout jar and run it as you would any other Hadoop job
>> and ignore the installed Mahout.  I do think this could be a useful
>> parameter for a number of standard mahout jobs, though; I know I would use
>> it.  Does anyone in the mahout community see this as a generally useful
>> feature for a Mahout job?
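>> 
>> For example, running a self-contained jar might look something like this
>> (illustrative only; the jar name, driver class, and paths are
>> placeholders):
>> 
>>     hadoop jar my-mahout-jobs.jar com.example.MyClusteringJob \
>>         -Dmapred.max.split.size=100000 \
>>         -i /user/me/vectors -o /user/me/output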
>> 
>> Ryan
>> 
>> On Jul 29, 2013, at 10:25, Adam Baron <[email protected]> wrote:
>> 
>>> Ryan,
>>> 
>>> Thanks for the fix; the code looks reasonable to me.  Which version of
>>> Mahout will this be in?  0.9?
>>> 
>>> Unfortunately, I'm using a large shared Hadoop cluster which is not
>>> administered by my team.  So I'm not in a position to push the latest
>>> from the Mahout dev trunk into our environment; the admins will only
>>> install official releases.
>>> 
>>> Regards,
>>>          Adam
>>> 
>>> On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal <[email protected]> wrote:
>>>> Late reply, but for what it's still worth, since I've seen a couple
>>>> other threads here on the topic of too few mappers, I added a parameter
>>>> to set a minimum number of mappers.  Some of my mahout jobs needed more
>>>> mappers, but were not given many because of the small input file size.
>>>> 
>>>>        // expose a "minMapTasks" option (short name -m), default 1
>>>>        addOption("minMapTasks", "m",
>>>>            "Minimum number of map tasks to run", String.valueOf(1));
>>>> 
>>>> 
>>>>        int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
>>>>        // how many map tasks the current split size would produce
>>>>        int mapTasksThatWouldRun =
>>>>            (int) (vectorFileSizeBytes / getSplitSize()) + 1;
>>>>        log.info("map tasks min: " + minMapTasks
>>>>            + " current: " + mapTasksThatWouldRun);
>>>>        if (minMapTasks > mapTasksThatWouldRun) {
>>>>            // shrink the max split size so the input yields at least
>>>>            // minMapTasks splits
>>>>            String splitSizeBytes =
>>>>                String.valueOf(vectorFileSizeBytes / minMapTasks);
>>>>            log.info("Forcing mapred.max.split.size to " + splitSizeBytes
>>>>                + " to ensure minimum map tasks = " + minMapTasks);
>>>>            hadoopConf.set("mapred.max.split.size", splitSizeBytes);
>>>>        }
>>>> 
>>>>    // there is actually a private method in hadoop to calculate this
>>>>    private long getSplitSize() {
>>>>        long blockSize = hadoopConf.getLong("dfs.block.size",
>>>>            64 * 1024 * 1024);
>>>>        long maxSize = hadoopConf.getLong("mapred.max.split.size",
>>>>            Long.MAX_VALUE);
>>>>        int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
>>>>        // effective split size: max(minSize, min(maxSize, blockSize))
>>>>        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
>>>>        log.info(String.format("min: %,d block: %,d max: %,d split: %,d",
>>>>            minSize, blockSize, maxSize, splitSize));
>>>>        return splitSize;
>>>>    }
>>>> 
>>>> It seems like there should be a more straightforward way to do this,
>>>> but it works for me and I've used it on a lot of jobs to set a minimum
>>>> number of mappers.
>>>> 
>>>> Ryan
>>>> 
>>>> On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:
>>>> 
>>>>> I'm attempting to run
>>>>> org.apache.mahout.classifier.df.mapreduce.TestForest
>>>>> on a CSV with 200,000 rows that have 500,000 features per row.
>>>>> However, TestForest is running extremely slowly, likely because only
>>>>> 1 mapper was assigned to the job.  This seems strange because the
>>>>> org.apache.mahout.classifier.df.mapreduce.BuildForest step on the
>>>>> same data used 1772 mappers and took about 6 minutes.  (BTW: I know I
>>>>> *shouldn't* use the same data set for the training and the testing
>>>>> steps; this is purely a technical experiment to see if Mahout's
>>>>> Random Forest can handle the data sizes we typically deal with.)
>>>>> 
>>>>> Any idea on how to get
>>>>> org.apache.mahout.classifier.df.mapreduce.TestForest
>>>>> to use more mappers?  Glancing at the code (and thinking about what
>>>>> is happening intuitively), it should be ripe for parallelization.
>>>>> 
>>>>> Thanks,
>>>>>       Adam
>> 
