Late reply, but for what it's worth: I've seen a couple of other threads
here about too few mappers, so here is what I did. I added a parameter to
set a minimum number of mappers. Some of my Mahout jobs needed more mappers
but were not given many because the input file was small.

        addOption("minMapTasks", "m", "Minimum number of map tasks to run", String.valueOf(1));


        int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
        int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
        log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
        if (minMapTasks > mapTasksThatWouldRun) {
            String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
            log.info("Forcing mapred.max.split.size to " + splitSizeBytes
                + " to ensure minimum map tasks = " + minMapTasks);
            hadoopConf.set("mapred.max.split.size", splitSizeBytes);
        }

    // there is actually a private method in hadoop to calculate this
    private long getSplitSize() {
        long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
        long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
        long minSize = hadoopConf.getLong("mapred.min.split.size", 1L);
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        log.info(String.format("min: %,d block: %,d max: %,d split: %,d",
            minSize, blockSize, maxSize, splitSize));
        return splitSize;
    }

It seems like there should be a more straightforward way to do this, but it 
works for me and I've used it on a lot of jobs to set a minimum number of 
mappers.
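
For anyone who wants to check the arithmetic without a cluster, here is a
minimal standalone sketch of the same calculation in plain Java, with no
Hadoop dependency. The class name, the method names, and the 10 MB file
size are made up for illustration; the 64 MB block size is just the default
used in the snippet above. The "+ 1" estimate is the same rough one as
above, so it can overcount by one relative to what Hadoop actually
schedules.

```java
public class MinMapTasksSketch {

    // Same formula as getSplitSize() above: max(minSize, min(maxSize, blockSize)).
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Rough estimate of how many map tasks a file of this size would get.
    static int estimatedMapTasks(long fileSizeBytes, long splitSizeBytes) {
        return (int) (fileSizeBytes / splitSizeBytes) + 1;
    }

    // Value to put into mapred.max.split.size so the file is cut
    // into at least minMapTasks pieces.
    static long maxSplitSizeFor(long fileSizeBytes, int minMapTasks) {
        return fileSizeBytes / minMapTasks;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // default block size assumed above
        long fileSize = 10L * 1024 * 1024;  // hypothetical small input file
        int minMapTasks = 8;                // hypothetical requested minimum

        long current = splitSize(1L, Long.MAX_VALUE, blockSize);
        int tasks = estimatedMapTasks(fileSize, current);
        System.out.println("split: " + current + " estimated map tasks: " + tasks);

        if (minMapTasks > tasks) {
            long forcedMax = maxSplitSizeFor(fileSize, minMapTasks);
            long forced = splitSize(1L, forcedMax, blockSize);
            System.out.println("forced max split size: " + forcedMax
                + " new split: " + forced);
        }
    }
}
```

The point of the sketch is that a file smaller than one block always ends
up as a single split unless the max split size is lowered below the file
size, which is all the minMapTasks option does.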

Ryan

On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:

> I'm attempting to run org.apache.mahout.classifier.df.mapreduce.TestForest
> on a CSV with 200,000 rows that have 500,000 features per row.
> However, TestForest is  running extremely slow, likely because only 1
> mapper was assigned to the job.  This seems strange because
> the org.apache.mahout.classifier.df.mapreduce.BuildForest step on the same
> data used 1772 mappers and took about 6 minutes.  (BTW: I know I
> *shouldn't* use the same data set for the training and the testing steps;
> this is purely a technical experiment to see if Mahout's Random Forest can
> handle the data sizes we typically deal with).
> 
> Any idea on how to get org.apache.mahout.classifier.df.mapreduce.TestForest
> to use more mappers?  Glancing at the code (and thinking about what is
> happening intuitively), it should be ripe for parallelization.
> 
> Thanks,
>        Adam
