Ryan,

Thanks for the fix; the code looks reasonable to me.  Which version of
Mahout will this land in?  0.9?

Unfortunately, I'm using a large shared Hadoop cluster that is not
administered by my team.  So I'm not in a position to push the latest from
the Mahout dev trunk into our environment; the admins will only install
official releases.
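
In the meantime, I may try forcing a smaller split size from the command
line instead.  A rough sketch of what I have in mind, assuming TestForest's
driver runs through ToolRunner (so it accepts generic -D options) and that
its input format honors mapred.max.split.size; the jar name and size value
below are just placeholders:

    hadoop jar mahout-core-job.jar \
        org.apache.mahout.classifier.df.mapreduce.TestForest \
        -Dmapred.max.split.size=8388608 <usual TestForest arguments>

If the input format ignores that property this won't help, but at least it
doesn't require installing anything from trunk.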

Regards,
          Adam

On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal <[email protected]> wrote:

> Late reply, but for what it's still worth (since I've seen a couple of
> other threads here about too few mappers): I added a parameter to set a
> minimum number of map tasks.  Some of my Mahout jobs needed more mappers,
> but weren't given enough because of their small input file sizes.
>
>         // In the driver: expose a minMapTasks option (default 1).
>         addOption("minMapTasks", "m", "Minimum number of map tasks to run", String.valueOf(1));
>
>         int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
>         // Map tasks the input would get at the current split size.
>         int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
>         log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
>         if (minMapTasks > mapTasksThatWouldRun) {
>             // Shrink the max split size so the input yields at least minMapTasks splits.
>             String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
>             log.info("Forcing mapred.max.split.size to " + splitSizeBytes + " to ensure minimum map tasks = " + minMapTasks);
>             hadoopConf.set("mapred.max.split.size", splitSizeBytes);
>         }
>
>     // There is actually a private method in Hadoop to calculate this;
>     // this mirrors its formula: max(minSplitSize, min(maxSplitSize, blockSize)).
>     private long getSplitSize() {
>         long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
>         long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
>         int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
>         long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
>         log.info(String.format("min: %,d block: %,d max: %,d split: %,d", minSize, blockSize, maxSize, splitSize));
>         return splitSize;
>     }
>
> It seems like there should be a more straightforward way to do this, but
> it works for me and I've used it on a lot of jobs to set a minimum number
> of mappers.
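>
> Incidentally, if the driver runs through ToolRunner, you may be able to
> get the same effect with no code change by passing the property as a
> generic option; a rough sketch (the jar, class, and size here are
> placeholders, and this assumes the job's input format honors
> mapred.max.split.size):
>
>     hadoop jar my-mahout-job.jar some.driver.JobClass \
>         -Dmapred.max.split.size=1048576 <usual job arguments>
>
> The custom option above just automates picking the right value from the
> input size.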
>
> Ryan
>
> On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:
>
> > I'm attempting to run org.apache.mahout.classifier.df.mapreduce.TestForest
> > on a CSV with 200,000 rows and 500,000 features per row.  However,
> > TestForest is running extremely slowly, likely because only 1 mapper was
> > assigned to the job.  This seems strange because the
> > org.apache.mahout.classifier.df.mapreduce.BuildForest step on the same
> > data used 1772 mappers and took about 6 minutes.  (BTW: I know I
> > *shouldn't* use the same data set for the training and testing steps;
> > this is purely a technical experiment to see whether Mahout's Random
> > Forest can handle the data sizes we typically deal with.)
> >
> > Any idea how to get org.apache.mahout.classifier.df.mapreduce.TestForest
> > to use more mappers?  Glancing at the code (and thinking about what's
> > happening intuitively), it should be ripe for parallelization.
> >
> > Thanks,
> >        Adam
>
>
