I'm attempting to run org.apache.mahout.classifier.df.mapreduce.TestForest
on a CSV with 200,000 rows and 500,000 features per row.
However, TestForest is running extremely slowly, likely because only 1
mapper was assigned to the job. This seems strange because
the org.apache.mahout.classifier.df.mapreduce.BuildForest step on the same
data used 1772 mappers and took about 6 minutes. (BTW: I know I
*shouldn't* use the same data set for the training and the testing steps;
this is purely a technical experiment to see if Mahout's Random Forest can
handle the data sizes we typically deal with.)

Any ideas on how to get org.apache.mahout.classifier.df.mapreduce.TestForest
to use more mappers? Glancing at the code (and thinking about what is
happening intuitively), it looks ripe for parallelization.

Thanks,
        Adam
