mapred.max.split.size controls how many partitions are generated from the data. The current implementation of random forest is fairly memory intensive, and because all the work is done in the mappers' close() method, when the data is big, Hadoop may conclude that the mappers have failed (I will solve this problem some day). You should increase the number of partitions by reducing mapred.max.split.size. A value of "3200000" should give you 10 partitions, which should be OK; if not, try reducing it further, for example to "1000000". In general, start with a large number of partitions, then reduce that number as long as the job doesn't fail. Depending on your data, the number of partitions can influence the quality of the generated random forest.
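As a rough sanity check (the 32 MB figure comes from the block size mentioned below; exact split counts also depend on block boundaries), the number of partitions is approximately the input size divided by mapred.max.split.size:

```python
import math

def estimated_partitions(data_size_bytes: int, max_split_size: int) -> int:
    """Rough estimate: Hadoop creates about one input split per
    max_split_size bytes of input (ignoring block boundaries)."""
    return math.ceil(data_size_bytes / max_split_size)

# ~32 MB of input with mapred.max.split.size=3200000 -> about 10 mappers
print(estimated_partitions(32_000_000, 3_200_000))  # -> 10
```

This is only an estimate; check the actual number of map tasks reported by the JobTracker for your job.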
I hope this solves your problem. Thank you for choosing Mahout Air Lines ;)

--- On Tue, 8 Jun 2010, Karan Jindal <[email protected]> wrote:

> From: Karan Jindal <[email protected]>
> Subject: Reg: Maximum Split size in Random Forest
> To: [email protected]
> Date: Tuesday, 8 June 2010, 13:21
>
> Hi all,
>
> In the following tutorial for running the random forest,
> https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
> a maximum split size of "1874231" is used. When I didn't specify this
> on the command line and the block size of the data on HDFS is 32 MB,
> it gives a "StackOverflow" error. To overcome this I increased the
> heap size of the child JVM to 2 GB, but then it either gives the same
> overflow error or the process hangs.
>
> Does anyone have any idea about this?
>
> Regards
> Karan
