mapred.max.split.size controls how many partitions are generated from the data. The current implementation of random forest is fairly memory intensive, and because all the work is done in the mappers' close() method, when the data is big, Hadoop may conclude that the mappers have failed (I will solve this problem some day). You should increase the number of partitions by reducing mapred.max.split.size. A value of "3200000" should give you 10 partitions, which should be OK; if not, try reducing it further, for example to "1000000". In general, start with a large number of partitions, then reduce that number as long as the job doesn't fail. Depending on your data, the number of partitions can influence the quality of the generated random forest.
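As a rough sanity check (the 32 MB figure comes from the block size mentioned below; exact split counts also depend on block boundaries), the number of partitions is approximately the input size divided by mapred.max.split.size:

```python
import math

def estimated_partitions(data_size_bytes: int, max_split_size: int) -> int:
    """Rough estimate: Hadoop creates about one input split per
    max_split_size bytes of input (ignoring block boundaries)."""
    return math.ceil(data_size_bytes / max_split_size)

# ~32 MB of input with mapred.max.split.size=3200000 -> about 10 mappers
print(estimated_partitions(32_000_000, 3_200_000))  # -> 10
```

This is only an estimate; check the actual number of map tasks reported by the JobTracker for your job.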
I hope this solves your problem. Thank you for choosing Mahout Air Lines ;)

--- On Tue, 8 Jun 2010, Karan Jindal <[email protected]> wrote:

> From: Karan Jindal <[email protected]>
> Subject: Reg: Maximum Split size in Random Forest
> To: [email protected]
> Date: Tuesday, 8 June 2010, 13:21
>
> Hi all,
>
> In the following tutorial for running the random forest,
> https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
> a maximum split size of "1874231" is used. When I didn't specify this
> on the command line and the block size of the data on HDFS is 32 MB,
> it gives a "StackOverflow" error. To overcome this I increased the
> heap size of the child JVM to 2 GB, but then it either gives the same
> overflow error or the process hangs.
>
> Does anyone have any idea about this?
>
> Regards
> Karan
