Q1. The "InMem Mapred Implementation" should be used when the whole dataset can 
fit into memory (in-mem). In this case every mapper trains a subset of the 
trees over the whole dataset. 
The "Partial Mapred Implementation" should be used if the dataset is too big to 
fit into memory, or if training the trees over the entire dataset takes too 
long. This implementation splits the training data into as many partitions 
as there are available mappers. Each mapper grows a subset of the trees using 
only its partition (partial access to the training data).
If possible you should use the "InMem Mapred Implementation" because every tree 
is grown using all the available training data. But if you are using Mahout it's 
probably because the training data is big, so you have no other choice than 
to use the "Partial Mapred Implementation". That being said, I've found that the 
partial implementation gives similar results to the InMem implementation and 
runs a lot faster because each mapper uses only a subset of the training data.
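
To make the difference concrete, here is a rough sketch of how the trees and 
the training rows could be divided among mappers in the two modes. This is not 
Mahout's actual code: the class and method names are made up for illustration, 
and the even split of trees and rows is an assumption.

```java
import java.util.Arrays;

// Illustrative only: hypothetical names, not part of Mahout.
public class ForestPartitioning {

    // Spread numTrees over numMappers as evenly as possible (assumed policy).
    static int[] treesPerMapper(int numTrees, int numMappers) {
        int[] counts = new int[numMappers];
        for (int i = 0; i < numMappers; i++) {
            counts[i] = numTrees / numMappers + (i < numTrees % numMappers ? 1 : 0);
        }
        return counts;
    }

    // Row range [start, end) of the training data visible to each mapper.
    // partial == false: every mapper sees all rows (in-memory mode).
    // partial == true:  every mapper sees only its own partition.
    static int[][] visibleRows(int numRows, int numMappers, boolean partial) {
        int[][] ranges = new int[numMappers][2];
        int start = 0;
        for (int i = 0; i < numMappers; i++) {
            if (partial) {
                int size = numRows / numMappers + (i < numRows % numMappers ? 1 : 0);
                ranges[i][0] = start;
                ranges[i][1] = start + size;
                start += size;
            } else {
                ranges[i][0] = 0;
                ranges[i][1] = numRows;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(treesPerMapper(10, 4)));
        System.out.println(Arrays.deepToString(visibleRows(1000, 4, false)));
        System.out.println(Arrays.deepToString(visibleRows(1000, 4, true)));
    }
}
```

In both modes each mapper grows only its share of the trees; what changes is 
how much of the training data each of those trees gets to see.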

Q2. Yes, in that case all the attributes are considered at each node.
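
As an illustration of why asking for 20 attributes out of 18 ends up using all 
of them, here is a small sketch of per-node attribute sampling. It assumes the 
requested count is simply capped at the number of available attributes; the 
names are hypothetical and this is not taken from the Mahout source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only: hypothetical names, not part of Mahout.
public class AttributeSampling {

    // Pick up to m distinct attribute indices out of numAttributes.
    // Assumption: a request larger than numAttributes is capped.
    static List<Integer> sampleAttributes(Random rng, int numAttributes, int m) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numAttributes; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, rng);
        int k = Math.min(m, numAttributes);
        List<Integer> chosen = new ArrayList<>(indices.subList(0, k));
        Collections.sort(chosen); // sorted only to make the output easy to read
        return chosen;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // Asking for 20 out of 18 yields all 18 attributes.
        System.out.println(sampleAttributes(rng, 18, 20));
        // Asking for 5 out of 18 yields a random subset of 5.
        System.out.println(sampleAttributes(rng, 18, 5));
    }
}
```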

Q3. The current implementation uses Information Gain to select the best split 
at each node. The code is modular enough to let you plug in your own 
TreeBuilder when growing the trees (although, for now, only one implementation 
of TreeBuilder is available).
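
Note that information gain is itself defined in terms of entropy: it is the 
entropy of the parent node minus the size-weighted entropy of the children. A 
minimal sketch of that textbook calculation, using per-class counts, could look 
like this (hypothetical class name, not the code behind Mahout's TreeBuilder):

```java
// Illustrative only: textbook information gain, not Mahout's implementation.
public class InformationGainSketch {

    // Shannon entropy (base 2) of a node described by its per-class counts.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : classCounts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Parent entropy minus the size-weighted entropy of the child nodes.
    static double informationGain(int[] parent, int[][] children) {
        int total = 0;
        for (int c : parent) {
            total += c;
        }
        double weighted = 0.0;
        for (int[] child : children) {
            int size = 0;
            for (int c : child) {
                size += c;
            }
            if (size > 0) {
                weighted += ((double) size / total) * entropy(child);
            }
        }
        return entropy(parent) - weighted;
    }

    public static void main(String[] args) {
        // A split that separates two balanced classes perfectly.
        System.out.println(informationGain(new int[]{5, 5}, new int[][]{{5, 0}, {0, 5}}));
        // A split that leaves the class mix unchanged.
        System.out.println(informationGain(new int[]{4, 4}, new int[][]{{2, 2}, {2, 2}}));
    }
}
```

The split with the highest gain among the candidate attributes is the one 
chosen at the node.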

--- On Mon, 14.6.10, Karan Jindal <[email protected]> wrote:

> From: Karan Jindal <[email protected]>
> Subject: Reg: Random Forest in mahout 0.3
> To: [email protected]
> Date: Monday, 14 June 2010, 14:01
> 
> Hi all,
> I have a few questions about random forest. Can anyone
> throw light on
> the following questions?
> 
> Q1. What's the difference between "InMem Mapred
> implementation" and
> "Partial Mapred implementation"? Is there any performance
> (in terms of
> efficiency of random forest) trade-off between the two?
> 
> Q2. In training the total number of attributes is 18, and by
> mistake I gave 20
> (-sl 20) attributes on the command line during the training phase.
> In this case,
> does the implementation consider all the attributes while
> making a decision at
> a node?
> 
> Q3. Which approach (information gain or entropy model) is
> used to classify
> the data at a given node?
> 
> --Karan
> 