I haven't had much luck with random forests (vs. other methods) in production. It's harder to control the regularization and thresholds. If you have 35K features, chances are your data is linearly separable anyway, so you might as well stick to the logistic regression in Mahout.
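To make the point about control concrete, here is a rough sketch of what I mean. It is plain Python, not Mahout's API, and every name in it (train, classify, l2, threshold) is made up for illustration. It assumes sparse examples stored as {feature_index: value} dicts with 0/1 labels, and it only applies the L2 penalty to features present in the current example, which is a common shortcut for sparse data rather than an exact penalty.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, features):
    # features is a sparse dict {feature_index: value}; absent weights count as 0.
    z = sum(weights.get(i, 0.0) * v for i, v in features.items())
    return sigmoid(z)


def train(examples, l2=0.01, learning_rate=0.1, epochs=20, seed=0):
    # Stochastic gradient descent on the log loss with an L2 penalty.
    # Assumption: the penalty is applied only to the features active in
    # each example, so the cost per update stays proportional to sparsity.
    rng = random.Random(seed)
    weights = {}
    order = list(examples)
    for _ in range(epochs):
        rng.shuffle(order)
        for features, label in order:
            error = predict_proba(weights, features) - label
            for i, v in features.items():
                w = weights.get(i, 0.0)
                weights[i] = w - learning_rate * (error * v + l2 * w)
    return weights


def classify(weights, features, threshold=0.5):
    # The decision threshold is an explicit knob, separate from training.
    return predict_proba(weights, features) >= threshold


# Toy data (hypothetical): feature 0 is a bias term, 7 marks the positive
# class, 9 marks the negative class, 3 is noise.
examples = [
    ({0: 1.0, 7: 1.0}, 1),
    ({0: 1.0, 7: 1.0, 3: 1.0}, 1),
    ({0: 1.0, 9: 1.0}, 0),
    ({0: 1.0, 9: 1.0, 3: 1.0}, 0),
]
weights = train(examples)
```

The two knobs I care about are both single numbers here: l2 sets how hard the weights are shrunk, and threshold trades precision against recall after training without retraining anything. With a forest you end up tuning tree depth, tree count and vote cutoffs to get the same effect.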

On Fri, Jul 29, 2011 at 9:25 AM, Night Wolf <[email protected]> wrote:

> Hi all,
>
> I have been playing around with the Random Decision Forests in Mahout.
> Seems like the classifier produces good results using the test programs.
>
> I am wondering if this classifier can be used on larger data sets with
> around 35,000 features and 100k+ message instances to classify on a small
> Hadoop cluster or even a single node development install?
>
> Has anyone used the Random forest classifier to work with massive data sets
> reliably and with high accuracy. My previous experience using the RF model
> has been good for sparse data sets and I think this is one area Mahout could
> really shine. Using tools like Weka and even R, the data sets I'm testing
> with now are just to large for these tools to work well so I was hoping
> Mahout may be the answer for this problem as well.
>
> So is it worth working with the Random Forest classifier to get a production
> or near to production system running?
>
> Does anyone have any examples and stories of their Mahout RF usage?
>
> Thanks!
>

--
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)
