Hi all, I have been playing around with the Random Decision Forests in Mahout. The classifier seems to produce good results with the test programs.
I am wondering whether this classifier can be used on larger data sets, with around 35,000 features and 100k+ message instances to classify, on a small Hadoop cluster or even a single-node development install. Has anyone used the Random Forest classifier to work with massive data sets reliably and with high accuracy? My previous experience with the RF model has been good for sparse data sets, and I think this is one area where Mahout could really shine. The data sets I'm testing with now are just too large for tools like Weka and even R to work well, so I was hoping Mahout might be the answer for this problem as well. So is it worth working with the Random Forest classifier to get a production or near-production system running? Does anyone have any examples or stories of their Mahout RF usage? Thanks!
