Hi Asim, I don't think it's necessary to back-port featureImportances to mllib.tree.RandomForest. You can use ml.RandomForestClassifier or ml.RandomForestRegressor directly.
Yanbo

2015-12-17 19:39 GMT+08:00 Asim Jalis <[email protected]>:

> Yanbo,
>
> Thanks for the reply.
>
> Is there a JIRA for exposing featureImportances on
> org.apache.spark.mllib.tree.RandomForest, or could you create one? I am
> unable to create an issue on JIRA against Spark.
>
> Thanks.
>
> Asim
>
> On Thu, Dec 17, 2015 at 12:07 AM, Yanbo Liang <[email protected]> wrote:
>
>> Hi Asim,
>>
>> featureImportances is only exposed in ML, not MLlib.
>> You need to update your code to use RandomForestClassifier of ML to train
>> and get a RandomForestClassificationModel. Then you can call
>> RandomForestClassificationModel.featureImportances
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L237>
>> to get the importance of each feature.
>>
>> For how to use RandomForestClassifier, you can refer to this example
>> <https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala>.
>>
>> Yanbo
>>
>> 2015-12-17 13:41 GMT+08:00 Asim Jalis <[email protected]>:
>>
>>> I wanted to get the feature importances of a Random Forest as
>>> described in this JIRA: https://issues.apache.org/jira/browse/SPARK-5133
>>>
>>> However, I don't see how to call this. I don't see any such methods
>>> exposed on
>>>
>>> org.apache.spark.mllib.tree.RandomForest
>>>
>>> How can I get featureImportances when I generate a RandomForest model
>>> with this code?
>>>
>>> import org.apache.spark.mllib.linalg.Vectors
>>> import org.apache.spark.mllib.regression.LabeledPoint
>>> import org.apache.spark.mllib.tree.RandomForest
>>> import org.apache.spark.mllib.tree.model.RandomForestModel
>>> import org.apache.spark.mllib.util.MLUtils
>>> import org.apache.spark.rdd.RDD
>>> import util.Random
>>>
>>> def displayModel(model: RandomForestModel) = {
>>>   // Display model.
>>>   println("Learned classification tree model:\n" + model.toDebugString)
>>> }
>>>
>>> def saveModel(model: RandomForestModel, path: String) = {
>>>   // Save and reload model.
>>>   model.save(sc, path)
>>>   val sameModel = RandomForestModel.load(sc, path)
>>> }
>>>
>>> def testModel(model: RandomForestModel, testData: RDD[LabeledPoint]) = {
>>>   // Test model.
>>>   val labelAndPreds = testData.map { point =>
>>>     val prediction = model.predict(point.features)
>>>     (point.label, prediction)
>>>   }
>>>   val testErr = labelAndPreds.
>>>     filter(r => r._1 != r._2).count.toDouble / testData.count()
>>>   println("Test Error = " + testErr)
>>> }
>>>
>>> def buildModel(trainingData: RDD[LabeledPoint],
>>>     numClasses: Int, categoricalFeaturesInfo: Map[Int, Int]) = {
>>>   val numTrees = 30
>>>   val featureSubsetStrategy = "auto"
>>>   val impurity = "gini"
>>>   val maxDepth = 4
>>>   val maxBins = 32
>>>
>>>   // Build model.
>>>   val model = RandomForest.trainClassifier(
>>>     trainingData, numClasses, categoricalFeaturesInfo,
>>>     numTrees, featureSubsetStrategy, impurity, maxDepth,
>>>     maxBins)
>>>
>>>   model
>>> }
>>>
>>> // Create plain RDD.
>>> val rdd = sc.parallelize(Range(0, 1000))
>>>
>>> // Convert to LabeledPoint RDD.
>>> val data = rdd.
>>>   map(x => {
>>>     val label = x % 2
>>>     val feature1 = x % 5
>>>     val feature2 = x % 7
>>>     val features = Seq(feature1, feature2).
>>>       map(_.toDouble).
>>>       zipWithIndex.
>>>       map(_.swap)
>>>     val vector = Vectors.sparse(features.size, features)
>>>     val point = new LabeledPoint(label, vector)
>>>     point
>>>   })
>>>
>>> // Split data into training (70%) and test (30%).
>>> val splits = data.randomSplit(Array(0.7, 0.3))
>>> val (trainingData, testData) = (splits(0), splits(1))
>>>
>>> // Set up parameters for training.
>>> val numClasses = data.map(_.label).distinct.count.toInt
>>> val categoricalFeaturesInfo = Map[Int, Int]()
>>>
>>> val model = buildModel(
>>>   trainingData,
>>>   numClasses,
>>>   categoricalFeaturesInfo)
>>> testModel(model, testData)
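Following Yanbo's suggestion, the synthetic data above can be trained through the ml API instead of mllib, which makes featureImportances available on the fitted model. This is a rough sketch only, modeled on the RandomForestClassifierExample linked earlier; it assumes a Spark 1.6-era spark-shell with `sc` and `sqlContext` in scope, and the StringIndexer stage is there because ml classifiers of that era needed label metadata to infer the number of classes:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import sqlContext.implicits._

// Same synthetic data as above, but as a DataFrame with
// "label" and "features" columns for the ml API.
val data = sc.parallelize(0 until 1000).map { x =>
  LabeledPoint((x % 2).toDouble,
    Vectors.dense((x % 5).toDouble, (x % 7).toDouble))
}.toDF()

// Index the label so the classifier knows the number of classes.
val indexer = new StringIndexer().
  setInputCol("label").
  setOutputCol("indexedLabel")

// Same hyperparameters as the mllib version above.
val rf = new RandomForestClassifier().
  setLabelCol("indexedLabel").
  setFeaturesCol("features").
  setNumTrees(30).
  setImpurity("gini").
  setMaxDepth(4).
  setMaxBins(32)

val pipeline = new Pipeline().setStages(Array(indexer, rf))
val model = pipeline.fit(data)

// Extract the forest stage and print its per-feature importances
// (a Vector with one entry per feature).
val rfModel = model.stages(1).asInstanceOf[RandomForestClassificationModel]
println(rfModel.featureImportances)
```

Once the model is an ml RandomForestClassificationModel rather than an mllib RandomForestModel, featureImportances comes for free; the rest of the pipeline (train/test split, error computation) carries over with DataFrame equivalents.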
