Hi Asim, I don't think it's necessary to back-port featureImportances to mllib.tree.RandomForest. You can use ml.RandomForestClassifier or ml.RandomForestRegressor directly.
Yanbo

2015-12-17 19:39 GMT+08:00 Asim Jalis <[email protected]>:

> Yanbo,
>
> Thanks for the reply.
>
> Is there a JIRA for exposing featureImportances on
> org.apache.spark.mllib.tree.RandomForest, or could you create one? I am
> unable to create an issue on JIRA against Spark.
>
> Thanks.
>
> Asim
>
> On Thu, Dec 17, 2015 at 12:07 AM, Yanbo Liang <[email protected]> wrote:
>
>> Hi Asim,
>>
>> featureImportances is only exposed in ML, not MLlib.
>> You need to update your code to use RandomForestClassifier of ML to train
>> and get a RandomForestClassificationModel. Then you can call
>> RandomForestClassificationModel.featureImportances
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L237>
>> to get the importance of each feature.
>>
>> For how to use RandomForestClassifier, you can refer to this example
>> <https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala>.
>>
>> Yanbo
>>
>> 2015-12-17 13:41 GMT+08:00 Asim Jalis <[email protected]>:
>>
>>> I wanted to get the feature importances of a Random Forest as
>>> described in this JIRA: https://issues.apache.org/jira/browse/SPARK-5133
>>>
>>> However, I don't see how to call this. I don't see any such methods
>>> exposed on
>>>
>>> org.apache.spark.mllib.tree.RandomForest
>>>
>>> How can I get featureImportances when I generate a RandomForest model
>>> with this code?
>>>
>>> import org.apache.spark.mllib.linalg.Vectors
>>> import org.apache.spark.mllib.regression.LabeledPoint
>>> import org.apache.spark.mllib.tree.RandomForest
>>> import org.apache.spark.mllib.tree.model.RandomForestModel
>>> import org.apache.spark.mllib.util.MLUtils
>>> import org.apache.spark.rdd.RDD
>>> import util.Random
>>>
>>> def displayModel(model: RandomForestModel) = {
>>>   // Display model.
>>>   println("Learned classification tree model:\n" + model.toDebugString)
>>> }
>>>
>>> def saveModel(model: RandomForestModel, path: String) = {
>>>   // Save and reload model.
>>>   model.save(sc, path)
>>>   val sameModel = RandomForestModel.load(sc, path)
>>> }
>>>
>>> def testModel(model: RandomForestModel, testData: RDD[LabeledPoint]) = {
>>>   // Test model.
>>>   val labelAndPreds = testData.map { point =>
>>>     val prediction = model.predict(point.features)
>>>     (point.label, prediction)
>>>   }
>>>   val testErr = labelAndPreds.
>>>     filter(r => r._1 != r._2).count.toDouble / testData.count()
>>>   println("Test Error = " + testErr)
>>> }
>>>
>>> def buildModel(trainingData: RDD[LabeledPoint],
>>>     numClasses: Int, categoricalFeaturesInfo: Map[Int, Int]) = {
>>>   val numTrees = 30
>>>   val featureSubsetStrategy = "auto"
>>>   val impurity = "gini"
>>>   val maxDepth = 4
>>>   val maxBins = 32
>>>
>>>   // Build model.
>>>   val model = RandomForest.trainClassifier(
>>>     trainingData, numClasses, categoricalFeaturesInfo,
>>>     numTrees, featureSubsetStrategy, impurity, maxDepth,
>>>     maxBins)
>>>
>>>   model
>>> }
>>>
>>> // Create plain RDD.
>>> val rdd = sc.parallelize(Range(0, 1000))
>>>
>>> // Convert to LabeledPoint RDD.
>>> val data = rdd.
>>>   map(x => {
>>>     val label = x % 2
>>>     val feature1 = x % 5
>>>     val feature2 = x % 7
>>>     val features = Seq(feature1, feature2).
>>>       map(_.toDouble).
>>>       zipWithIndex.
>>>       map(_.swap)
>>>     val vector = Vectors.sparse(features.size, features)
>>>     val point = new LabeledPoint(label, vector)
>>>     point
>>>   })
>>>
>>> // Split data into training (70%) and test (30%).
>>> val splits = data.randomSplit(Array(0.7, 0.3))
>>> val (trainingData, testData) = (splits(0), splits(1))
>>>
>>> // Set up parameters for training.
>>> val numClasses = data.map(_.label).distinct.count.toInt
>>> val categoricalFeaturesInfo = Map[Int, Int]()
>>>
>>> val model = buildModel(
>>>   trainingData,
>>>   numClasses,
>>>   categoricalFeaturesInfo)
>>> testModel(model, testData)
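Following Yanbo's suggestion, the synthetic data above can be trained through the ml API instead of mllib, which makes featureImportances available on the fitted model. This is a rough sketch only, modeled on the RandomForestClassifierExample linked earlier; it assumes a Spark 1.6-era spark-shell with `sc` and `sqlContext` in scope, and the StringIndexer stage is there because ml classifiers of that era needed label metadata to infer the number of classes:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import sqlContext.implicits._

// Same synthetic data as above, but as a DataFrame with
// "label" and "features" columns for the ml API.
val data = sc.parallelize(0 until 1000).map { x =>
  LabeledPoint((x % 2).toDouble,
    Vectors.dense((x % 5).toDouble, (x % 7).toDouble))
}.toDF()

// Index the label so the classifier knows the number of classes.
val indexer = new StringIndexer().
  setInputCol("label").
  setOutputCol("indexedLabel")

// Same hyperparameters as the mllib version above.
val rf = new RandomForestClassifier().
  setLabelCol("indexedLabel").
  setFeaturesCol("features").
  setNumTrees(30).
  setImpurity("gini").
  setMaxDepth(4).
  setMaxBins(32)

val pipeline = new Pipeline().setStages(Array(indexer, rf))
val model = pipeline.fit(data)

// Extract the forest stage and print its per-feature importances
// (a Vector with one entry per feature).
val rfModel = model.stages(1).asInstanceOf[RandomForestClassificationModel]
println(rfModel.featureImportances)
```

Once the model is an ml RandomForestClassificationModel rather than an mllib RandomForestModel, featureImportances comes for free; the rest of the pipeline (train/test split, error computation) carries over with DataFrame equivalents.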
