Hello,

I have written the following Scala code to train a regression tree using mllib:

    val conf = new SparkConf().setAppName("DecisionTreeRegressionExample")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._  // needed for the $"..." column syntax below

    val sourceData = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", ";")
      .load("C:\\Data\\source_file.csv")

    val data = sourceData.select(
      $"X3".cast("double"), $"Y".cast("double"),
      $"X1".cast("double"), $"X2".cast("double"))

    val featureIndices = List("X1", "X2", "X3").map(data.columns.indexOf(_))
    val targetIndex = data.columns.indexOf("Y")

    // WARNING: Indices in categoricalFeaturesInfo are those inside the vector
    // built from the featureIndices list, not the DataFrame column indices
    // Column 0 has two modalities, Column 1 has three
    val categoricalFeaturesInfo = Map[Int, Int]((0, 2), (1, 3))
    val impurity = "variance"
    val maxDepth = 30
    val maxBins = 32

    val labeled = data.map(row => LabeledPoint(
      row.getDouble(targetIndex),
      Vectors.dense(featureIndices.map(row.getDouble(_)).toArray)))

    val model = DecisionTree.trainRegressor(labeled.rdd,
      categoricalFeaturesInfo, impurity, maxDepth, maxBins)

    println(model.toDebugString)

This works quite well, but I want some information from the model, one of which is the feature importance values. As it turns out, this is not available on DecisionTreeModel but is available on DecisionTreeRegressionModel from the ml package. I then discovered that the ml package is more recent than the mllib package, which explains why it gives me more control over the trees I'm building. So I tried to rewrite my sample code using the ml package, and it is much easier to use; there is no need for the LabeledPoint transformation. Here is the code I came up with:

    val dt = new DecisionTreeRegressor()
      .setLabelCol("Y")
      .setImpurity("variance")
      .setMaxDepth(30)
      .setMaxBins(32)

    val model = dt.fit(data)

    println(model.toDebugString)
    println(model.featureImportances.toString)

However, I cannot find a way to specify which columns are features, which ones are categorical, and how many categories they have, as I used to do with the mllib package. I did look at the DecisionTreeRegressionExample.scala example found in the source package, but it uses a VectorIndexer to automatically discover that information, which is an unnecessary step in my case because I already have the information at hand.
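For reference, here is roughly what I understand that example to be doing, adapted to my column names. This is only a sketch under my assumptions: I am guessing that a VectorAssembler is needed first to build the single vector-typed features column the ml estimators expect, and I picked setMaxCategories(4) so that my two- and three-modality columns would be detected as categorical.

```scala
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
import org.apache.spark.ml.regression.DecisionTreeRegressor

// Assemble the raw feature columns into one vector column, since
// ml tree estimators take a single vector-typed features column.
val assembler = new VectorAssembler()
  .setInputCols(Array("X1", "X2", "X3"))
  .setOutputCol("rawFeatures")

// VectorIndexer scans the data and treats any vector slot with at
// most maxCategories distinct values as categorical -- this is the
// automatic discovery step I would like to skip.
val indexer = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMaxCategories(4)

val dt = new DecisionTreeRegressor()
  .setLabelCol("Y")
  .setFeaturesCol("features")
  .setImpurity("variance")
  .setMaxDepth(30)
  .setMaxBins(32)

val assembled = assembler.transform(data)
val indexed = indexer.fit(assembled).transform(assembled)
val model = dt.fit(indexed)
```

What I would prefer is a way to declare the categorical metadata directly, since I already know it, instead of paying for a full pass over the data in indexer.fit.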

The documentation found online (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor) did not help either because it does not indicate the format for the featuresCol string property.

Thanks in advance for your help.
