Hello,

I have written the following Scala code to train a regression tree using mllib:

    val conf = new SparkConf().setAppName("DecisionTreeRegressionExample")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._  // needed for the $"..." column syntax below

    val sourceData = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", ";")
      .load("C:\\Data\\source_file.csv")

    val data = sourceData.select(
      $"X3".cast("double"), $"Y".cast("double"),
      $"X1".cast("double"), $"X2".cast("double"))

    val featureIndices = List("X1", "X2", "X3").map(data.columns.indexOf(_))
    val targetIndex = data.columns.indexOf("Y")

    // WARNING: Indices in categoricalFeaturesInfo are those inside the vector
    // built from the featureIndices list, not the DataFrame column indices
    // Column 0 has two modalities, Column 1 has three
    val categoricalFeaturesInfo = Map[Int, Int]((0, 2), (1, 3))
    val impurity = "variance"
    val maxDepth = 30
    val maxBins = 32

    val labeled = data.map(row => LabeledPoint(
      row.getDouble(targetIndex),
      Vectors.dense(featureIndices.map(row.getDouble(_)).toArray)))

    val model = DecisionTree.trainRegressor(labeled.rdd,
      categoricalFeaturesInfo, impurity, maxDepth, maxBins)

    println(model.toDebugString)

This works quite well, but I want some information from the model, one of which is the feature importance values. As it turns out, this is not available on DecisionTreeModel but is available on DecisionTreeRegressionModel from the ml package. I then discovered that the ml package is more recent than the mllib package, which explains why it gives me more control over the trees I'm building. So I tried to rewrite my sample code using the ml package, and it is much easier to use; there is no need for the LabeledPoint transformation. Here is the code I came up with:

    val dt = new DecisionTreeRegressor()
      .setLabelCol("Y")
      .setImpurity("variance")
      .setMaxDepth(30)
      .setMaxBins(32)

    val model = dt.fit(data)

    println(model.toDebugString)
    println(model.featureImportances.toString)

However, I cannot find a way to specify which columns are features, which ones are categorical, and how many categories they have, as I used to do with the mllib package. I did look at the DecisionTreeRegressionExample.scala example found in the source package, but it uses a VectorIndexer to automatically discover that information, which is an unnecessary step in my case because I already have the information at hand.
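For reference, here is roughly what I understand that example to be doing, adapted to my column names. This is only a sketch under my assumptions: I am guessing that a VectorAssembler is needed first to build the single vector-typed features column the ml estimators expect, and I picked setMaxCategories(4) so that my two- and three-modality columns would be detected as categorical.

```scala
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
import org.apache.spark.ml.regression.DecisionTreeRegressor

// Assemble the raw feature columns into one vector column, since
// ml tree estimators take a single vector-typed features column.
val assembler = new VectorAssembler()
  .setInputCols(Array("X1", "X2", "X3"))
  .setOutputCol("rawFeatures")

// VectorIndexer scans the data and treats any vector slot with at
// most maxCategories distinct values as categorical -- this is the
// automatic discovery step I would like to skip.
val indexer = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMaxCategories(4)

val dt = new DecisionTreeRegressor()
  .setLabelCol("Y")
  .setFeaturesCol("features")
  .setImpurity("variance")
  .setMaxDepth(30)
  .setMaxBins(32)

val assembled = assembler.transform(data)
val indexed = indexer.fit(assembled).transform(assembled)
val model = dt.fit(indexed)
```

What I would prefer is a way to declare the categorical metadata directly, since I already know it, instead of paying for a full pass over the data in indexer.fit.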

The documentation found online (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor) did not help either because it does not indicate the format for the featuresCol string property.

Thanks in advance for your help.
