Hello,
I have written the following Scala code to train a regression tree based on mllib:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

val conf = new SparkConf().setAppName("DecisionTreeRegressionExample")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val sourceData = spark.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ";")
  .load("C:\\Data\\source_file.csv")
val data = sourceData.select($"X3".cast("double"), $"Y".cast("double"),
  $"X1".cast("double"), $"X2".cast("double"))

val featureIndices = List("X1", "X2", "X3").map(data.columns.indexOf(_))
val targetIndex = data.columns.indexOf("Y")

// WARNING: the indices in categoricalFeaturesInfo refer to positions inside
// the vector built from the featureIndices list.
// Feature 0 has two modalities, feature 1 has three.
val categoricalFeaturesInfo = Map[Int, Int]((0, 2), (1, 3))
val impurity = "variance"
val maxDepth = 30
val maxBins = 32

val labeled = data.map(row => LabeledPoint(row.getDouble(targetIndex),
  Vectors.dense(featureIndices.map(row.getDouble(_)).toArray)))

val model = DecisionTree.trainRegressor(labeled.rdd, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
println(model.toDebugString)
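To make the warning about index remapping concrete, here is a minimal plain-Scala illustration (no Spark needed; the column order is the one produced by the select above):

```scala
// Column order of the DataFrame after the select: X3, Y, X1, X2
val columns = Array("X3", "Y", "X1", "X2")

// The feature vector is built in the order X1, X2, X3, so we look up
// each feature's position in the DataFrame:
val featureIndices = List("X1", "X2", "X3").map(columns.indexOf(_))
println(featureIndices) // List(2, 3, 0)

// Inside the assembled vector, X1 is feature 0 and X2 is feature 1,
// and those vector positions are what the keys of
// categoricalFeaturesInfo refer to:
val categoricalFeaturesInfo = Map[Int, Int](0 -> 2, 1 -> 3)
```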
This works quite well, but I want some additional information from the model, in particular the feature importance values. As it turns out, these are not available on DecisionTreeModel, but they are available on DecisionTreeRegressionModel from the ml package.
I then discovered that the ml package is more recent than the mllib package, which explains why it gives me more control over the trees I'm building.
So I tried to rewrite my sample code using the ml package, and it is much easier to use: there is no need for the LabeledPoint transformation. Here is the code I came up with:
import org.apache.spark.ml.regression.DecisionTreeRegressor

val dt = new DecisionTreeRegressor()
  .setLabelCol("Y")
  .setImpurity("variance")
  .setMaxDepth(30)
  .setMaxBins(32)
val model = dt.fit(data)
println(model.toDebugString)
println(model.featureImportances.toString)
However, I cannot find a way to specify which columns are features, which of them are categorical, and how many categories each has, as I used to do with the mllib package.
I did look at the DecisionTreeRegressionExample.scala example found in the source package, but it uses a VectorIndexer to discover the above information automatically, which is an unnecessary step in my case because I already have the information at hand.
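For concreteness, assembling the features into a single vector column seems to be what VectorAssembler is for; the sketch below (untested, using my column names) shows how far that gets me, but I still do not see where the equivalent of categoricalFeaturesInfo would go:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.DecisionTreeRegressor

// Sketch: assemble X1, X2, X3 into a single "features" vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("X1", "X2", "X3"))
  .setOutputCol("features")
val assembled = assembler.transform(data)

val dt = new DecisionTreeRegressor()
  .setLabelCol("Y")
  .setFeaturesCol("features")
  .setImpurity("variance")
  .setMaxDepth(30)
  .setMaxBins(32)
// Open question: where does the equivalent of
// categoricalFeaturesInfo = Map(0 -> 2, 1 -> 3) go in this API?
val model = dt.fit(assembled)
```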
The documentation found online
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor)
did not help either, because it does not indicate the expected format of the
featuresCol string property.
Thanks in advance for your help.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org