Hi,

I'm trying out ml.classification.RandomForestClassifier on a simple
DataFrame, and fit() throws an exception saying that the number of classes
has not been specified for the label column. However, I cannot find a method
that sets the number of classes, or anywhere to pass it as an argument. In
mllib, numClasses is a parameter passed when training the model. In ml,
there is an ugly workaround using StringIndexer, but is that really the
intended approach? LogisticRegression and NaiveBayes in ml work without
setting the number of classes.

Thanks for any pointers!
Kristina

My code:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.mllib.linalg.{Vector, Vectors}

case class Record(label: Double, features: Vector)

val df = sc.parallelize(Seq( Record(0.0, Vectors.dense(1.0, 0.0) ),
                        Record(0.0, Vectors.dense(1.1, 0.0) ),
                        Record(0.0, Vectors.dense(1.2, 0.0) ),
                        Record(1.0, Vectors.dense(0.0, 1.2) ),
                        Record(1.0, Vectors.dense(0.0, 1.3) ),
                        Record(1.0, Vectors.dense(0.0, 1.7) ))
                       ).toDF()

val rf = new RandomForestClassifier()
val rfmodel = rf.fit(df)

And the error is:

scala> val rfmodel = rf.fit(df)
java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:87)
at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:42)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
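
For reference, the StringIndexer workaround that the error message points to
looks roughly like the sketch below. It assumes it is run in spark-shell after
the code above (so df exists), and that the Spark version in use has a
StringIndexer that accepts a numeric input column. The column name
"indexedLabel" and the names indexer, indexed, rfIndexed and rfIndexedModel
are my own choices for illustration, not anything the library requires.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer

// StringIndexer writes label indices to a new column and attaches
// nominal metadata to it, which is where the classifier can find
// the number of classes.
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel") // hypothetical column name
val indexed = indexer.fit(df).transform(df)

// Point the classifier at the indexed column instead of the raw label.
val rfIndexed = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
val rfIndexedModel = rfIndexed.fit(indexed)
```

One caveat: StringIndexer assigns indices by label frequency, so the indexed
values are not guaranteed to match the original 0.0/1.0 labels.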
