Hi,

I am new to Spark ML and am trying to create LabeledPoints from a categorical dataset (example code from Spark). For this, I am using the one-hot encoding <http://en.wikipedia.org/wiki/One-hot> feature. Below is my code:

val df = sparkSession.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c"),
  (6, "d"))).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)
indexed.select("category", "categoryIndex").show()

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.select("id", "category", "categoryVec").show()

*Output:*

+---+--------+-------------+
| id|category|  categoryVec|
+---+--------+-------------+
|  0|       a|(3,[0],[1.0])|
|  1|       b|    (3,[],[])|
|  2|       c|(3,[1],[1.0])|
|  3|       a|(3,[0],[1.0])|
|  4|       a|(3,[0],[1.0])|
|  5|       c|(3,[1],[1.0])|
|  6|       d|(3,[2],[1.0])|
+---+--------+-------------+

*Creating LabeledPoints from the encoded DataFrame:*

val data = encoded.rdd.map { x =>
  {
    // Expand the sparse one-hot vector into a dense feature vector
    val featureVector = Vectors.dense(x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
    // The "id" column is used as the label here
    val label = x.getAs[java.lang.Integer]("id").toDouble
    LabeledPoint(label, featureVector)
  }
}

data.foreach { x => println(x) }

*Output:*

(0.0,[1.0,0.0,0.0])
(1.0,[0.0,0.0,0.0])
(2.0,[0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0])
(4.0,[1.0,0.0,0.0])
(5.0,[0.0,1.0,0.0])
(6.0,[0.0,0.0,1.0])

I have four categorical values: a, b, c and d. I am expecting 4 features in the above LabeledPoints, but each one has only 3 features. Please help me create LabeledPoints from categorical features.

Regards,
Rajesh
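
A note on the 3-versus-4 question, offered as a sketch rather than a definitive answer: OneHotEncoder has a dropLast parameter that defaults to true, so the category with the highest index is encoded as the all-zeros vector (here "b", shown as (3,[],[])). Setting it to false should keep one slot per category. The snippet below assumes the Spark 2.x API used in the code above, where OneHotEncoder is a Transformer; in Spark 3.x it is an Estimator and would need a fit(indexed) call before transform. The object name OneHotKeepAll and the local[*] master are illustrative choices, not part of the original code.

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to make the sketch self-contained.
object OneHotKeepAll {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this toy dataset.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("OneHotKeepAll")
      .getOrCreate()

    // Same toy data as in the question.
    val df = spark.createDataFrame(Seq(
      (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "d")
    )).toDF("id", "category")

    // Map each category string to a numeric index.
    val indexed = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
      .transform(df)

    // dropLast defaults to true; false keeps a slot for every category.
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
      .setDropLast(false)

    // Spark 2.x style: OneHotEncoder is used directly as a Transformer.
    encoder.transform(indexed).select("id", "category", "categoryVec").show()

    spark.stop()
  }
}
```

With dropLast set to false, each categoryVec should have size 4, so the LabeledPoint mapping above would yield four features per row. The default drops one slot because the full set of one-hot columns always sums to one, which can matter for linear models fitted with an intercept.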