Hi,

I am new to Spark ML and am trying to create a LabeledPoint from a categorical
dataset (based on the example code from Spark). For this, I am using the one-hot
encoding <http://en.wikipedia.org/wiki/One-hot> feature. Below is my code:

val df = sparkSession.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c"),
  (6, "d"))).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)

indexed.select("category", "categoryIndex").show()

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)

encoded.select("id", "category", "categoryVec").show()

*Output:*
+---+--------+-------------+
| id|category|  categoryVec|
+---+--------+-------------+
|  0|       a|(3,[0],[1.0])|
|  1|       b|    (3,[],[])|
|  2|       c|(3,[1],[1.0])|
|  3|       a|(3,[0],[1.0])|
|  4|       a|(3,[0],[1.0])|
|  5|       c|(3,[1],[1.0])|
|  6|       d|(3,[2],[1.0])|
+---+--------+-------------+

*Creating a LabeledPoint from the encoded DataFrame:*

val data = encoded.rdd.map { x =>
  val featureVector = Vectors.dense(
    x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
  val label = x.getAs[java.lang.Integer]("id").toDouble
  LabeledPoint(label, featureVector)
}

data.foreach { x => println(x) }

*Output:*

(0.0,[1.0,0.0,0.0])
(1.0,[0.0,0.0,0.0])
(2.0,[0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0])
(4.0,[1.0,0.0,0.0])
(5.0,[0.0,1.0,0.0])
(6.0,[0.0,0.0,1.0])

I have four categorical values: a, b, c, and d. I am expecting 4 features
in the above LabeledPoint, but it has only 3 features.
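
A possible explanation, offered as an assumption rather than a confirmed
answer: in the Spark 2.x API used above, OneHotEncoder has a dropLast
parameter that defaults to true, so the last category index is encoded as an
all-zero vector and the output has one element fewer than the number of
categories. That would match the (3,[],[]) row for "b". The sketch below
shows the same pipeline with setDropLast(false). The object name
DropLastSketch, the function name encodeKeepingAllCategories, and the local
SparkSession setup are hypothetical and only there to make the example
self-contained. On Spark 3.x, OneHotEncoder is an estimator and would need
.fit(indexed) before .transform(indexed).

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropLastSketch {

  // Hypothetical helper: index and one-hot encode the sample data,
  // keeping a vector slot for every category (dropLast = false).
  def encodeKeepingAllCategories(spark: SparkSession): DataFrame = {
    val df = spark.createDataFrame(Seq(
      (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "d")
    )).toDF("id", "category")

    // Map each category string to a numeric index.
    val indexed = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
      .transform(df)

    // Assumes the Spark 2.x transformer API, as in the code above.
    new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
      .setDropLast(false) // do not drop the last category
      .transform(indexed)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("DropLastSketch")
      .master("local[*]")
      .getOrCreate()

    encodeKeepingAllCategories(spark)
      .select("id", "category", "categoryVec")
      .show()

    spark.stop()
  }
}
```

With dropLast disabled, each categoryVec should have one slot per category,
so the LabeledPoint conversion above would then yield 4 features. Note that
for linear models the dropped column is often intentional, since the full
one-hot vector sums to one and is collinear with an intercept term.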

Please help me create a LabeledPoint from categorical features.

Regards,
Rajesh
