Hi,

Below is what I typed in my spark-shell based on your first email; the result is different from yours. Just for your reference. My Spark version is 1.6.1.
import org.apache.spark.ml.feature._
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

val df = sqlContext.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "d"))).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)
indexed.select("category", "categoryIndex").show()

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.select("id", "category", "categoryVec").show()

val data = encoded.rdd.map { x =>
  val featureVector = Vectors.dense(x.getAs[org.apache.spark.mllib.linalg.SparseVector]("categoryVec").toArray)
  val label = x.getAs[java.lang.Integer]("id").toDouble
  LabeledPoint(label, featureVector)
}
val result = sqlContext.createDataFrame(data)

scala> result.show()
+-----+-------------+
|label|     features|
+-----+-------------+
|  0.0|[1.0,0.0,0.0]|
|  1.0|[0.0,0.0,1.0]|
|  2.0|[0.0,1.0,0.0]|
|  3.0|[1.0,0.0,0.0]|
|  4.0|[1.0,0.0,0.0]|
|  5.0|[0.0,1.0,0.0]|
|  6.0|[0.0,0.0,0.0]|
+-----+-------------+

From: Madabhattula Rajesh Kumar <mrajaf...@gmail.com>
Date: Thursday, September 8, 2016, 2:10 PM
To: "aka.fe2s" <aka.f...@gmail.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: LabeledPoint creation

Hi,

I have done this in a different way. Please correct me: is this approach right?
val df = spark.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "d"))).toDF("id", "category")

val categories: List[String] = List("a", "b", "c", "d")
val categoriesList: Array[Double] = new Array[Double](categories.size)

val labelPoint = df.rdd.map { line =>
  val values = line.getAs("category").toString()
  val id = line.getAs[java.lang.Integer]("id").toDouble
  var i = -1
  categories.foreach { x => i += 1; categoriesList(i) = if (x == values) 1.0 else 0.0 }
  val denseVector = Vectors.dense(categoriesList)
  LabeledPoint(id, denseVector)
}
labelPoint.foreach { x => println(x) }

Output:

(0.0,[1.0,0.0,0.0,0.0])
(1.0,[0.0,1.0,0.0,0.0])
(2.0,[0.0,0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0,0.0])
(4.0,[1.0,0.0,0.0,0.0])
(5.0,[0.0,0.0,1.0,0.0])
(6.0,[0.0,0.0,0.0,1.0])

Regards,
Rajesh

On Thu, Sep 8, 2016 at 12:35 AM, aka.fe2s <aka.f...@gmail.com> wrote:

It has 4 categories:

a = 1 0 0
b = 0 0 0
c = 0 1 0
d = 0 0 1

--
Oleksiy Dyagilev

On Wed, Sep 7, 2016 at 10:42 AM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:

Hi,

Any help on the above mail's use case?

Regards,
Rajesh

On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:

Hi,

I am new to Spark ML and am trying to create a LabeledPoint from a categorical dataset (example code from Spark). For this, I am using the One-hot encoding<http://en.wikipedia.org/wiki/One-hot> feature.
Below is my code:

val df = sparkSession.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "d"))).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)
indexed.select("category", "categoryIndex").show()

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.select("id", "category", "categoryVec").show()

Output:

+---+--------+-------------+
| id|category|  categoryVec|
+---+--------+-------------+
|  0|       a|(3,[0],[1.0])|
|  1|       b|    (3,[],[])|
|  2|       c|(3,[1],[1.0])|
|  3|       a|(3,[0],[1.0])|
|  4|       a|(3,[0],[1.0])|
|  5|       c|(3,[1],[1.0])|
|  6|       d|(3,[2],[1.0])|
+---+--------+-------------+

Creating a LabeledPoint from the encoded DataFrame:

val data = encoded.rdd.map { x =>
  val featureVector = Vectors.dense(x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
  val label = x.getAs[java.lang.Integer]("id").toDouble
  LabeledPoint(label, featureVector)
}
data.foreach { x => println(x) }

Output:

(0.0,[1.0,0.0,0.0])
(1.0,[0.0,0.0,0.0])
(2.0,[0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0])
(4.0,[1.0,0.0,0.0])
(5.0,[0.0,1.0,0.0])
(6.0,[0.0,0.0,1.0])

I have four categorical values (a, b, c, d). I am expecting 4 features in the above LabeledPoint, but it has only 3. Please help me with the creation of a LabeledPoint from categorical features.

Regards,
Rajesh
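[Editor's note on the 3-vs-4 features question running through this thread: `OneHotEncoder` in Spark ML has a `dropLast` parameter that defaults to `true`, so k categories are encoded into k - 1 slots and the last-indexed category becomes the all-zeros vector; calling `.setDropLast(false)` on the encoder should keep all four slots. The plain-Scala sketch below (no Spark; `oneHot` is an illustrative helper, not the Spark API) shows the two behaviours.]

```scala
// Illustrative sketch (not the Spark API) of dropLast-style one-hot encoding.
// With dropLast = true, k categories yield k - 1 slots, so the category with
// the highest index encodes as the all-zeros vector -- which is why only
// 3 features appear above for 4 categories, and why one row is all zeros.
def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Array[Double] = {
  val size = if (dropLast) numCategories - 1 else numCategories
  val v = Array.fill(size)(0.0)
  if (index < size) v(index) = 1.0
  v
}

// dropLast = true: the last category (index 3) encodes as all zeros.
println(oneHot(3, 4).mkString("[", ",", "]"))
// dropLast = false: every category gets its own slot.
println(oneHot(3, 4, dropLast = false).mkString("[", ",", "]"))
```

With four categories, `oneHot(3, 4)` yields `[0.0,0.0,0.0]` while `oneHot(3, 4, dropLast = false)` yields `[0.0,0.0,0.0,1.0]`, matching the difference between the 3-feature and 4-feature outputs quoted above. Note also that which category lands on the dropped last index depends on `StringIndexer`'s frequency-based ordering, which is why one reply shows "b" as all zeros and another shows "d".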