Re: LabeledPoint creation

2016-09-08 Thread 市场部
Hi,
Below is what I typed into my Scala command line, based on your first email.
The result is different from yours; this is just for your reference.
My Spark version is 1.6.1.

import org.apache.spark.ml.feature._
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

val df=sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c"),
  (6, "d"))).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)

indexed.select("category", "categoryIndex").show()

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)

encoded.select("id", "category", "categoryVec").show()

val data = encoded.rdd.map { x =>
  val featureVector =
    Vectors.dense(x.getAs[org.apache.spark.mllib.linalg.SparseVector]("categoryVec").toArray)
  val label = x.getAs[java.lang.Integer]("id").toDouble
  LabeledPoint(label, featureVector)
}
val result = sqlContext.createDataFrame(data)

scala> result.show()
+-----+-------------+
|label|     features|
+-----+-------------+
|  0.0|[1.0,0.0,0.0]|
|  1.0|[0.0,0.0,1.0]|
|  2.0|[0.0,1.0,0.0]|
|  3.0|[1.0,0.0,0.0]|
|  4.0|[1.0,0.0,0.0]|
|  5.0|[0.0,1.0,0.0]|
|  6.0|[0.0,0.0,0.0]|
+-----+-------------+
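
One possible explanation for the difference, offered as an assumption rather than something this thread confirms: StringIndexer orders labels by descending frequency, and "b" and "d" each occur only once, so either of them may end up with the last index (the one OneHotEncoder drops by default). The plain-Scala sketch below does not use Spark; the FrequencyIndex object and the alphabetical tie-break are hypothetical choices made only for illustration.

```scala
object FrequencyIndex {
  // Assign indices by descending frequency. Ties are broken alphabetically here;
  // that tie-break is an assumption of this sketch, not documented Spark behaviour.
  def indexByFrequency(labels: Seq[String]): Map[String, Int] =
    labels.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map { case (label, _) => label }
      .zipWithIndex
      .toMap

  def main(args: Array[String]): Unit = {
    val labels = Seq("a", "b", "c", "a", "a", "c", "d")
    // "a" (3 occurrences) and "c" (2) are fixed; "b" and "d" (1 each) depend on the tie-break.
    indexByFrequency(labels).toSeq.sortBy(_._2).foreach(println)
  }
}
```

With this tie-break "d" receives the last index, which would match the all-zero row for id 6 above; the reverse tie-break would put "b" last instead.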

From: Madabhattula Rajesh Kumar <mrajaf...@gmail.com>
Date: Thursday, September 8, 2016, 2:10 PM
To: "aka.fe2s" <aka.f...@gmail.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: LabeledPoint creation

Hi,

I have done this in a different way. Please correct me if this approach is not right.

val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c"),
  (6, "d"))).toDF("id", "category")

val categories: List[String] = List("a", "b", "c", "d")
val labelPoint = df.rdd.map { line =>
  val value = line.getAs[String]("category")
  val id = line.getAs[java.lang.Integer]("id").toDouble
  // Build a fresh array for every row; one shared mutable array would be
  // aliased by every vector, because Vectors.dense does not copy it.
  val oneHot = categories.map(c => if (c == value) 1.0 else 0.0).toArray
  LabeledPoint(id, Vectors.dense(oneHot))
}
labelPoint.foreach { x => println(x) }

Output :-

(0.0,[1.0,0.0,0.0,0.0])
(1.0,[0.0,1.0,0.0,0.0])
(2.0,[0.0,0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0,0.0])
(4.0,[1.0,0.0,0.0,0.0])
(5.0,[0.0,0.0,1.0,0.0])
(6.0,[0.0,0.0,0.0,1.0])

Regards,
Rajesh


On Thu, Sep 8, 2016 at 12:35 AM, aka.fe2s <aka.f...@gmail.com> wrote:
It has 4 categories
a = 1 0 0
b = 0 0 0
c = 0 1 0
d = 0 0 1

--
Oleksiy Dyagilev

On Wed, Sep 7, 2016 at 10:42 AM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:
Hi,

Could anyone help with the use case in my mail below?

Regards,
Rajesh

On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:
Hi,

I am new to Spark ML and am trying to create a LabeledPoint from a categorical
dataset (example code from Spark). For this, I am using the one-hot encoding
<http://en.wikipedia.org/wiki/One-hot> feature. My code is below.

val df = sparkSession.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c"),
  (6, "d"))).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)

indexed.select("category", "categoryIndex").show()

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)

encoded.select("id", "category", "categoryVec").show()

Output :-
+---+--------+-------------+
| id|category|  categoryVec|
+---+--------+-------------+
|  0|       a|(3,[0],[1.0])|
|  1|       b|    (3,[],[])|
|  2|       c|(3,[1],[1.0])|
|  3|       a|(3,[0],[1.0])|
|  4|       a|(3,[0],[1.0])|
|  5|       c|(3,[1],[1.0])|
|  6|       d|(3,[2],[1.0])|
+---+--------+-------------+
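
For readers unfamiliar with the sparse notation in the table above: (3,[0],[1.0]) denotes a vector of size 3 in which the listed indices hold the listed values and every other entry is zero, so (3,[],[]) is the all-zero vector. A minimal plain-Scala sketch of that expansion follows; SparseToDense and toDense are hypothetical names for illustration, not Spark APIs.

```scala
object SparseToDense {
  // Expand a (size, indices, values) triple into a dense array of doubles.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val out = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => out(i) = v }
    out
  }

  def main(args: Array[String]): Unit = {
    // Rows for categories "a" and "b" from the table above.
    println(toDense(3, Array(0), Array(1.0)).mkString("[", ",", "]"))
    println(toDense(3, Array.empty[Int], Array.empty[Double]).mkString("[", ",", "]"))
  }
}
```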

Creating LabeledPoint

Re: LabeledPoint creation

2016-09-07 Thread Madabhattula Rajesh Kumar
Hi,

I have done this in a different way. Please correct me if this approach is
not right.

val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c"),
  (6, "d"))).toDF("id", "category")

val categories: List[String] = List("a", "b", "c", "d")
val labelPoint = df.rdd.map { line =>
  val value = line.getAs[String]("category")
  val id = line.getAs[java.lang.Integer]("id").toDouble
  // Build a fresh array for every row; one shared mutable array would be
  // aliased by every vector, because Vectors.dense does not copy it.
  val oneHot = categories.map(c => if (c == value) 1.0 else 0.0).toArray
  LabeledPoint(id, Vectors.dense(oneHot))
}
labelPoint.foreach { x => println(x) }

Output :-

(0.0,[1.0,0.0,0.0,0.0])
(1.0,[0.0,1.0,0.0,0.0])
(2.0,[0.0,0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0,0.0])
(4.0,[1.0,0.0,0.0,0.0])
(5.0,[0.0,0.0,1.0,0.0])
(6.0,[0.0,0.0,0.0,1.0])

Regards,
Rajesh
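
A note on hand-rolled encoders like the one above: it is safer to allocate a new array for each row, because a single mutable buffer reused across rows leaves every vector pointing at the same storage once the rows are materialised together. The plain-Scala sketch below shows the difference without Spark; the OneHotRows object and its method names are hypothetical and exist only for this illustration.

```scala
object OneHotRows {
  val categories: List[String] = List("a", "b", "c", "d")

  // Allocates a new array on every call, so each row owns its data.
  def encode(value: String): Array[Double] =
    categories.map(c => if (c == value) 1.0 else 0.0).toArray

  // Reuses one buffer for all rows: every returned element is the same array object.
  def encodeAllShared(values: Seq[String]): Seq[Array[Double]] = {
    val buffer = new Array[Double](categories.size)
    values.map { v =>
      categories.indices.foreach(i => buffer(i) = if (categories(i) == v) 1.0 else 0.0)
      buffer
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq("a", "b", "d")
    // All rows show the encoding of the last value, because they alias one buffer.
    encodeAllShared(rows).foreach(r => println(r.mkString("[", ",", "]")))
    // Each row keeps its own encoding.
    rows.map(encode).foreach(r => println(r.mkString("[", ",", "]")))
  }
}
```

Printing each row immediately, as the code above does with foreach, can hide the aliasing, since every row is shown before the buffer is overwritten.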




Re: LabeledPoint creation

2016-09-07 Thread aka.fe2s
It has 4 categories
a = 1 0 0
b = 0 0 0
c = 0 1 0
d = 0 0 1

--
Oleksiy Dyagilev
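
The mapping above can be reproduced with a small plain-Scala sketch, assuming the index order implied by the output quoted in this thread (a=0, c=1, d=2, b=3) and assuming the encoder simply omits the slot for the last index. DropLastEncoding and oneHot are hypothetical names for illustration, not Spark APIs.

```scala
object DropLastEncoding {
  // Index order read off the categoryVec column in the thread: a=0, c=1, d=2, b=3.
  val index: Map[String, Int] = Map("a" -> 0, "c" -> 1, "d" -> 2, "b" -> 3)

  // With dropLast = true the last index has no slot, so it encodes as all zeros.
  def oneHot(category: String, dropLast: Boolean): Array[Double] = {
    val size = if (dropLast) index.size - 1 else index.size
    val vec = Array.fill(size)(0.0)
    val i = index(category)
    if (i < size) vec(i) = 1.0
    vec
  }

  def main(args: Array[String]): Unit = {
    Seq("a", "b", "c", "d").foreach { c =>
      println(s"$c = ${oneHot(c, dropLast = true).mkString(" ")}")
    }
  }
}
```

Three slots are still enough to tell four categories apart, because the all-zero vector stands for the dropped one.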



Re: LabeledPoint creation

2016-09-07 Thread Madabhattula Rajesh Kumar
Hi,

Could anyone help with the use case in my mail below?

Regards,
Rajesh

On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:

> Hi,
>
> I am new to Spark ML and am trying to create a LabeledPoint from a categorical
> dataset (example code from Spark). For this, I am using the one-hot encoding
> feature. My code is below.
>
> val df = sparkSession.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c"),
>   (3, "a"),
>   (4, "a"),
>   (5, "c"),
>   (6, "d"))).toDF("id", "category")
>
> val indexer = new StringIndexer()
>   .setInputCol("category")
>   .setOutputCol("categoryIndex")
>   .fit(df)
>
> val indexed = indexer.transform(df)
>
> indexed.select("category", "categoryIndex").show()
>
> val encoder = new OneHotEncoder()
>   .setInputCol("categoryIndex")
>   .setOutputCol("categoryVec")
> val encoded = encoder.transform(indexed)
>
> encoded.select("id", "category", "categoryVec").show()
>
> *Output :-*
> +---+--------+-------------+
> | id|category|  categoryVec|
> +---+--------+-------------+
> |  0|       a|(3,[0],[1.0])|
> |  1|       b|    (3,[],[])|
> |  2|       c|(3,[1],[1.0])|
> |  3|       a|(3,[0],[1.0])|
> |  4|       a|(3,[0],[1.0])|
> |  5|       c|(3,[1],[1.0])|
> |  6|       d|(3,[2],[1.0])|
> +---+--------+-------------+
>
> *Creating LabeledPoint from encoded dataframe:-*
>
> val data = encoded.rdd.map { x =>
>   val featureVector =
>     Vectors.dense(x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
>   val label = x.getAs[java.lang.Integer]("id").toDouble
>   LabeledPoint(label, featureVector)
> }
>
> data.foreach { x => println(x) }
>
> *Output :-*
>
> (0.0,[1.0,0.0,0.0])
> (1.0,[0.0,0.0,0.0])
> (2.0,[0.0,1.0,0.0])
> (3.0,[1.0,0.0,0.0])
> (4.0,[1.0,0.0,0.0])
> (5.0,[0.0,1.0,0.0])
> (6.0,[0.0,0.0,1.0])
>
> I have four categorical values: a, b, c and d. I am expecting 4 features
> in the above LabeledPoint, but it has only 3.
>
> Please help me create a LabeledPoint from categorical features.
>
> Regards,
> Rajesh
>
>
>
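
If one slot per category is really wanted, OneHotEncoder has a dropLast parameter that defaults to true, and setting it to false keeps the last category. The sketch below is a hedged example rather than a confirmed fix from this thread. It assumes Spark 2.0 to 2.x, where OneHotEncoder is a Transformer (later versions require calling fit first), and it uses the DataFrame-based org.apache.spark.ml.feature.LabeledPoint. The KeepAllCategories object, the labeledPoints helper, the local master and the app name are hypothetical.

```scala
import org.apache.spark.ml.feature.{LabeledPoint, OneHotEncoder, StringIndexer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

object KeepAllCategories {
  // Builds LabeledPoints whose feature vectors have one slot per category.
  def labeledPoints(spark: SparkSession): Array[LabeledPoint] = {
    val df = spark.createDataFrame(Seq(
      (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "d")
    )).toDF("id", "category")

    val indexed = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
      .transform(df)

    // dropLast defaults to true; false keeps a slot for the last index as well.
    // Assumes Spark 2.0-2.x, where OneHotEncoder can be applied with transform directly.
    val encoded = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
      .setDropLast(false)
      .transform(indexed)

    encoded.rdd.map { row =>
      LabeledPoint(row.getAs[Int]("id").toDouble, row.getAs[Vector]("categoryVec"))
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; master and app name are placeholders.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("one-hot-keep-last")
      .getOrCreate()
    labeledPoints(spark).foreach(println)
    spark.stop()
  }
}
```

Dropping the last category is a common default because the full set of indicator columns is linearly dependent (they sum to one), which matters for some linear models; whether to keep all four depends on the model being trained.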