Re: Dataframe, Java: How to convert String to Vector ?

2016-10-02 Thread Yan Facai
Hi, Peter.

It's interesting that `DecisionTreeRegressor.transformImpl` also uses a udf
to transform the dataframe, instead of using map:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L175
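
For reference, the body of that method is essentially the same
udf-plus-withColumn pattern used below. A paraphrased sketch (not verbatim;
details may differ across Spark versions):

// Paraphrased sketch of DecisionTreeRegressionModel.transformImpl: the
// per-row prediction is wrapped in a udf and applied with withColumn,
// rather than expressed as a typed map over the Dataset.
override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
  val predictUDF = udf { (features: Vector) => predict(features) }
  dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
}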




Re: Dataframe, Java: How to convert String to Vector ?

2016-09-21 Thread Peter Figliozzi
I'm sure there's another way to do it; I hope someone can show us.  I
couldn't figure out how to use `map` either.
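
For completeness, one way map can be made to work is to put an explicit
Encoder for Vector in scope, since spark.implicits._ only covers primitives
and Product types. A sketch, assuming Spark 2.x in the spark-shell and the
mllib Vectors used elsewhere in this thread:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Encoder, Encoders}

// spark.implicits._ has no Encoder for Vector, so supply one explicitly.
// Kryo is a blunt but workable choice for demonstration purposes.
implicit val vectorEncoder: Encoder[Vector] = Encoders.kryo[Vector]

val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
val parsed = dataStr.map(row => Vectors.parse(row.getString(1)))
parsed.collect().foreach(println)  // [1.0,3.0,5.0] and [2.0,4.0,6.0]

The catch is that the kryo-encoded column is an opaque binary blob to Spark
SQL, which is presumably why the udf route (going through the built-in
VectorUDT) is the idiomatic one.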



Re: Dataframe, Java: How to convert String to Vector ?

2016-09-21 Thread Yan Facai
Thanks, Peter.
It works!

Why is udf needed?


Re: Dataframe, Java: How to convert String to Vector ?

2016-09-20 Thread Peter Figliozzi
Hi Yan, I agree, it IS really confusing.  Here is the technique for
transforming a column.  It is very general because you can make "myConvert"
do whatever you want.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF

df.show()
// The columns were named "_1" and "_2".
// Very confusing, because "_2" looks like a Scala wildcard when we refer to it in code.

val myConvert = (x: String) => { Vectors.parse(x) }
val myConvertUDF = udf(myConvert)

val newDf = df.withColumn("parsed", myConvertUDF(col("_2")))

newDf.show()
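
To confirm the conversion, check the schema: the new column should show up
with the vector type. (The commented output below is what I'd expect on
Spark 2.x, reproduced from memory rather than verbatim.)

newDf.printSchema()
// root
//  |-- _1: integer (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- parsed: vector (nullable = true)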



Re: Dataframe, Java: How to convert String to Vector ?

2016-09-19 Thread Yan Facai
Hi, all.
I find this really confusing.

I can use Vectors.parse to create a DataFrame containing a Vector column.

scala> val dataVec = Seq((0, Vectors.parse("[1,3,5]")), (1,
Vectors.parse("[2,4,6]"))).toDF
dataVec: org.apache.spark.sql.DataFrame = [_1: int, _2: vector]


But using map to convert String to Vector throws an error:

scala> val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
dataStr: org.apache.spark.sql.DataFrame = [_1: int, _2: string]

scala> dataStr.map(row => Vectors.parse(row.getString(1)))
<console>:30: error: Unable to find encoder for type stored in a
Dataset.  Primitive types (Int, String, etc) and Product types (case
classes) are supported by importing spark.implicits._  Support for
serializing other types will be added in future releases.
  dataStr.map(row => Vectors.parse(row.getString(1)))


Can anyone help me? Thanks very much!



Re: Dataframe, Java: How to convert String to Vector ?

2016-09-08 Thread Yan Facai
Many thanks, Peter.



Re: Dataframe, Java: How to convert String to Vector ?

2016-09-07 Thread Peter Figliozzi
Here's a decent GitHub book: Mastering Apache Spark.

I'm new at Scala too.  I found it very helpful to study the Scala language
without Spark.  The Scala documentation is excellent.

Pete



Re: Dataframe, Java: How to convert String to Vector ?

2016-09-06 Thread Yan Facai
Hi Peter,
I'm familiar with Pandas / NumPy in Python, while Spark / Scala is totally
new to me.
Pandas provides detailed documentation, such as how to slice data, parse
files, and use the apply and filter functions.

Does Spark have similarly detailed documentation?


Re: Dataframe, Java: How to convert String to Vector ?

2016-09-06 Thread Peter Figliozzi
Hi Yan, I think you'll have to map the features column to a new numerical
features column.

Here's one way to do the individual transform:

scala> val x = "[1, 2, 3, 4, 5]"
x: String = [1, 2, 3, 4, 5]

scala> val y: Array[Int] = x slice(1, x.length - 1) replace(",", "") split(" ") map(_.toInt)
y: Array[Int] = Array(1, 2, 3, 4, 5)

If you don't know about the Scala command line, just type "scala" in a
terminal window.  It's a good place to try things out.

You can make a function out of this transformation and apply it to your
features column to make a new column.  Then add this with
Dataset.withColumn.

See here on how to apply a function to a Column to make a new column.
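
Putting the pieces together, here is a sketch of the whole path for the
original question quoted below. Two assumptions worth flagging: it targets
Spark 2.x, where ml estimators such as GBTClassifier expect
org.apache.spark.ml.linalg vectors (not the older mllib type), and "samples"
is the Dataset read from the csv, with string columns "features" and "label":

import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf, when}

// Parse "[0, 1, 3]"-style strings into ml vectors for the estimator.
val toVec = udf { (s: String) =>
  Vectors.dense(s.slice(1, s.length - 1).split(",").map(_.trim.toDouble))
}

val prepared = samples
  .withColumn("features", toVec(col("features")))
  .withColumn("label", when(col("label") === "True", 1.0).otherwise(0.0))

val gbdt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(2)
  .setMaxDepth(7)

val model = gbdt.fit(prepared)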

On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai)  wrote:

> Hi,
> I have a csv file like:
> uid  mid  features   label
> 123  5231  [0, 1, 3, ...]  True
>
> Both  "features" and "label" columns are used for GBTClassifier.
>
> However, when I read the file:
> Dataset<Row> samples = sparkSession.read().csv(file);
> The type of samples.select("features") is String.
>
> My question is:
> How do I map samples.select("features") to Vector or any appropriate type,
> so I can use it to train like:
> GBTClassifier gbdt = new GBTClassifier()
> .setLabelCol("label")
> .setFeaturesCol("features")
> .setMaxIter(2)
> .setMaxDepth(7);
>
> Thanks.
>