I'm sure there's another way to do it; I hope someone can show us.  I
couldn't figure out how to use `map` either.
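
For what it's worth, here is a sketch of one way map might be made to work
(untested on my end, Spark 2.x assumed as in the rest of the thread): the
"unable to find encoder" error goes away if the mapped result is a type Spark
can encode, e.g. a tuple or case class whose fields are primitives or
UDT-backed types like Vector, or if you supply an explicit encoder such as
Encoders.kryo.

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import spark.implicits._   // "spark" is the SparkSession in the shell

val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF

// Map each Row to an (Int, Vector) tuple; tuples of primitives and
// UDT-backed types (Vector has a registered UDT) get an implicit encoder.
val parsed = dataStr.map(row => (row.getInt(0), Vectors.parse(row.getString(1))))

// Alternative: supply an explicit (binary, kryo-serialized) encoder for a
// bare Vector and map to that directly.
implicit val vecEnc = org.apache.spark.sql.Encoders.kryo[Vector]
val vecsOnly = dataStr.map(row => Vectors.parse(row.getString(1)))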

On Wed, Sep 21, 2016 at 3:32 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:

> Thanks, Peter.
> It works!
>
> Why udf is needed?
>
>
>
>
> On Wed, Sep 21, 2016 at 12:00 AM, Peter Figliozzi <
> pete.figlio...@gmail.com> wrote:
>
>> Hi Yan, I agree, it IS really confusing.  Here is the technique for
>> transforming a column.  It is very general because you can make "myConvert"
>> do whatever you want.
>>
>> import org.apache.spark.mllib.linalg.Vectors
>> import org.apache.spark.sql.functions.{col, udf}
>>
>> val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>>
>> df.show()
>> // The columns were named "_1" and "_2".
>> // Very confusing, because it looks like a Scala wildcard when we refer
>> // to it in code.
>>
>> val myConvert = (x: String) => { Vectors.parse(x) }
>> val myConvertUDF = udf(myConvert)
>>
>> val newDf = df.withColumn("parsed", myConvertUDF(col("_2")))
>>
>> newDf.show()
>>
>> On Mon, Sep 19, 2016 at 3:29 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>>
>>> Hi, all.
>>> I find this really confusing.
>>>
>>> I can use Vectors.parse to create a DataFrame contains Vector type.
>>>
>>>     scala> val dataVec = Seq((0, Vectors.parse("[1,3,5]")), (1, Vectors.parse("[2,4,6]"))).toDF
>>>     dataVec: org.apache.spark.sql.DataFrame = [_1: int, _2: vector]
>>>
>>>
>>> But using map to convert String to Vector throws an error:
>>>
>>>     scala> val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>>>     dataStr: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>>>
>>>     scala> dataStr.map(row => Vectors.parse(row.getString(1)))
>>>     <console>:30: error: Unable to find encoder for type stored in a
>>> Dataset.  Primitive types (Int, String, etc) and Product types (case
>>> classes) are supported by importing spark.implicits._  Support for
>>> serializing other types will be added in future releases.
>>>       dataStr.map(row => Vectors.parse(row.getString(1)))
>>>
>>>
>>> Can anyone help me?
>>> Thanks very much!
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi <
>>> pete.figlio...@gmail.com> wrote:
>>>
>>>> Hi Yan, I think you'll have to map the features column to a new
>>>> numerical features column.
>>>>
>>>> Here's one way to do the individual transform:
>>>>
>>>> scala> val x = "[1, 2, 3, 4, 5]"
>>>> x: String = [1, 2, 3, 4, 5]
>>>>
>>>> scala> val y: Array[Int] = x slice(1, x.length - 1) replace(",", "") split(" ") map(_.toInt)
>>>> y: Array[Int] = Array(1, 2, 3, 4, 5)
>>>>
>>>> If you don't know about the Scala command line, just type "scala" in a
>>>> terminal window.  It's a good place to try things out.
>>>>
>>>> You can make a function out of this transformation and apply it to your
>>>> features column to produce a new column, then attach that column with
>>>> Dataset.withColumn.
>>>>
>>>> See here
>>>> <http://stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column>
>>>> on how to apply a function to a Column to make a new column.
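>>>>
>>>> For instance, here's a rough sketch of putting those pieces together
>>>> (untested on my end; Spark 2.x assumed, and the column names are made
>>>> up, so adapt to your data):
>>>>
>>>> import org.apache.spark.sql.functions.{col, udf}
>>>>
>>>> val df = Seq((123, "[1, 2, 3, 4, 5]")).toDF("uid", "features")
>>>>
>>>> // Wrap the string-to-Array[Int] transform above in a UDF...
>>>> val toInts = udf((s: String) =>
>>>>   s.slice(1, s.length - 1).replace(",", "").split(" ").map(_.toInt))
>>>>
>>>> // ...and use withColumn to append the converted column.
>>>> val withParsed = df.withColumn("featuresParsed", toInts(col("features")))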
>>>>
>>>> On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai) <yaf...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I have a csv file like:
>>>>> uid      mid      features       label
>>>>> 123    5231    [0, 1, 3, ...]    True
>>>>>
>>>>> Both  "features" and "label" columns are used for GBTClassifier.
>>>>>
>>>>> However, when I read the file:
>>>>> Dataset<Row> samples = sparkSession.read().csv(file);
>>>>> The type of samples.select("features") is String.
>>>>>
>>>>> My question is:
>>>>> How do I map samples.select("features") to Vector, or some other
>>>>> appropriate type, so that I can use it to train like:
>>>>>         GBTClassifier gbdt = new GBTClassifier()
>>>>>                 .setLabelCol("label")
>>>>>                 .setFeaturesCol("features")
>>>>>                 .setMaxIter(2)
>>>>>                 .setMaxDepth(7);
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>>
>>>
>>
>
