Hi Peter,
I'm familiar with Pandas / NumPy in Python, while Spark / Scala is totally
new to me.
Pandas provides detailed documentation on things like how to slice data,
parse files, and use the apply and filter functions.

Does Spark have similarly detailed documentation?



On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi <pete.figlio...@gmail.com>
wrote:

> Hi Yan, I think you'll have to map the features column to a new numerical
> features column.
>
> Here's one way to do the individual transform:
>
> scala> val x = "[1, 2, 3, 4, 5]"
> x: String = [1, 2, 3, 4, 5]
>
> scala> val y: Array[Int] = x.slice(1, x.length - 1).replace(",", "").split(" ").map(_.toInt)
> y: Array[Int] = Array(1, 2, 3, 4, 5)
>
> If you don't know about the Scala command line, just type "scala" in a
> terminal window.  It's a good place to try things out.
>
> You can make a function out of this transformation, apply it to your
> features column to produce a new column, and then add that column to the
> Dataset with Dataset.withColumn (see the sketch below the link).
>
> See here
> <http://stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column>
> on how to apply a function to a Column to make a new column.
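>
> Here's a minimal sketch of that approach, assuming a Spark 2.x Scala
> session where your CSV is already loaded as a Dataset named samples (the
> names parseFeatures and featuresVec are just placeholders I made up). It
> wraps the string parsing in a udf and produces an ml Vector, which is the
> type GBTClassifier expects for its features column:
>
> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.sql.functions.{col, udf}
>
> // Parse a string like "[1, 2, 3]" into an ml Vector of Doubles.
> val parseFeatures = udf { s: String =>
>   Vectors.dense(
>     s.slice(1, s.length - 1)  // drop the enclosing brackets
>      .split(",")              // "1, 2, 3" -> Array("1", " 2", " 3")
>      .map(_.trim.toDouble))   // Vectors hold Doubles, not Ints
> }
>
> // Add the parsed column alongside the original string column.
> val withVec = samples.withColumn("featuresVec", parseFeatures(col("features")))
>
> You'd then point setFeaturesCol at "featuresVec" instead of "features".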
>
> On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>
>> Hi,
>> I have a csv file like:
>> uid      mid      features       label
>> 123    5231    [0, 1, 3, ...]    True
>>
>> Both the "features" and "label" columns are used by GBTClassifier.
>>
>> However, when I read the file:
>> Dataset<Row> samples = sparkSession.read().csv(file);
>> the type of the "features" column is String.
>>
>> My question is:
>> How do I map the "features" column to Vector, or another appropriate
>> type, so that I can use it for training like this:
>>         GBTClassifier gbdt = new GBTClassifier()
>>                 .setLabelCol("label")
>>                 .setFeaturesCol("features")
>>                 .setMaxIter(2)
>>                 .setMaxDepth(7);
>>
>> Thanks.
>>
>
>
