Hi Yan, I think you'll need to map the string "features" column to a new
numerical column.

Here's one way to do the individual transform:

scala> val x = "[1, 2, 3, 4, 5]"
x: String = [1, 2, 3, 4, 5]

scala> val y: Array[Int] = x.slice(1, x.length - 1).replace(",", "").split(" ").map(_.toInt)
y: Array[Int] = Array(1, 2, 3, 4, 5)

If you're not familiar with the Scala REPL, just type "scala" in a terminal
window.  It's a good place to try things out.

You can wrap this transformation in a function, apply it to your features
column to produce a new column, and then add that column with
Dataset.withColumn.

See here
<http://stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column>
for how to apply a function to a Column to produce a new column.
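Putting that together, here's a minimal sketch.  The parse function is plain
Scala (it returns Double because Spark ML vectors hold doubles); the Spark
part below it is shown as comments and assumes a Dataset named `samples`
with a string "features" column, as in your example.

```scala
// Parse a string like "[0, 1, 3]" into an array of doubles, using the same
// slice/replace/split approach as in the REPL example above.
def parseFeatures(s: String): Array[Double] =
  s.slice(1, s.length - 1).replace(",", "").split(" ").map(_.toDouble)

// Hypothetical Spark usage (untested here; requires a SparkSession):
//
//   import org.apache.spark.ml.linalg.Vectors
//   import org.apache.spark.sql.functions.{col, udf}
//
//   val toVector = udf((s: String) => Vectors.dense(parseFeatures(s)))
//   val withVec = samples.withColumn("featuresVec", toVector(col("features")))
```

Then point setFeaturesCol at the new column ("featuresVec" above) when you
build the GBTClassifier.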

On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:

> Hi,
> I have a csv file like:
> uid      mid      features       label
> 123    5231    [0, 1, 3, ...]    True
>
> Both  "features" and "label" columns are used for GBTClassifier.
>
> However, when I read the file:
> Dataset<Row> samples = sparkSession.read().csv(file);
> The type of samples.select("features") is String.
>
> My question is:
> How to map samples.select("features") to Vector or any appropriate type,
> so I can use it to train like:
>         GBTClassifier gbdt = new GBTClassifier()
>                 .setLabelCol("label")
>                 .setFeaturesCol("features")
>                 .setMaxIter(2)
>                 .setMaxDepth(7);
>
> Thanks.
>
