I'm sure there's another way to do it; I hope someone can show us. I couldn't figure out how to use `map` either.
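My guess at why udf is needed: withColumn works on Column expressions, and udf only needs Spark's SQL type mapping (Vector has a UDT registered for it), whereas Dataset.map needs an Encoder for the result type, and spark.implicits._ only provides encoders for primitives and Products -- hence the error Yan saw. The closest I got with map is this untested sketch, which hands it the generic Kryo encoder explicitly:

// spark-shell, Spark 2.x (spark.implicits._ is pre-imported for toDF)
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Encoders

// Kryo can serialize arbitrary classes, which fills the Encoder gap
// that spark.implicits._ leaves for Vector.
implicit val vectorEncoder = Encoders.kryo[Vector]

val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
val vecs = dataStr.map(row => Vectors.parse(row.getString(1)))

// Caveat: the Kryo encoder stores each Vector as an opaque binary
// blob, so vecs ends up with a single "value: binary" column --
// fine for Scala-side work, useless as an ML "features" column.

That caveat is why I'd stick with the udf/withColumn approach below for anything feeding an ML algorithm.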
On Wed, Sep 21, 2016 at 3:32 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:

> Thanks, Peter.
> It works!
>
> Why is udf needed?
>
> On Wed, Sep 21, 2016 at 12:00 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>
>> Hi Yan, I agree, it IS really confusing. Here is the technique for
>> transforming a column. It is very general because you can make "myConvert"
>> do whatever you want.
>>
>> import org.apache.spark.mllib.linalg.Vectors
>> import org.apache.spark.sql.functions.{col, udf} // pre-imported in spark-shell
>>
>> val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>>
>> df.show()
>> // The columns were named "_1" and "_2".
>> // Very confusing, because it looks like a Scala wildcard when we refer
>> // to it in code.
>>
>> val myConvert = (x: String) => { Vectors.parse(x) }
>> val myConvertUDF = udf(myConvert)
>>
>> val newDf = df.withColumn("parsed", myConvertUDF(col("_2")))
>>
>> newDf.show()
>>
>> On Mon, Sep 19, 2016 at 3:29 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>>
>>> Hi, all.
>>> I find this really confusing.
>>>
>>> I can use Vectors.parse to create a DataFrame that contains a Vector column:
>>>
>>> scala> val dataVec = Seq((0, Vectors.parse("[1,3,5]")), (1, Vectors.parse("[2,4,6]"))).toDF
>>> dataVec: org.apache.spark.sql.DataFrame = [_1: int, _2: vector]
>>>
>>> But using map to convert String to Vector throws an error:
>>>
>>> scala> val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>>> dataStr: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>>>
>>> scala> dataStr.map(row => Vectors.parse(row.getString(1)))
>>> <console>:30: error: Unable to find encoder for type stored in a
>>> Dataset. Primitive types (Int, String, etc) and Product types (case
>>> classes) are supported by importing spark.implicits._ Support for
>>> serializing other types will be added in future releases.
>>> dataStr.map(row => Vectors.parse(row.getString(1)))
>>>
>>> Can anyone help me? Thanks very much!
>>>
>>> On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>>>
>>>> Hi Yan, I think you'll have to map the features column to a new
>>>> numerical features column.
>>>>
>>>> Here's one way to do the individual transform:
>>>>
>>>> scala> val x = "[1, 2, 3, 4, 5]"
>>>> x: String = [1, 2, 3, 4, 5]
>>>>
>>>> scala> val y: Array[Int] = x slice(1, x.length - 1) replace(",", "") split(" ") map(_.toInt)
>>>> y: Array[Int] = Array(1, 2, 3, 4, 5)
>>>>
>>>> If you don't know about the Scala command line, just type "scala" in a
>>>> terminal window. It's a good place to try things out.
>>>>
>>>> You can make a function out of this transformation and apply it to your
>>>> features column to make a new column, then add that with
>>>> Dataset.withColumn.
>>>>
>>>> See here
>>>> <http://stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column>
>>>> for how to apply a function to a Column to make a new column.
>>>>
>>>> On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I have a csv file like:
>>>>>
>>>>> uid mid features label
>>>>> 123 5231 [0, 1, 3, ...] True
>>>>>
>>>>> Both the "features" and "label" columns are used for GBTClassifier.
>>>>>
>>>>> However, when I read the file:
>>>>>
>>>>> Dataset<Row> samples = sparkSession.read().csv(file);
>>>>>
>>>>> the type of samples.select("features") is String.
>>>>>
>>>>> My question is: how do I map samples.select("features") to Vector, or
>>>>> any other appropriate type, so that I can train like this:
>>>>>
>>>>> GBTClassifier gbdt = new GBTClassifier()
>>>>>     .setLabelCol("label")
>>>>>     .setFeaturesCol("features")
>>>>>     .setMaxIter(2)
>>>>>     .setMaxDepth(7);
>>>>>
>>>>> Thanks.
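P.S. For anyone who finds this thread later, here is an end-to-end sketch of the udf approach (untested; "data.csv", the header option, and the label encoding are my assumptions about Yan's file, and note that the spark.ml GBTClassifier expects the new org.apache.spark.ml.linalg.Vector rather than the mllib one):

import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.sql.functions.{col, udf}

// Vectors.parse lives in mllib.linalg; .asML converts the result to the
// ml.linalg Vector that spark.ml algorithms expect.
val parseVec = udf { s: String =>
  org.apache.spark.mllib.linalg.Vectors.parse(s).asML
}

// Hypothetical path; assumes a header row and quoted "features" fields
// (the vector text contains commas).
val raw = spark.read.option("header", "true").csv("data.csv")

val samples = raw
  .withColumn("features", parseVec(col("features")))
  .withColumn("label", (col("label") === "True").cast("double")) // "True"/"False" -> 1.0/0.0

val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(2)
  .setMaxDepth(7)

val model = gbt.fit(samples)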