Hi,
What is your source data? I am guessing a DataFrame of Integers, since you
are using a UDF.
So your DataFrame is a bunch of Row[Integer]?
Below is a sample from my own code to predict Euro Cup winners, going from
a DataFrame of Row[Double] to an RDD of LabeledPoint.
I am not using a UDF to convert to a Vector; I have never tried that. Can
anyone suggest a better way? I don't think my approach is very good.
hth
// this is a DataFrame of Row[Double]
val euroQualifierDataFrame = getDataSet(sqlContext, trainDataPath)
// to an RDD of Seq[Double]; each row is now a sequence of Doubles
val vectorRdd = euroQualifierDataFrame.map(createVectorRDD)
// the second parameter identifies which item in the Seq[Double] is the label;
// the output will be an RDD[LabeledPoint]
val data = toLabeledPointsRDD(vectorRdd, 0)
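// imports assumed: the mllib API, matching the VectorUDT import in the question
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// convert every column of the row to a Double (assumes all columns are numeric)
def createVectorRDD(row: Row): Seq[Double] = {
  row.toSeq.map(_.asInstanceOf[Number].doubleValue)
}

// the value at targetFeatureIdx becomes the label; the rest become the features
def createLabeledPoint(row: Seq[Double], targetFeatureIdx: Int): LabeledPoint = {
  val features = row.zipWithIndex.filter(tpl => tpl._2 != targetFeatureIdx).map(tpl => tpl._1)
  val main = row(targetFeatureIdx)
  LabeledPoint(main, Vectors.dense(features.toArray))
}

def toLabeledPointsRDD(rddData: RDD[Seq[Double]], targetFeatureIdx: Int): RDD[LabeledPoint] = {
  rddData.map(seq => createLabeledPoint(seq, targetFeatureIdx))
}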
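
To make the label/feature split in createLabeledPoint concrete, here is a
small plain-Scala sketch that runs without Spark. The object name
LabelSplitDemo, the helper splitLabel and the sample values are made up for
illustration; they are not part of my real code.

```scala
// Plain-Scala sketch of the label/feature split; no Spark dependency.
// LabelSplitDemo and splitLabel are hypothetical names used only for this example.
object LabelSplitDemo {

  // Returns (label, features): the value at labelIdx is the label,
  // every other value is kept, in order, as a feature.
  def splitLabel(row: Seq[Double], labelIdx: Int): (Double, Seq[Double]) = {
    val features = row.zipWithIndex.collect { case (v, i) if i != labelIdx => v }
    (row(labelIdx), features)
  }

  def main(args: Array[String]): Unit = {
    // made-up row: the first value plays the role of the label
    val (label, features) = splitLabel(Seq(1.0, 0.5, 3.0), 0)
    println(s"label=$label features=$features")
  }
}
```

In the Spark version the same pair simply ends up in
LabeledPoint(label, Vectors.dense(features.toArray)).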
On Sun, Jul 24, 2016 at 5:12 PM, Jean Georges Perrin <[email protected]> wrote:
>
> Hi,
>
> Here is my UDF that should build a VectorUDT. How do I actually make that
> the value is in the vector?
>
> package net.jgp.labs.spark.udf;
>
> import org.apache.spark.mllib.linalg.VectorUDT;
> import org.apache.spark.sql.api.java.UDF1;
>
> public class VectorBuilder implements UDF1<Integer, VectorUDT> {
> private static final long serialVersionUID = -2991355883253063841L;
>
> @Override
> public VectorUDT call(Integer t1) throws Exception {
> return new VectorUDT();
> }
>
> }
>
> i plan on having this used by a linear regression in ML...
>