Re: feeding DataFrames into predictive algorithms

2015-02-17 Thread Xiangrui Meng
Hey Sandy,

The work should be done by a VectorAssembler, which combines multiple
columns (double/int/vector) into a single vector column that becomes the
features column for regression. We are going to create JIRAs for each
of these standard feature transformers. It would be great if you could
help implement some of them.
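
A minimal sketch of what using such an assembler could look like, assuming
the spark.ml transformer style described above and hypothetical column
names "a", "b", "c" (an illustration, not code from this thread):

import org.apache.spark.ml.feature.VectorAssembler

// Pack the raw double columns "b" and "c" into a single vector column
// named "features"; "a" is renamed to "label" for the regression step.
val assembler = new VectorAssembler()
  .setInputCols(Array("b", "c"))
  .setOutputCol("features")

val assembled = assembler
  .transform(df)                    // df: [a: double, b: double, c: double]
  .withColumnRenamed("a", "label")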

Best,
Xiangrui

On Wed, Feb 11, 2015 at 7:55 PM, Patrick Wendell pwend...@gmail.com wrote:
 I think there is a minor error here in that the first example needs a
 tail after the seq:

 df.map { row =
   (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))
 }.toDataFrame(label, features)

 On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust
 mich...@databricks.com wrote:
 It sounds like you probably want to do a standard Spark map, that results in
 a tuple with the structure you are looking for.  You can then just assign
 names to turn it back into a dataframe.

 Assuming the first column is your label and the rest are features you can do
 something like this:

 val df = sc.parallelize(
   (1.0, 2.3, 2.4) ::
   (1.2, 3.4, 1.2) ::
   (1.2, 2.3, 1.2) :: Nil).toDataFrame(a, b, c)

 df.map { row =
   (row.getDouble(0), row.toSeq.map(_.asInstanceOf[Double]))
 }.toDataFrame(label, features)

 df: org.apache.spark.sql.DataFrame = [label: double, features:
 arraydouble]

 If you'd prefer to stick closer to SQL you can define a UDF:

 val createArray = udf((a: Double, b: Double) = Seq(a, b))
 df.select('a as 'label, createArray('b,'c) as 'features)

 df: org.apache.spark.sql.DataFrame = [label: double, features:
 arraydouble]

 We'll add createArray as a first class member of the DSL.

 Michael

 On Wed, Feb 11, 2015 at 6:37 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hey All,

 I've been playing around with the new DataFrame and ML pipelines APIs and
 am having trouble accomplishing what seems like should be a fairly basic
 task.

 I have a DataFrame where each column is a Double.  I'd like to turn this
 into a DataFrame with a features column and a label column that I can feed
 into a regression.

 So far all the paths I've gone down have led me to internal APIs or
 convoluted casting in and out of RDD[Row] and DataFrame.  Is there a simple
 way of accomplishing this?

 any assistance (lookin' at you Xiangrui) much appreciated,
 Sandy



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



feeding DataFrames into predictive algorithms

2015-02-11 Thread Sandy Ryza
Hey All,

I've been playing around with the new DataFrame and ML pipelines APIs and
am having trouble accomplishing what seems like it should be a fairly basic
task.

I have a DataFrame where each column is a Double.  I'd like to turn this
into a DataFrame with a features column and a label column that I can feed
into a regression.
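
For concreteness, the desired transformation sketched as schemas (the column
names here are hypothetical):

// before: every column is a plain Double
//   [a: double, b: double, c: double]
// after: one label column plus one combined features column
//   [label: double, features: array<double> (or a vector type)]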

So far all the paths I've gone down have led me to internal APIs or
convoluted casting in and out of RDD[Row] and DataFrame.  Is there a simple
way of accomplishing this?

any assistance (lookin' at you Xiangrui) much appreciated,
Sandy


Re: feeding DataFrames into predictive algorithms

2015-02-11 Thread Michael Armbrust
It sounds like you probably want to do a standard Spark map that results
in a tuple with the structure you are looking for. You can then just
assign names to turn it back into a DataFrame.

Assuming the first column is your label and the rest are features, you can
do something like this:

val df = sc.parallelize(
  (1.0, 2.3, 2.4) ::
  (1.2, 3.4, 1.2) ::
  (1.2, 2.3, 1.2) :: Nil).toDataFrame("a", "b", "c")

df.map { row =>
  (row.getDouble(0), row.toSeq.map(_.asInstanceOf[Double]))
}.toDataFrame("label", "features")

df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>]

If you'd prefer to stick closer to SQL, you can define a UDF:

val createArray = udf((a: Double, b: Double) => Seq(a, b))
df.select('a as 'label, createArray('b, 'c) as 'features)

df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>]

We'll add createArray as a first-class member of the DSL.
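
As a hedged sketch of what that could look like from the caller's side once
a built-in exists (the array function below is an assumption about the
eventual DSL, not something confirmed in this thread):

import org.apache.spark.sql.functions.array

// Same shape of result as the createArray UDF above, but using a
// built-in column function instead of a user-defined one.
df.select(df("a") as "label", array(df("b"), df("c")) as "features")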

Michael


Re: feeding DataFrames into predictive algorithms

2015-02-11 Thread Patrick Wendell
I think there is a minor error here in that the first example needs a
.tail after the toSeq:

df.map { row =>
  (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))
}.toDataFrame("label", "features")
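
To make the end goal concrete, a hedged sketch of feeding the corrected
result into a spark.mllib regressor; it assumes LabeledPoint, Vectors and
LinearRegressionWithSGD are the intended consumers, which the thread does
not actually specify:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Label from the first column; the remaining doubles (note the .tail)
// are packed into a dense vector that mllib can consume directly.
val points = df.map { row =>
  LabeledPoint(
    row.getDouble(0),
    Vectors.dense(row.toSeq.tail.map(_.asInstanceOf[Double]).toArray))
}

val model = LinearRegressionWithSGD.train(points, 100)  // 100 iterations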
