Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-17668



Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
ok will create jira



Re: udf forces usage of Row for complex types?

2016-09-26 Thread Michael Armbrust
I agree this should work. We just haven't finished killing the old reflection-based conversion logic now that we have more powerful and efficient encoders. Please open a JIRA.
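Michael's point about encoders can be illustrated: the typed Dataset API already round-trips between Row and a case class. Below is a minimal sketch; the Spark wiring is left in comments (it assumes Spark 2.0, a SparkSession named `spark`, and `import spark.implicits._`, and has not been run against a live session), so that the typed transformation itself stands alone:

```scala
case class Person(name: String, age: Int)

// The typed transformation the udf would like to perform:
def bump(p: Person): Person = p.copy(age = p.age + 1)

// Sketch of the encoder-based route (hypothetical wiring, untested):
//   val ds  = df.as[(Person, Int)]              // encoder turns each Row into a Person
//   val df1 = ds.map { case (p, id) => (bump(p), id) }
//               .toDF("person", "id")           // encoder turns Person back into a Row
```

The encoder does at the Dataset boundary exactly the conversion the udf machinery is missing, which is why reusing it for udfs seems natural.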



Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
Case classes are serializable by default (they extend the Serializable trait).

I am not using RDD or Dataset because I need to transform one column out of
200 or so.

Dataset already has the mechanisms to convert rows to case classes as needed
(and to make sure the result is consistent with the schema). Why couldn't
that code be reused to make udfs a lot nicer?
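Koert's first claim, that case classes are serializable out of the box, can be checked with plain JVM serialization and no Spark at all (`roundTrip` below is a helper written for this sketch):

```scala
import java.io._

case class Person(name: String, age: Int)

// Serialize to bytes and back; throws NotSerializableException if the
// object is not serializable.
def roundTrip(a: Serializable): AnyRef = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(a)
  out.close()
  new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray)).readObject()
}
```

Since case classes also get structural equality for free, the deserialized copy compares equal to the original.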



RE: udf forces usage of Row for complex types?

2016-09-26 Thread ming.he
It should be UserDefinedType.

You can refer to 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala

From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: Monday, September 26, 2016 5:42 AM
To: user@spark.apache.org
Subject: udf forces usage of Row for complex types?



Re: udf forces usage of Row for complex types?

2016-09-25 Thread Bedrytski Aliaksandr
Hi Koert,

the case classes you are talking about should be serializable to be
efficient (via kryo or plain Java serialization).

A DataFrame is not simply a collection of Rows (which are serializable by
default); it also carries a schema with a type for each column. That way any
columnar data can be represented without creating a custom case class each
time.

If you want to manipulate a collection of case classes, why not use good
old RDDs (or Datasets if you are using Spark 2.0)? If you want to run SQL
against that collection, you will need to tell your application how to read
it as a table, by transforming it into a DataFrame.

Regards
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



On Sun, Sep 25, 2016, at 23:41, Koert Kuipers wrote:
> after having gotten used to have case classes represent complex
> structures in Datasets, i am surprised to find out that when i work in
> DataFrames with udfs no such magic exists, and i have to fall back to
> manipulating Row objects, which is error prone and somewhat ugly.
> for example:
> case class Person(name: String, age: Int)
>
> val df = Seq((Person("john", 33), 5), (Person("mike", 30),
> 6)).toDF("person", "id")
> val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age =
> p.age + 1) }).apply(col("person")))
> df1.printSchema
> df1.show
> leads to:
> java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot
> be cast to Person


udf forces usage of Row for complex types?

2016-09-25 Thread Koert Kuipers
After having gotten used to case classes representing complex structures in
Datasets, I was surprised to find that no such magic exists when I work in
DataFrames with udfs: I have to fall back to manipulating Row objects, which
is error prone and somewhat ugly.

for example:
case class Person(name: String, age: Int)

val df = Seq((Person("john", 33), 5), (Person("mike", 30), 6)).toDF("person", "id")
val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age + 1) }).apply(col("person")))
df1.printSchema
df1.show

leads to:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to Person
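For reference, the workaround the thread complains about is writing the udf against Row and pulling fields out by name. Below is a minimal model of that pattern: `SimpleRow` is a hypothetical stand-in for `org.apache.spark.sql.Row`, used so the sketch compiles without Spark on the classpath, and the real wiring is left in comments as an untested sketch:

```scala
case class Person(name: String, age: Int)

// Hypothetical stand-in for org.apache.spark.sql.Row: fields are looked up
// by name and cast at runtime, which is exactly what makes this error prone.
case class SimpleRow(values: Map[String, Any]) {
  def getAs[T](field: String): T = values(field).asInstanceOf[T]
}

// The Row-based udf body: a typo in a field name or a wrong type parameter
// only fails at runtime, not at compile time.
def bumpAge(r: SimpleRow): Person =
  Person(r.getAs[String]("name"), r.getAs[Int]("age") + 1)

// Against a real DataFrame this would look roughly like (untested sketch;
// note that returning a case class from a udf works, only taking one as an
// argument fails):
//   import org.apache.spark.sql.Row
//   import org.apache.spark.sql.functions.{udf, col}
//   val bump = udf { (r: Row) =>
//     Person(r.getAs[String]("name"), r.getAs[Int]("age") + 1)
//   }
//   val df1 = df.withColumn("person", bump(col("person")))
```

The asymmetry, Row on the way in but a case class allowed on the way out, is the inconsistency that SPARK-17668 tracks.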