[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2021-02-22 Thread Simeon H.K. Fitch (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17288388#comment-17288388
 ] 

Simeon H.K. Fitch commented on SPARK-7768:
--

[~srowen] Thanks so much for all your excellent work on this!

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Sean R. Owen
>Priority: Critical
> Fix For: 3.2.0
>
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2021-01-22 Thread Simeon H.K. Fitch (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270256#comment-17270256
 ] 

Simeon H.K. Fitch commented on SPARK-7768:
--

[~xkrogen]

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.






[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2021-01-22 Thread Simeon H.K. Fitch (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270148#comment-17270148
 ] 

Simeon H.K. Fitch commented on SPARK-7768:
--

Not being familiar with the Spark issue triage and prioritization process, I'm 
wondering whether there are other forums where we might advocate for low-level 
or extensibility features such as this. A chat session? A video meeting? 
Something else? 

Up to now, RasterFrames and GeoMesa have been using backdoor techniques to work 
around the UDT restriction. However, this is no longer viable on modern JVMs 
(i.e. newer than 8) with the new package/module restrictions. Our users are 
eager for us to support those JVM versions, but this ticket has left us in a 
bind as to how to serve them. I'd hate to see RasterFrames and GeoMesa (and 
others) die on the vine because we can no longer keep up with the JVM ecosystem.

(Note: while I have been programming against the internals of Spark for over 4 
years (c.f. RasterFrames), the feature advocacy process is still opaque to me, 
likely through my own fault. If there's another way, I'd appreciate some kind 
person pointing me in the right direction.)

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.






[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2019-10-02 Thread Simeon H.K. Fitch (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942906#comment-16942906
 ] 

Simeon H.K. Fitch commented on SPARK-13802:
---

Is there a workaround to this problem? Ordering is important when encoders are 
used to reify structs into Scala types, and not being able to specify the order 
(without a lot of boilerplate schema work) results in exceptions.
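The mismatch is easy to reproduce in miniature. The sketch below uses a hypothetical `MiniRow` stand-in (not the real `pyspark.sql.Row`) to show both the alphabetical sorting of keyword fields that the report describes and a fixed-order factory that works around it:

```python
# Hypothetical stand-in for pyspark.sql.Row, illustrating the reported
# behavior: fields passed as keyword arguments are sorted by name.
class MiniRow(tuple):
    def __new__(cls, **kwargs):
        names = sorted(kwargs)  # alphabetical, like Row(**kwargs)
        row = super().__new__(cls, (kwargs[n] for n in names))
        row.__fields__ = names
        return row

r = MiniRow(id="39", first_name="Szymon")
# "first_name" sorts before "id", so the underlying tuple is
# ("Szymon", "39"), disagreeing with a schema declared as [id, first_name].

# Workaround mirrored from the real Row("id", "first_name") factory:
# pin the field order up front and pass values positionally.
def make_row_class(*names):
    def ctor(*values):
        row = tuple.__new__(MiniRow, values)
        row.__fields__ = list(names)
        return row
    return ctor

Person = make_row_class("id", "first_name")
p = Person("39", "Szymon")  # stays ("39", "Szymon"), matching the schema
```

In real PySpark the analogous workaround is `Person = Row("id", "first_name")` followed by `Person("39", "Szymon")`, which preserves the declared field order instead of sorting by name.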

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>Priority: Major
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +--+--+
> |id|first_name|
> +--+--+
> |Szymon|39|
> +--+--+
> {code}






[jira] [Commented] (SPARK-12823) Cannot create UDF with StructType input

2019-05-21 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844802#comment-16844802
 ] 

Simeon H.K. Fitch commented on SPARK-12823:
---

Hi [~hyukjin.kwon], is there a commit associated with this resolution?

> Cannot create UDF with StructType input
> ---
>
> Key: SPARK-12823
> URL: https://issues.apache.org/jira/browse/SPARK-12823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Frank Rosner
>Priority: Major
>  Labels: bulk-closed
>
> h5. Problem
> It is not possible to apply a UDF to a column that has a struct data type. 
> Two previous requests to the mailing list remained unanswered.
> h5. How-To-Reproduce
> {code}
> val sql = new org.apache.spark.sql.SQLContext(sc)
> import sql.implicits._
> case class KV(key: Long, value: String)
> case class Row(kv: KV)
> val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF
> val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
> df.select(udf1(df("kv"))).show
> // java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to $line78.$read$$iwC$$iwC$KV
> val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
> df.select(udf2(df("kv"))).show
> // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to 
> data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, 
> however, 'kv' is of struct type.;
> {code}
> h5. Mailing List Entries
> - 
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E
> - https://www.mail-archive.com/user@spark.apache.org/msg43092.html
> h5. Possible Workaround
> If you create a {{UserDefinedFunction}} manually, not using the {{udf}} 
> helper functions, it works. See https://github.com/FRosner/struct-udf, which 
> exposes the {{UserDefinedFunction}} constructor (public from package 
> private). However, then you have to work with a {{Row}}, because it does not 
> automatically convert the row to a case class / tuple.
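The trade-off described in the workaround can be sketched in pure Python (stand-in classes invented for illustration, not Spark's internals): a struct column reaches the UDF as a generic row-with-schema, never as the case class the function was written against, so only a row-typed function works.

```python
from dataclasses import dataclass

# Hypothetical mimic of Spark's generic row carrying a schema.
class GenericRowWithSchema:
    def __init__(self, values, schema):
        self.values = values
        self.schema = schema  # field names, e.g. ("key", "value")

    def get_as(self, name):
        return self.values[self.schema.index(name)]

@dataclass
class KV:
    key: int
    value: str

kv_struct = GenericRowWithSchema((1, "a"), ("key", "value"))

# UDF written against the case class: fails, because the runtime hands
# it the generic row (analogous to the ClassCastException above).
case_class_udf = lambda kv: kv.value
try:
    case_class_udf(kv_struct)
except AttributeError:
    pass  # the generic row has no "value" attribute

# UDF written against the generic row (the workaround's cost): works,
# but you must extract fields by name yourself.
row_udf = lambda row: row.get_as("value")
result = row_udf(kv_struct)  # "a"
```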






[jira] [Commented] (SPARK-23696) StructType.fromString swallows exceptions from DataType.fromJson

2019-05-21 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844808#comment-16844808
 ] 

Simeon H.K. Fitch commented on SPARK-23696:
---

[~hyukjin.kwon] is there a commit associated with this resolution?

> StructType.fromString swallows exceptions from DataType.fromJson
> 
>
> Key: SPARK-23696
> URL: https://issues.apache.org/jira/browse/SPARK-23696
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Simeon H.K. Fitch
>Priority: Trivial
>  Labels: bulk-closed
>
> `StructType.fromString` swallows exceptions from `DataType.fromJson`, 
> assuming they are an indication that the `LegacyTypeStringParser.parse` 
> should be called instead. When that fails (because it throws an exception), 
> an error message is generated that does not reflect the true problem at hand, 
> effectively swallowing the exception from `DataType.fromJson`. This makes 
> debugging Parquet schema issues more difficult.
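The anti-pattern and its fix can be sketched in a few lines of Python (illustrative names only, not Spark's actual code): when the fallback parser also fails, chain the original exception instead of discarding it, so the real cause stays visible.

```python
# Two parsers standing in for DataType.fromJson and
# LegacyTypeStringParser.parse; both reject the input here.
def parse_primary(s):
    raise ValueError(f"primary parser: unsupported field in {s!r}")

def parse_legacy(s):
    raise ValueError("legacy parser: cannot parse")

def from_string_bad(s):
    try:
        return parse_primary(s)
    except ValueError:
        # Swallows the informative primary error; only the legacy
        # parser's (less useful) message reaches the caller.
        return parse_legacy(s)

def from_string_good(s):
    try:
        return parse_primary(s)
    except ValueError as primary_err:
        try:
            return parse_legacy(s)
        except ValueError as legacy_err:
            # Chain the fallback failure to the original exception so
            # the true problem appears in the traceback.
            raise legacy_err from primary_err
```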






[jira] [Comment Edited] (SPARK-7768) Make user-defined type (UDT) API public

2018-08-22 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588911#comment-16588911
 ] 

Simeon H.K. Fitch edited comment on SPARK-7768 at 8/22/18 2:25 PM:
---

[Here's a 
synopsis|https://issues.apache.org/jira/browse/SPARK-12823?focusedCommentId=16171627&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16171627]
 of the issues with Encoder asymmetry that make UDTs even more necessary.


was (Author: metasim):
Here's a synopsis of the issues with Encoder asymmetry that makes UDTs even 
more necessary.

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.






[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2018-08-22 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588911#comment-16588911
 ] 

Simeon H.K. Fitch commented on SPARK-7768:
--

Here's a synopsis of the issues with Encoder asymmetry that make UDTs even 
more necessary.

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.






[jira] [Comment Edited] (SPARK-7768) Make user-defined type (UDT) API public

2018-08-22 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588906#comment-16588906
 ] 

Simeon H.K. Fitch edited comment on SPARK-7768 at 8/22/18 2:21 PM:
---

We use UDTs to great effect in [RasterFrames|http://rasterframes.io/] through a 
[very small 
hole|https://github.com/locationtech/rasterframes/blob/79b2c3129a2482055ae14d4fd9cf3693c425ece6/core/src/main/scala/org/apache/spark/sql/gt/types/TileUDT.scala#L33]
 in the private API. So from my perspective, basically all that needs to be 
changed is to [remove this line of 
code|https://github.com/apache/spark/blob/3323b156f9c0beb0b3c2b724a6faddc6ffdfe99a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala#L41].

I'm speaking of UDTs specifically; we have custom encoders too, which are by no 
means easy to implement. Unfortunately, `Encoder`-encoded types don't currently 
work with UDFs in the same way that UDTs do.


was (Author: metasim):
We use UDTs to great effect in [RasterFrames|http://rasterframes.io]  through a 
[very small 
hole|https://github.com/locationtech/rasterframes/blob/79b2c3129a2482055ae14d4fd9cf3693c425ece6/core/src/main/scala/org/apache/spark/sql/gt/types/TileUDT.scala#L33]
 in the private API. So from my perspective, basically all that needs to be 
changed is to [remove this line of 
code|https://github.com/apache/spark/blob/3323b156f9c0beb0b3c2b724a6faddc6ffdfe99a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala#L41].

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.






[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2018-08-22 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588906#comment-16588906
 ] 

Simeon H.K. Fitch commented on SPARK-7768:
--

We use UDTs to great effect in [RasterFrames|http://rasterframes.io] through a 
[very small 
hole|https://github.com/locationtech/rasterframes/blob/79b2c3129a2482055ae14d4fd9cf3693c425ece6/core/src/main/scala/org/apache/spark/sql/gt/types/TileUDT.scala#L33]
 in the private API. So from my perspective, basically all that needs to be 
changed is to [remove this line of 
code|https://github.com/apache/spark/blob/3323b156f9c0beb0b3c2b724a6faddc6ffdfe99a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala#L41].

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.






[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2018-08-02 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566948#comment-16566948
 ] 

Simeon H.K. Fitch commented on SPARK-14540:
---

Congratulations! A long, difficult haul... Cheers all around!

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: SPARK-14540
> URL: https://issues.apache.org/jira/browse/SPARK-14540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Stavros Kontopoulos
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running 
> ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures:
> {code}
> [info] - toplevel return statements in closures are identified at cleaning 
> time *** FAILED *** (32 milliseconds)
> [info]   Expected exception 
> org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no 
> exception was thrown. (ClosureCleanerSuite.scala:57)
> {code}
> and
> {code}
> [info] - user provided closures are actually cleaned *** FAILED *** (56 
> milliseconds)
> [info]   Expected ReturnStatementInClosureException, but got 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not 
> serializable: java.io.NotSerializableException: java.lang.Object
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class 
> org.apache.spark.util.TestUserClosuresActuallyCleaned$, 
> functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I,
>  implementation=invokeStatic 
> org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I,
>  instantiatedMethodType=(I)I, numCaptured=1])
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, 
> functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=1])
> [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: 
> "f", type: "interface scala.Function3")
> [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", 
> MapPartitionsRDD[2] at apply at Transformer.scala:22)
> [info]- field (class "scala.Tuple2", name: "_1", type: "class 
> java.lang.Object")
> [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at 
> apply at 
> Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)).
> [info]   This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78)
> {code}
> We'll need to figure out a closure cleaning strategy which works for 2.12 
> lambdas.






[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-08-02 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566945#comment-16566945
 ] 

Simeon H.K. Fitch commented on SPARK-14220:
---

(flag)(*)(*r)(*g)(*b)(*y):D(*y)(*b)(*g)(*r)(*)(flag)

Way to go! This is amazing.

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.






[jira] [Issue Comment Deleted] (SPARK-14220) Build and test Spark against Scala 2.12

2018-08-02 Thread Simeon H.K. Fitch (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon H.K. Fitch updated SPARK-14220:
--
Comment: was deleted

(was: (flag)(*)(*r)(*g)(*b)(*y):D(*y)(*b)(*g)(*r)(*)(flag)

Way to go! This is amazing.)

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.






[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-08-02 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566942#comment-16566942
 ] 

Simeon H.K. Fitch commented on SPARK-14220:
---

(flag)(*)(*r)(*g)(*b)(*y):D(*y)(*b)(*g)(*r)(*)(flag)

Way to go! This is amazing.

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.






[jira] [Updated] (SPARK-24649) SparkUDF.unapply is not backwards compatable

2018-06-25 Thread Simeon H.K. Fitch (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon H.K. Fitch updated SPARK-24649:
--
Priority: Minor  (was: Major)

> SparkUDF.unapply is not backwards compatable
> 
>
> Key: SPARK-24649
> URL: https://issues.apache.org/jira/browse/SPARK-24649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Simeon H.K. Fitch
>Priority: Minor
>
> The shape of the `ScalaUDF` case class changed in 2.3.0.  A secondary 
> constructor that's backwards compatible with 2.1.x and 2.2.x was provided, 
> but a corresponding `unapply` method wasn't included. Therefore code such as 
> the following that worked in 2.1.x and 2.2.x no longer compiles:
> {code:java}
> val ScalaUDF(function, dataType, children, inputTypes, udfName) = myUDF
> {code}






[jira] [Updated] (SPARK-24649) SparkUDF.unapply is not backwards compatable

2018-06-25 Thread Simeon H.K. Fitch (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon H.K. Fitch updated SPARK-24649:
--
Description: 
The shape of the `ScalaUDF` case class changed in 2.3.0.  A secondary 
constructor that's backwards compatible with 2.1.x and 2.2.x was provided, but 
a corresponding `unapply` method wasn't included. Therefore code such as the 
following that worked in 2.1.x and 2.2.x no longer compiles:

{code:java}
val ScalaUDF(function, dataType, children, inputTypes, udfName) = myUDF
{code}


  was:
The shape of the `ScalaUDF` case class changed in 2.3.0.  A secondary 
constructor that's backwards compatible with 2.1.x and 2.2.x was provided, but 
a corresponding `unapply` method wasn't included. Therefore code such as the 
following that worked in 2.1.x and 2.2.x no longer compiles:

{code:java}
val ScalaUDF(function, dataType, children, inputTypes, udfName) = myUDF
{code}

Scala automatically generates an `unapply` method for the primary constructor 
of case classes in the companion object. An appropriate fix for this would be to 
manually provide a secondary `unapply` method for the 2.2.x signature.


> SparkUDF.unapply is not backwards compatable
> 
>
> Key: SPARK-24649
> URL: https://issues.apache.org/jira/browse/SPARK-24649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Simeon H.K. Fitch
>Priority: Major
>
> The shape of the `ScalaUDF` case class changed in 2.3.0.  A secondary 
> constructor that's backwards compatible with 2.1.x and 2.2.x was provided, 
> but a corresponding `unapply` method wasn't included. Therefore code such as 
> the following that worked in 2.1.x and 2.2.x no longer compiles:
> {code:java}
> val ScalaUDF(function, dataType, children, inputTypes, udfName) = myUDF
> {code}






[jira] [Created] (SPARK-24649) SparkUDF.unapply is not backwards compatable

2018-06-25 Thread Simeon H.K. Fitch (JIRA)
Simeon H.K. Fitch created SPARK-24649:
-

 Summary: SparkUDF.unapply is not backwards compatable
 Key: SPARK-24649
 URL: https://issues.apache.org/jira/browse/SPARK-24649
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.3.0
Reporter: Simeon H.K. Fitch


The shape of the `ScalaUDF` case class changed in 2.3.0.  A secondary 
constructor that's backwards compatible with 2.1.x and 2.2.x was provided, but 
a corresponding `unapply` method wasn't included. Therefore code such as the 
following that worked in 2.1.x and 2.2.x no longer compiles:

{code:java}
val ScalaUDF(function, dataType, children, inputTypes, udfName) = myUDF
{code}

Scala automatically generates an `unapply` method for the primary constructor 
of case classes in the companion object. An appropriate fix for this would be to 
manually provide a secondary `unapply` method for the 2.2.x signature.
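The compatibility gap can be illustrated with a hypothetical Python analogue (names invented for illustration; this is not Spark's code): when a class's canonical shape grows a field, destructuring written against the old shape breaks unless an extractor matching the old arity is kept alongside the new one.

```python
# Stand-in for ScalaUDF: the "constructor" grew extra fields in 2.3.0,
# so five-way unpacking of the new shape no longer lines up.
class ScalaUDFLike:
    def __init__(self, function, dataType, children, inputTypes,
                 udfName=None, nullable=True):
        self.function = function
        self.dataType = dataType
        self.children = children
        self.inputTypes = inputTypes
        self.udfName = udfName
        self.nullable = nullable

    def to_tuple(self):
        # New, six-field shape (analogous to the generated unapply).
        return (self.function, self.dataType, self.children,
                self.inputTypes, self.udfName, self.nullable)

    def to_legacy_tuple(self):
        # Secondary extractor matching the old 2.2.x five-field shape,
        # the analogue of the proposed secondary unapply.
        return (self.function, self.dataType, self.children,
                self.inputTypes, self.udfName)

u = ScalaUDFLike(len, "string", [], [], "myUdf")
# Old-style destructuring keeps compiling against the legacy extractor:
function, dataType, children, inputTypes, udfName = u.to_legacy_tuple()
```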






[jira] [Commented] (SPARK-12823) Cannot create UDF with StructType input

2018-04-02 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422189#comment-16422189
 ] 

Simeon H.K. Fitch commented on SPARK-12823:
---

[~gbarna] Nice! Thanks for sharing!

> Cannot create UDF with StructType input
> ---
>
> Key: SPARK-12823
> URL: https://issues.apache.org/jira/browse/SPARK-12823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Frank Rosner
>Priority: Major
>
> h5. Problem
> It is not possible to apply a UDF to a column that has a struct data type. 
> Two previous requests to the mailing list remained unanswered.
> h5. How-To-Reproduce
> {code}
> val sql = new org.apache.spark.sql.SQLContext(sc)
> import sql.implicits._
> case class KV(key: Long, value: String)
> case class Row(kv: KV)
> val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF
> val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
> df.select(udf1(df("kv"))).show
> // java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to $line78.$read$$iwC$$iwC$KV
> val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
> df.select(udf2(df("kv"))).show
> // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to 
> data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, 
> however, 'kv' is of struct type.;
> {code}
> h5. Mailing List Entries
> - 
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E
> - https://www.mail-archive.com/user@spark.apache.org/msg43092.html
> h5. Possible Workaround
> If you create a {{UserDefinedFunction}} manually, not using the {{udf}} 
> helper functions, it works. See https://github.com/FRosner/struct-udf, which 
> exposes the {{UserDefinedFunction}} constructor (public from package 
> private). However, then you have to work with a {{Row}}, because it does not 
> automatically convert the row to a case class / tuple.






[jira] [Commented] (SPARK-23696) StructType.fromString swallows exceptions from DataType.fromJson

2018-03-16 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401979#comment-16401979
 ] 

Simeon H.K. Fitch commented on SPARK-23696:
---

Not sure what you mean by "we", but I'm just using the Parquet reader, which 
calls `StructType.fromString`; i.e. I don't have a choice.

> StructType.fromString swallows exceptions from DataType.fromJson
> 
>
> Key: SPARK-23696
> URL: https://issues.apache.org/jira/browse/SPARK-23696
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Simeon H.K. Fitch
>Priority: Trivial
>
> `StructType.fromString` swallows exceptions from `DataType.fromJson`, 
> assuming they are an indication that the `LegacyTypeStringParser.parse` 
> should be called instead. When that fails (because it throws an exception), 
> an error message is generated that does not reflect the true problem at hand, 
> effectively swallowing the exception from `DataType.fromJson`. This makes 
> debugging Parquet schema issues more difficult.






[jira] [Created] (SPARK-23696) StructType.fromString swallows exceptions from DataType.fromJson

2018-03-15 Thread Simeon H.K. Fitch (JIRA)
Simeon H.K. Fitch created SPARK-23696:
-

 Summary: StructType.fromString swallows exceptions from 
DataType.fromJson
 Key: SPARK-23696
 URL: https://issues.apache.org/jira/browse/SPARK-23696
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.1
Reporter: Simeon H.K. Fitch


`StructType.fromString` swallows exceptions from `DataType.fromJson`, assuming 
they are an indication that the `LegacyTypeStringParser.parse` should be called 
instead. When that fails (because it throws an exception), an error message is 
generated that does not reflect the true problem at hand, effectively 
swallowing the exception from `DataType.fromJson`. This makes debugging Parquet 
schema issues more difficult.






[jira] [Comment Edited] (SPARK-12823) Cannot create UDF with StructType input

2017-09-20 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173702#comment-16173702
 ] 

Simeon H.K. Fitch edited comment on SPARK-12823 at 9/20/17 7:16 PM:


As a coda, it's interesting to note that you can *return* a `Product` type from 
a UDF, and the conversion works fine:

{code:java}
  val litValue = udf(() ⇒ KV(4L, "four"))

  ds.select(litValue()).show
  //  ++
  //  |   UDF()|
  //  ++
  //  |[4,four]|
  //  |[4,four]|
  //  ++

  println(ds.select(litValue().as[KV]).first)
  // KV(4,four)

{code}

So there's a weird asymmetry to it as well.



> Cannot create UDF with StructType input
> ---
>
> Key: SPARK-12823
> URL: https://issues.apache.org/jira/browse/SPARK-12823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Frank Rosner
>
> h5. Problem
> It is not possible to apply a UDF to a column that has a struct data type. 
> Two previous requests to the mailing list remained unanswered.
> h5. How-To-Reproduce
> {code}
> val sql = new org.apache.spark.sql.SQLContext(sc)
> import sql.implicits._
> case class KV(key: Long, value: String)
> case class Row(kv: KV)
> val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF
> val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
> df.select(udf1(df("kv"))).show
> // java.lang.ClassCastException:
> // org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast
> // to $line78.$read$$iwC$$iwC$KV
> val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
> df.select(udf2(df("kv"))).show
> // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to
> // data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type,
> // however, 'kv' is of struct<key:bigint,value:string> type.;
> {code}
> h5. Mailing List Entries
> - 
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E
> - https://www.mail-archive.com/user@spark.apache.org/msg43092.html
> h5. Possible Workaround
> If you create a {{UserDefinedFunction}} manually, not using the {{udf}} 
> helper functions, it works. See https://github.com/FRosner/struct-udf, which 
> exposes the {{UserDefinedFunction}} constructor (public from package 
> private). However, then you have to work with a {{Row}}, because it does not 
> automatically convert the row to a case class / tuple.
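To make the cost of that workaround concrete, here is a Spark-free Scala sketch contrasting typed case-class access with positional, untyped row access. `ToyRow` is an illustrative stand-in and is not `org.apache.spark.sql.Row`:

```scala
// Illustrative stand-in for working with an untyped row: fields are
// accessed by position and cast by hand, losing compile-time safety.
final case class ToyRow(values: Seq[Any]) {
  def getLong(i: Int): Long = values(i).asInstanceOf[Long]
  def getString(i: Int): String = values(i).asInstanceOf[String]
}

case class KV(key: Long, value: String)

// Typed access: the compiler checks field names and types.
def typedValue(kv: KV): String = kv.value

// Untyped workaround: index and expected type must be kept in sync manually.
def untypedValue(row: ToyRow): String = row.getString(1)
```

The untyped version compiles even if the index or cast is wrong; the mistake only surfaces at runtime, which is exactly the safety the `udf` helper is supposed to provide.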









[jira] [Commented] (SPARK-12823) Cannot create UDF with StructType input

2017-09-19 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171735#comment-16171735
 ] 

Simeon H.K. Fitch commented on SPARK-12823:
---

[~cloud_fan] If the intent is to *not* support `Encoder`-supported types in 
UDFs, then the bug should be restated: 

> Spark should throw a compiler error when non-Catalyst types are used in UDFs.

This could be done with type classes.

That said, because the machinery is already there to transform between 
`DataFrame`s and `Dataset`s via the `TypedColumn` API, it seems to me that 
Catalyst is very close to being able to support any type that has an `Encoder` 
in a UDF.
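For what it's worth, the type-class idea can be sketched in plain Scala. None of the names below are Spark APIs; this is just an illustration of how implicit evidence could reject unsupported UDF argument types at compile time rather than at runtime:

```scala
// Evidence that a type is directly usable as a (toy) UDF argument.
trait CatalystInput[A]
object CatalystInput {
  implicit val longEv: CatalystInput[Long] = new CatalystInput[Long] {}
  implicit val stringEv: CatalystInput[String] = new CatalystInput[String] {}
  // Deliberately no instance for tuples or case classes: a call like
  // `toyUdf((kv: (Long, String)) => kv._2)` would fail to compile,
  // instead of throwing ClassCastException at runtime.
}

// A toy `udf` constructor that demands the evidence.
def toyUdf[A: CatalystInput, B](f: A => B): A => B = f

val upper = toyUdf((s: String) => s.toUpperCase)
```

In real Spark one could imagine deriving such evidence from `Encoder`, which is the observation above.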







[jira] [Comment Edited] (SPARK-12823) Cannot create UDF with StructType input

2017-09-19 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171668#comment-16171668
 ] 

Simeon H.K. Fitch edited comment on SPARK-12823 at 9/19/17 1:47 PM:


Here is a combined, runnable example, including an attempt at using SQL:

{code:java}
package examples

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

object UDFSadness extends App {
  implicit val spark = SparkSession.builder()
    .master("local").appName(getClass.getName).getOrCreate()
  import spark.implicits._

  case class KV(key: Long, value: String)
  case class MyRow(kv: KV)

  val ds: Dataset[MyRow] = spark.createDataset(List(MyRow(KV(1L, "a")), MyRow(KV(5L, "b"))))

  val firstColumn = ds(ds.columns.head)

  // Works, but is not what we want (can't always use `map` over `select`)
  ds.map(_.kv.value).show
  // +-----+
  // |value|
  // +-----+
  // |    a|
  // |    b|
  // +-----+

  // This is what we want to be able to implement
  val udf1 = udf((row: MyRow) ⇒ row.kv.value)

  try {
    ds.select(udf1(firstColumn)).show
  }
  catch {
    case t: Throwable ⇒ t.printStackTrace()
    // Exception in thread "main" org.apache.spark.sql.AnalysisException:
    // cannot resolve 'UDF(kv)' due to data type mismatch: argument 1 requires
    // struct<kv:struct<key:bigint,value:string>> type, however,
    // '`kv`' is of struct<key:bigint,value:string> type.;;
  }

  // So let's try something of the form reported in the error
  val udf2 = udf((kv: KV) ⇒ kv.value)

  try {
    ds.select(udf2(firstColumn)).show
  }
  catch {
    case t: Throwable ⇒ t.printStackTrace()
    // java.lang.ClassCastException:
    // org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
    // cannot be cast to examples.UDFSadness$KV
  }

  // What if it's a problem with the use of untyped columns?
  // Try the above again with typed columns.

  try {
    ds.select(udf1(firstColumn.as[MyRow])).show
  }
  catch {
    case t: Throwable ⇒ t.printStackTrace()
    // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to
    // data type mismatch: argument 1 requires
    // struct<kv:struct<key:bigint,value:string>> type,
    // however, '`kv`' is of struct<key:bigint,value:string> type.;;
  }

  try {
    ds.select(udf2(firstColumn.as[KV])).show
  }
  catch {
    case t: Throwable ⇒ t.printStackTrace()
    // java.lang.ClassCastException:
    // org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
    // cannot be cast to examples.UDFSadness$KV
  }

  // Let's see if we can use SQL as a back door.
  spark.sqlContext.udf.register("udf1", (row: MyRow) ⇒ row.kv.value)
  try {
    ds.createOrReplaceTempView("myKVs")
    spark.sql(s"select udf1($firstColumn) from myKVs").show
  }
  catch {
    case t: Throwable ⇒ t.printStackTrace()
    // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to
    // data type mismatch: argument 1 requires
    // struct<kv:struct<key:bigint,value:string>> type,
    // however, 'mykvs.`kv`' is of struct<key:bigint,value:string> type.; line 1 pos 7;
  }

  // This is the unfortunate workaround. Note that `Row` is
  // `org.apache.spark.sql.Row`
  val udf3 = udf((row: Row) ⇒ row.getString(1))

  ds.select(udf3(firstColumn)).show
  //  +-------+
  //  |UDF(kv)|
  //  +-------+
  //  |      a|
  //  |      b|
  //  +-------+
}
{code}






[jira] [Comment Edited] (SPARK-12823) Cannot create UDF with StructType input

2017-09-19 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171668#comment-16171668
 ] 

Simeon H.K. Fitch edited comment on SPARK-12823 at 9/19/17 1:21 PM:


Here is a combined, runnable example:


{code:java}

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

object UDFSadness extends App {
  implicit val spark = SparkSession.builder()
    .master("local").appName(getClass.getName).getOrCreate()
  import spark.implicits._

  case class KV(key: Long, value: String)
  case class MyRow(kv: KV)

  val ds: Dataset[MyRow] =
    spark.createDataset(List(MyRow(KV(1L, "a")), MyRow(KV(5L, "b"))))

  val firstColumn = ds(ds.columns.head)

  // Works, but is not what we want (can't always use `map` over `select`)
  ds.map(_.kv.value).show

  // This is what we want to be able to implement
  val udf1 = udf((row: MyRow) ⇒ row.kv.value)

  try {
    ds.select(udf1(firstColumn)).show
  }
  catch {
    case t: Throwable ⇒ t.printStackTrace()
    // Exception in thread "main" org.apache.spark.sql.AnalysisException:
    // cannot resolve 'UDF(kv)' due to data type mismatch: argument 1 requires
    // struct<kv:struct<key:bigint,value:string>> type, however,
    // '`kv`' is of struct<key:bigint,value:string> type.;;
  }

  // So let's try something of the form reported in the error
  val udf2 = udf((kv: KV) ⇒ kv.value)

  try {
    ds.select(udf2(firstColumn)).show
  }
  catch {
    case t: Throwable ⇒ // t.printStackTrace()
    // java.lang.ClassCastException:
    // org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
    // cannot be cast to examples.UDFSadness$KV
  }

  // What if it's a problem with the use of untyped columns?
  // Try the above again with typed columns.

  try {
    ds.select(udf1(firstColumn.as[MyRow])).show
  }
  catch {
    case t: Throwable ⇒ t.printStackTrace()
    // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to
    // data type mismatch: argument 1 requires
    // struct<kv:struct<key:bigint,value:string>> type,
    // however, '`kv`' is of struct<key:bigint,value:string> type.;;
  }

  try {
    ds.select(udf2(firstColumn.as[KV])).show
  }
  catch {
    case t: Throwable ⇒ t.printStackTrace()
    // java.lang.ClassCastException:
    // org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
    // cannot be cast to examples.UDFSadness$KV
  }

  // This is the unfortunate workaround:
  val udf3 = udf((row: Row) ⇒ row.getString(1))

  ds.select(udf3(firstColumn)).show

}

{code}



was (Author: metasim):
Here is a combined, runnable example:


{code:java}

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

object UDFSadness extends App {
  implicit val spark = SparkSession.builder()
.master("local").appName(getClass.getName).getOrCreate()
  import spark.implicits._

  case class KV(key: Long, value: String)
  case class MyRow(kv: KV)

  val ds: Dataset[MyRow] = spark.createDataset(List(MyRow(KV(1L, "a")), 
MyRow(KV(5L, "b"

  val firstColumn = ds(ds.columns.head)

  // Works, but is not what we want (can't always use `map` over `select`
  ds.map(_.kv.value).show

  // This is what we want to be able to implement
  val udf1 = udf((row: MyRow) ⇒ row.kv.value)

  try {
ds.select(udf1(firstColumn)).show
  }
  catch {
case t: Throwable ⇒ t.printStackTrace()
// Exception in thread "main" org.apache.spark.sql.AnalysisException:
// cannot resolve 'UDF(kv)' due to data type mismatch: argument 1 requires
// struct> type, however,
// '`kv`' is of struct type.;;
  }

  // So lets try something of the form reported in the error
  val udf2 = udf((kv: KV) ⇒ kv.value)

  try {
ds.select(udf2(firstColumn)).show
  }
  catch {
case t: Throwable ⇒ //t.printStackTrace()
// java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
// cannot be cast to examples.UDFSadness$KV
  }

  // What if it's a problem with the use of untyped columns?
  // Try the above again with typed columns.

  try {
ds.select(udf1(firstColumn.as[MyRow])).show
  }
  catch {
case t: Throwable ⇒ t.printStackTrace()
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to 
data type
// mismatch: argument 1 requires struct> 
type,
// however, '`kv`' is of struct type.;;
  }

  try {
ds.select(udf2(firstColumn.as[KV])).show
  }
  catch {
case t: Throwable ⇒ t.printStackTrace()
// java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
// cannot be cast to examples.UDFSadness$KV

  }

  // This is the unfortunate workaround:
  val udf3 = udf((row: Row) ⇒ row.getString(1))

  ds.select(udf3(firstColumn)).show

}

{code}



[jira] [Comment Edited] (SPARK-12823) Cannot create UDF with StructType input

2017-09-19 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171627#comment-16171627
 ] 

Simeon H.K. Fitch edited comment on SPARK-12823 at 9/19/17 12:53 PM:
-

When you say "Row", do you mean `org.apache.spark.sql.Row`? If so, then yes, this 
works:

{code:java}
val udf2 = udf((row: Row) => row.getString(1))
{code}

But that requires me to keep track of the schema, even though Catalyst (through 
`Encoder`) has all the type information it needs to reify it for me. There's no 
technical reason why `udf((row: KV) ⇒ row.value)` shouldn't be allowed. And even 
if there is (i.e. I'm wrong), it should be a compile-time error, not a runtime one.

But the example defines its own `Row` class, making this all more confusing. I 
tried `udf((row: NotTheSparkRow) => row.kv.value)`, but that doesn't work either.


was (Author: metasim):
When you say "Row", if you meant `org.apache.spark.sql.Row`? then yes, this 
works:

{code:java}
val udf2 = udf((row: Row) => row.getString(1))
{code}

But that's requiring me to keep track of the schema when Catalyst (through 
`Encoder`) has all the type information it needs to be able to reify that for 
me. There shouldn't be any technical reason why `udf((row: KV) ⇒ row.value)` 
shouldn't be allowed. And even if there is (i.e. I'm wrong), it should be a 
compile time error, not a runtime one.

But the example defined its own `Row` class making this all more confusing. I 
tried `udf((row: Row) => row.kv.value)` but that doesn't work either.

> Cannot create UDF with StructType input
> ---
>
> Key: SPARK-12823
> URL: https://issues.apache.org/jira/browse/SPARK-12823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Frank Rosner
>
> h5. Problem
> It is not possible to apply a UDF to a column that has a struct data type. 
> Two previous requests to the mailing list remained unanswered.
> h5. How-To-Reproduce
> {code}
> val sql = new org.apache.spark.sql.SQLContext(sc)
> import sql.implicits._
> case class KV(key: Long, value: String)
> case class Row(kv: KV)
> val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF
> val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
> df.select(udf1(df("kv"))).show
> // java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to $line78.$read$$iwC$$iwC$KV
> val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
> df.select(udf2(df("kv"))).show
> // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to 
> data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, 
> however, 'kv' is of struct<key:bigint,value:string> type.;
> {code}
> h5. Mailing List Entries
> - 
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E
> - https://www.mail-archive.com/user@spark.apache.org/msg43092.html
> h5. Possible Workaround
> If you create a {{UserDefinedFunction}} manually, not using the {{udf}} 
> helper functions, it works. See https://github.com/FRosner/struct-udf, which 
> exposes the {{UserDefinedFunction}} constructor (public from package 
> private). However, then you have to work with a {{Row}}, because it does not 
> automatically convert the row to a case class / tuple.






[jira] [Comment Edited] (SPARK-12823) Cannot create UDF with StructType input

2017-09-19 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171627#comment-16171627
 ] 

Simeon H.K. Fitch edited comment on SPARK-12823 at 9/19/17 12:52 PM:
-

When you say "Row", do you mean `org.apache.spark.sql.Row`? If so, then yes, this 
works:

{code:java}
val udf2 = udf((row: Row) => row.getString(1))
{code}

But that requires me to keep track of the schema, even though Catalyst (through 
`Encoder`) has all the type information it needs to reify it for me. There's no 
technical reason why `udf((row: KV) ⇒ row.value)` shouldn't be allowed. And even 
if there is (i.e. I'm wrong), it should be a compile-time error, not a runtime one.

But the example defines its own `Row` class, making this all more confusing. I 
tried `udf((row: Row) => row.value.value)`, but that doesn't work either.


was (Author: metasim):
When you say "Row", do you mean `org.apache.spark.sql.Row`? If so, then yes, 
this works:

{code:java}
val udf2 = udf((row: Row) => row.getString(1))
{code}

But that's requiring me to keep track of the schema when Catalyst (through 
`Encoder`) has all the type information it needs to be able to reify that for 
me. There shouldn't be any technical reason why `udf((row: KV) ⇒ row.value)` 
shouldn't be allowed. And even if there is (i.e. I'm wrong), it should be a 
compile time error, not a runtime one.




[jira] [Comment Edited] (SPARK-12823) Cannot create UDF with StructType input

2017-09-19 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171627#comment-16171627
 ] 

Simeon H.K. Fitch edited comment on SPARK-12823 at 9/19/17 12:52 PM:
-

When you say "Row", do you mean `org.apache.spark.sql.Row`? If so, then yes, this 
works:

{code:java}
val udf2 = udf((row: Row) => row.getString(1))
{code}

But that requires me to keep track of the schema, even though Catalyst (through 
`Encoder`) has all the type information it needs to reify it for me. There's no 
technical reason why `udf((row: KV) ⇒ row.value)` shouldn't be allowed. And even 
if there is (i.e. I'm wrong), it should be a compile-time error, not a runtime one.

But the example defines its own `Row` class, making this all more confusing. I 
tried `udf((row: Row) => row.kv.value)`, but that doesn't work either.


was (Author: metasim):
When you say "Row", if you meant `org.apache.spark.sql.Row`? then yes, this 
works:

{code:java}
val udf2 = udf((row: Row) => row.getString(1))
{code}

But that's requiring me to keep track of the schema when Catalyst (through 
`Encoder`) has all the type information it needs to be able to reify that for 
me. There shouldn't be any technical reason why `udf((row: KV) ⇒ row.value)` 
shouldn't be allowed. And even if there is (i.e. I'm wrong), it should be a 
compile time error, not a runtime one.

But the example defined its own `Row` class making this all more confusing. I 
tried `udf((row: Row) => row.value.value)` but that doesn't work either.




[jira] [Commented] (SPARK-12823) Cannot create UDF with StructType input

2017-09-19 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171627#comment-16171627
 ] 

Simeon H.K. Fitch commented on SPARK-12823:
---

When you say "Row", do you mean `org.apache.spark.sql.Row`? If so, then yes, 
this works:

{code:java}
val udf2 = udf((row: Row) => row.getString(1))
{code}

But that requires me to keep track of the schema, even though Catalyst (through 
`Encoder`) has all the type information it needs to reify it for me. There's no 
technical reason why `udf((row: KV) ⇒ row.value)` shouldn't be allowed. And even 
if there is (i.e. I'm wrong), it should be a compile-time error, not a runtime one.




[jira] [Comment Edited] (SPARK-12823) Cannot create UDF with StructType input

2017-09-18 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170540#comment-16170540
 ] 

Simeon H.K. Fitch edited comment on SPARK-12823 at 9/18/17 7:33 PM:


My suspicion is that the bug is 
[here|https://github.com/apache/spark/blob/c66d64b3df9d9ffba0b16a62015680f6f876fc68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L257]
 (converts from Product, but not back to it), or 
[here|https://github.com/apache/spark/blob/c66d64b3df9d9ffba0b16a62015680f6f876fc68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L57]
 (special case for UserDefinedType, but not for Product).




was (Author: metasim):
My suspicion is that the bug is 
[here](https://github.com/apache/spark/blob/c66d64b3df9d9ffba0b16a62015680f6f876fc68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L257)
 (converts from Product, but not back to it), or 
[here](https://github.com/apache/spark/blob/c66d64b3df9d9ffba0b16a62015680f6f876fc68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L57)
 (special case for UserDefinedType, but not for Product).






[jira] [Commented] (SPARK-12823) Cannot create UDF with StructType input

2017-09-18 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170540#comment-16170540
 ] 

Simeon H.K. Fitch commented on SPARK-12823:
---

My suspicion is that the bug is 
[here](https://github.com/apache/spark/blob/c66d64b3df9d9ffba0b16a62015680f6f876fc68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L257)
 (converts from Product, but not back to it), or 
[here](https://github.com/apache/spark/blob/c66d64b3df9d9ffba0b16a62015680f6f876fc68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L57)
 (special case for UserDefinedType, but not for Product).






[jira] [Comment Edited] (SPARK-12823) Cannot create UDF with StructType input

2017-09-18 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170511#comment-16170511
 ] 

Simeon H.K. Fitch edited comment on SPARK-12823 at 9/18/17 7:06 PM:


Unfortunately, using a `TypedColumn` doesn't help either.

{code:java}
val ds = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDS

val udf1 = udf((row: KV) ⇒ row.value)

ds.select(udf1(ds(ds.columns.head).as[Row])).show
// java.lang.ClassCastException:
// org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast
// to KV
{code}

Flummoxed that I'm only now running into this problem, and that it hasn't been 
fixed yet. Seems kinda major to me.


was (Author: metasim):
Unfortunately, using a `TypedColumn` doesn't help either.

{code:scala}
val ds = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b".toDS

val udf1 = udf((row: KV) ⇒ row.value)

ds.select(udf1(ds(ds.columns.head).as[Row])).show
//  java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to KV

{code}

Flummoxed that I'm only now running into his problem, and that it hasn't been 
fixed yet. Seems kinda major to me.




[jira] [Commented] (SPARK-12823) Cannot create UDF with StructType input

2017-09-18 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170511#comment-16170511
 ] 

Simeon H.K. Fitch commented on SPARK-12823:
---

Unfortunately, using a `TypedColumn` doesn't help either.

{code:scala}
val ds = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDS

val udf1 = udf((row: KV) ⇒ row.value)

ds.select(udf1(ds(ds.columns.head).as[Row])).show
// java.lang.ClassCastException:
// org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast
// to KV
{code}

Flummoxed that I'm only now running into this problem, and that it hasn't been 
fixed yet. Seems kinda major to me.




[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2017-06-02 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035356#comment-16035356
 ] 

Simeon H.K. Fitch commented on SPARK-7768:
--

[~pgrandjean] Once a UDT is registered, the `ExpressionEncoder` class (usually 
invoked by the functions in `Encoders`) automatically makes use of it.
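For readers landing here, a rough sketch of what defining such a UDT looks like. This is not a drop-in implementation: the UDT API is `private[spark]` in released versions, so code like this must be compiled under the `org.apache.spark` package namespace, and the `Point` domain class is hypothetical.

```scala
package org.apache.spark.example // needed while the UDT API is private[spark]

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.types._

// Hypothetical domain class, for illustration only.
class Point(val x: Double, val y: Double)

class PointUDT extends UserDefinedType[Point] {
  // Storage representation: a two-field struct.
  override def sqlType: DataType = StructType(Seq(
    StructField("x", DoubleType, nullable = false),
    StructField("y", DoubleType, nullable = false)))

  // Convert the user object to Catalyst's internal row format...
  override def serialize(p: Point): InternalRow =
    new GenericInternalRow(Array[Any](p.x, p.y))

  // ...and back again.
  override def deserialize(datum: Any): Point = datum match {
    case row: InternalRow => new Point(row.getDouble(0), row.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}
```

The UDT is then associated with the user class (e.g. via the `@SQLUserDefinedType` annotation on it), after which the encoder machinery picks it up as described above.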

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.






[jira] [Updated] (SPARK-19515) `flat` property in `ExpressionEncoder` needs documentation

2017-02-08 Thread Simeon H.K. Fitch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon H.K. Fitch updated SPARK-19515:
--
Description: 
It would be helpful to understand the semantics of the `flat` property in 
`ExpressionEncoder`, as this is the entry-point into implementing custom 
`Encoder`s for dataframes.

[Source code 
location|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L221]
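In the absence of documentation, a small sketch of the distinction `flat` appears to capture (my reading, not authoritative): a "flat" encoder serializes the object to a single column, while a product encoder serializes it to one column per field.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

case class KV(key: Long, value: String)

object FlatEncoderSketch extends App {
  // Primitive encoder: the value occupies a single column
  // (presumably the "flat" case).
  val longEnc: Encoder[Long] = Encoders.scalaLong
  println(longEnc.schema) // one field

  // Product encoder: one column per case-class field (not "flat").
  val kvEnc: Encoder[KV] = Encoders.product[KV]
  println(kvEnc.schema) // two fields: key, value
}
```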

> `flat` property in `ExpressionEncoder` needs documentation
> --
>
> Key: SPARK-19515
> URL: https://issues.apache.org/jira/browse/SPARK-19515
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Simeon H.K. Fitch
>Priority: Minor
>
> It would be helpful to understand the semantics of the `flat` property in 
> `ExpressionEncoder`, as this is the entry-point into implementing custom 
> `Encoder`s for dataframes.
> [Source code 
> location|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L221]






[jira] [Created] (SPARK-19515) `flat` property in `ExpressionEncoder` needs documentation

2017-02-08 Thread Simeon H.K. Fitch (JIRA)
Simeon H.K. Fitch created SPARK-19515:
-

 Summary: `flat` property in `ExpressionEncoder` needs documentation
 Key: SPARK-19515
 URL: https://issues.apache.org/jira/browse/SPARK-19515
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.1.0
Reporter: Simeon H.K. Fitch
Priority: Minor





