[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-09-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r214567945
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -40,7 +41,7 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[ScalaReflection.Schema]]) {
--- End diff --

(BTW it was PR https://github.com/apache/spark/pull/22259 that was merged)

We can add back accessors and constructors if it would make life easier for 
callers. But if this is protected, who are the callers we're accommodating 
here? Maybe some hacky but important integration?

We'd have to rename `inputTypes` and then add back an accessor with the old 
type.
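
For illustration, a minimal sketch of that rename-plus-accessor shape (the 
field name `inputSchemas` is hypothetical, not from the PR):

```
case class UserDefinedFunction protected[sql] (
    f: AnyRef,
    dataType: DataType,
    inputSchemas: Option[Seq[ScalaReflection.Schema]]) {

  // Accessor with the old name and return type, for source compatibility.
  def inputTypes: Option[Seq[DataType]] = inputSchemas.map(_.map(_.dataType))
}
```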


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-09-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r214560630
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -40,7 +41,7 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[ScalaReflection.Schema]]) {
--- End diff --

Ah, I see: it's a case class, so we would need to keep the 
`def inputTypes(): Option[Seq[DataType]]` accessor.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-09-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r214560519
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -40,7 +41,7 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[ScalaReflection.Schema]]) {
--- End diff --

But the constructor is `protected[sql]`


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-09-02 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r214560313
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -40,7 +41,7 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[ScalaReflection.Schema]]) {
--- End diff --

This is a stable API. Are we able to make this change in 2.4 instead of 3.0?


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-29 Thread adriaanm
Github user adriaanm commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r213702881
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: 
ClassificationModel[Featur
 var outputData = dataset
 var numColsOutput = 0
 if (getRawPredictionCol != "") {
-  val predictRawUDF = udf { (features: Any) =>
--- End diff --

No problem! Happy to help with the 2.12 upgrade.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r213696131
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: 
ClassificationModel[Featur
 var outputData = dataset
 var numColsOutput = 0
 if (getRawPredictionCol != "") {
-  val predictRawUDF = udf { (features: Any) =>
--- End diff --

I apologize, that's my mistake. In the end it isn't related to TypeTags for 
`Any`, and that isn't a difference between 2.11 and 2.12 anyway. Thanks for 
your input; I think we are close.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-29 Thread srowen
Github user srowen closed the pull request at:

https://github.com/apache/spark/pull/22063


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-29 Thread adriaanm
Github user adriaanm commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r213593596
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: 
ClassificationModel[Featur
 var outputData = dataset
 var numColsOutput = 0
 if (getRawPredictionCol != "") {
-  val predictRawUDF = udf { (features: Any) =>
--- End diff --

No idea, but in any case the new version seems nicer :-) Both 2.11 and 2.12 
will happily generate a `typeTag` for `Any`, though, so that wouldn't 
immediately explain it.  To see what was actually inferred, you could compile 
with `-Xprint:typer` (ideally after a full compile and then just making this 
file recompile incrementally).
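
For instance, assuming an sbt build, the flag can be enabled with:

```
// build.sbt: print trees after the typer phase on the next compile.
scalacOptions += "-Xprint:typer"
```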


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-29 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r213575267
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: 
ClassificationModel[Featur
 var outputData = dataset
 var numColsOutput = 0
 if (getRawPredictionCol != "") {
-  val predictRawUDF = udf { (features: Any) =>
--- End diff --

@adriaanm any more info?


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-25 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212800597
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3713,7 +3726,7 @@ object functions {
   | */
   |def udf[$typeTags](f: Function$x[$types]): UserDefinedFunction = {
   |  val ScalaReflection.Schema(dataType, nullable) = 
ScalaReflection.schemaFor[RT]
-  |  val inputTypes = Try($inputTypes).toOption
+  |  val inputTypes = Try($argSchema).toOption
--- End diff --

@cloud-fan this might be worth another look now. After making this change, I 
realize the whole reason it started failing may have been that I had moved the 
schema inference outside the `Try()`. Now it's back inside, which may make the 
whole problem go back to being silent. Did you mean you preferred tackling the 
problem directly rather than suppressing the failure to infer a schema? I 
added `udfInternal` above for that.

But maybe this isn't the best approach, as user UDFs could fail for the same 
reason. Maybe I need to back this whole thing out after all, now that your 
comments have helped me understand what's happening.
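
A sketch of the two placements being discussed (simplified, not the PR's 
exact code):

```
import scala.reflect.runtime.universe.TypeTag
import scala.util.Try

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.DataType

// Inference inside the Try: a failure to derive a schema collapses to None,
// silently disabling input type checks (the pre-existing behavior).
def inputTypesSilent[A1: TypeTag]: Option[Seq[DataType]] =
  Try(ScalaReflection.schemaFor[A1].dataType :: Nil).toOption

// Inference outside the Try: the failure surfaces as an exception at udf
// definition time instead of being swallowed.
def inputTypesStrict[A1: TypeTag]: Option[Seq[DataType]] = {
  val schema = ScalaReflection.schemaFor[A1] // throws for types like Any
  Some(schema.dataType :: Nil)
}
```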


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212504250
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[DataType]],
+nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

+1 to all your comments. I'm overhauling this whole PR and will force push 
with a rebase once it seems to basically work.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212499322
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[DataType]],
+nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

or `Option[Seq[ScalaReflection.Schema]]`


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212499164
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 ---
@@ -57,8 +59,21 @@ case class ScalaUDF(
   children: Seq[Expression],
   inputTypes: Seq[DataType],
   udfName: Option[String]) = {
-this(
-  function, dataType, children, inputTypes, udfName, nullable = true, 
udfDeterministic = true)
+this(function, dataType, children, inputTypes, udfName,
+  nullable = true, udfDeterministic = true, nullableTypes = Nil)
+  }
+
+  // Constructor from Spark 2.3
--- End diff --

By convention, everything under catalyst package is private, so 
compatibility is not a concern here.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212488387
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: 
ClassificationModel[Featur
 var outputData = dataset
 var numColsOutput = 0
 if (getRawPredictionCol != "") {
-  val predictRawUDF = udf { (features: Any) =>
--- End diff --

@skonto @lrytz this might be of interest. I don't think it's a Scala issue 
per se; I'm just checking whether that behavior change makes sense.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212445066
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 ---
@@ -57,8 +59,21 @@ case class ScalaUDF(
   children: Seq[Expression],
   inputTypes: Seq[DataType],
   udfName: Option[String]) = {
-this(
-  function, dataType, children, inputTypes, udfName, nullable = true, 
udfDeterministic = true)
+this(function, dataType, children, inputTypes, udfName,
+  nullable = true, udfDeterministic = true, nullableTypes = Nil)
+  }
+
+  // Constructor from Spark 2.3
--- End diff --

Yeah, messy. The constructor and class are public so felt it was worth 
erring on the side of compatibility.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212393458
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala 
---
@@ -85,9 +85,8 @@ private[recommendation] trait ALSModelParams extends 
Params with HasPredictionCo
* Attempts to safely cast a user/item id to an Int. Throws an exception 
if the value is
* out of integer range or contains a fractional part.
*/
-  protected[recommendation] val checkedCast = udf { (n: Any) =>
+  protected[recommendation] val checkedCast = udf { n: AnyRef =>
--- End diff --

Now that I dig further, this doesn't actually work. Using `Number` doesn't 
work either. Unfortunately, a number of UDFs in MLlib now fail the same way.
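
The cast semantics themselves are simple; a hedged sketch of the check, 
separate from the UDF plumbing (`checkedCastToInt` is illustrative, not the 
PR's code):

```
// Throws if the value is out of Int range or has a fractional part,
// mirroring the documented contract of checkedCast.
def checkedCastToInt(n: Double): Int = {
  require(n.isValidInt, s"value $n is out of Integer range or has a fractional part")
  n.toInt
}
```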


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212393299
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[DataType]],
+nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

I was hoping to minimize the change in the signature, but yeah it could be 
`Option[Seq[(DataType, Boolean)]]`?


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212393072
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[DataType]],
+nullableTypes: Option[Seq[Boolean]] = None) {
+
+  // Constructor from Spark 2.3.0 for binary compatibility
--- End diff --

It was just for MiMa, but I could also suppress the warning.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212393002
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -375,8 +375,11 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
 import org.apache.spark.sql.functions.{rand, udf}
 val c = Column(col)
 val r = rand(seed)
-val f = udf { (stratum: Any, x: Double) =>
-  x < fractions.getOrElse(stratum.asInstanceOf[T], 0.0)
+// Hack to get around the fact that type T is Any and we can't use a 
UDF whose arg
+// is Any. Convert everything to a string rep.
--- End diff --

You are right that the change I made here is not bulletproof. Unfortunately 
there are several more problems like this: anywhere there's a UDF on `Row`, it 
now fails, and the workarounds are ugly.

I like your idea; let me work on that, because the alternative I've been 
working on is driving me nuts.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212392363
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: 
ClassificationModel[Featur
 var outputData = dataset
 var numColsOutput = 0
 if (getRawPredictionCol != "") {
-  val predictRawUDF = udf { (features: Any) =>
--- End diff --

Thanks for your review, @cloud-fan; I could really use your input here. 
That's a good find. It _may_ be that we want to explicitly support UDFs where 
a schema isn't available (see below), but I agree I'd rather not; it gets 
kind of messy.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212354787
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -375,8 +375,11 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
 import org.apache.spark.sql.functions.{rand, udf}
 val c = Column(col)
 val r = rand(seed)
-val f = udf { (stratum: Any, x: Double) =>
-  x < fractions.getOrElse(stratum.asInstanceOf[T], 0.0)
+// Hack to get around the fact that type T is Any and we can't use a 
UDF whose arg
+// is Any. Convert everything to a string rep.
--- End diff --

I feel a udf using `Any` as the input type is rare but a valid use case. 
Sometimes users just want to accept any input type.

How about we create a few `udfInternal` methods that take `Any` as inputs? 
e.g.
```
def udfInternal[R: TypeTag](f: Function1[Any, R]): UserDefinedFunction = {
  val ScalaReflection.Schema(dataType, nullable) = ScalaReflection.schemaFor[R]
  // No input schema is recorded (None), so no input type checks are applied.
  val udf = UserDefinedFunction(f, dataType, None)
  if (nullable) udf else udf.asNonNullable()
}
```
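
A hypothetical call site, mirroring the `predictRaw` pattern from 
Classifier.scala (`predictRaw` and `FeaturesType` come from that class, not 
from this snippet):

```
// The input stays untyped; the function casts internally, and Spark applies
// no input type check because inputTypes is None.
val predictRawUDF = udfInternal { features: Any =>
  predictRaw(features.asInstanceOf[FeaturesType])
}
```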


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212353000
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[DataType]],
+nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

again, can we combine it with the data types?


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212352751
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[DataType]],
+nullableTypes: Option[Seq[Boolean]] = None) {
+
+  // Constructor from Spark 2.3.0 for binary compatibility
--- End diff --

why add this? `UserDefinedFunction` is public but its constructor is not.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212349025
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[DataType]],
+nullableTypes: Option[Seq[Boolean]] = None) {
+
+  // Constructor from Spark 2.3.0 for binary compatibility
--- End diff --

ditto


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212348950
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -375,8 +375,11 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
 import org.apache.spark.sql.functions.{rand, udf}
 val c = Column(col)
 val r = rand(seed)
-val f = udf { (stratum: Any, x: Double) =>
-  x < fractions.getOrElse(stratum.asInstanceOf[T], 0.0)
+// Hack to get around the fact that type T is Any and we can't use a 
UDF whose arg
+// is Any. Convert everything to a string rep.
--- End diff --

I'm not sure this is safe to do. Maybe `a == b` is different from 
`a.toString == b.toString`.
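
For instance, two values that are not equal can still have equal string forms:

```
val a: Any = 1     // Int
val b: Any = "1"   // String
a == b                   // false
a.toString == b.toString // true: the string-rep hack would conflate them
```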


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212347965
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 ---
@@ -57,8 +59,21 @@ case class ScalaUDF(
   children: Seq[Expression],
   inputTypes: Seq[DataType],
   udfName: Option[String]) = {
-this(
-  function, dataType, children, inputTypes, udfName, nullable = true, 
udfDeterministic = true)
+this(function, dataType, children, inputTypes, udfName,
+  nullable = true, udfDeterministic = true, nullableTypes = Nil)
+  }
+
+  // Constructor from Spark 2.3
--- End diff --

I'm not sure we want to keep doing this. cc @gatorsmile 


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212346762
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala 
---
@@ -85,9 +85,8 @@ private[recommendation] trait ALSModelParams extends 
Params with HasPredictionCo
* Attempts to safely cast a user/item id to an Int. Throws an exception 
if the value is
* out of integer range or contains a fractional part.
*/
-  protected[recommendation] val checkedCast = udf { (n: Any) =>
+  protected[recommendation] val checkedCast = udf { n: AnyRef =>
--- End diff --

I'm fine with this workaround, but I think we should create an expression 
for this `checkedCast`. The current UDF doesn't work well with inputs of 
different types.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r212345138
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: 
ClassificationModel[Featur
 var outputData = dataset
 var numColsOutput = 0
 if (getRawPredictionCol != "") {
-  val predictRawUDF = udf { (features: Any) =>
--- End diff --

I looked into this, and now I understand why it worked before.

Scala 2.11 can somehow generate a type tag for `Any`. Spark then derives the 
input schema from the type tag via 
`Try(ScalaReflection.schemaFor(typeTag[A1]).dataType :: Nil).toOption`; that 
fails for `Any`, so the input schema becomes `None` and no type check is 
applied later.

I think it makes more sense to specify the type and ask Spark to do the type 
check.
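
A minimal sketch of that mechanism (illustrative, not Spark's literal code):

```
import scala.util.Try

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.DataType

// schemaFor[Any] throws, so the Try collapses to None and the analyzer
// skips input type checks; for a concrete type it succeeds.
val forAny: Option[Seq[DataType]] =
  Try(ScalaReflection.schemaFor[Any].dataType :: Nil).toOption    // None
val forDouble: Option[Seq[DataType]] =
  Try(ScalaReflection.schemaFor[Double].dataType :: Nil).toOption // Some(List(DoubleType))
```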




---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-13 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r209713521
  
--- Diff: project/MimaExcludes.scala ---
@@ -36,6 +36,11 @@ object MimaExcludes {
 
   // Exclude rules for 2.4.x
   lazy val v24excludes = v23excludes ++ Seq(
+// [SPARK-25044] Address translation of LMF closure primitive args to 
Object in Scala 2.1
--- End diff --

Nit: Wrong Scala version


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r209426781
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -211,9 +211,7 @@ abstract class PredictionModel[FeaturesType, M <: 
PredictionModel[FeaturesType,
   }
 
   protected def transformImpl(dataset: Dataset[_]): DataFrame = {
-val predictUDF = udf { (features: Any) =>
--- End diff --

I'm really surprised that this worked before...


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r209136524
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -114,6 +114,7 @@ class UDFRegistration private[sql] (functionRegistry: 
FunctionRegistry) extends
   val types = (1 to x).foldRight("RT")((i, s) => {s"A$i, $s"})
   val typeTags = (1 to x).map(i => s"A$i: TypeTag").foldLeft("RT: 
TypeTag")(_ + ", " + _)
   val inputTypes = (1 to x).foldRight("Nil")((i, s) => 
{s"ScalaReflection.schemaFor[A$i].dataType :: $s"})
+  val nullableTypes = (1 to x).foldRight("Nil")((i, s) => 
{s"ScalaReflection.schemaFor[A$i].nullable :: $s"})
--- End diff --

Yeah that can be optimized. I'll fix the MiMa issue too by restoring a 
constructor.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r209135836
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -114,6 +114,7 @@ class UDFRegistration private[sql] (functionRegistry: 
FunctionRegistry) extends
   val types = (1 to x).foldRight("RT")((i, s) => {s"A$i, $s"})
   val typeTags = (1 to x).map(i => s"A$i: TypeTag").foldLeft("RT: 
TypeTag")(_ + ", " + _)
   val inputTypes = (1 to x).foldRight("Nil")((i, s) => 
{s"ScalaReflection.schemaFor[A$i].dataType :: $s"})
+  val nullableTypes = (1 to x).foldRight("Nil")((i, s) => 
{s"ScalaReflection.schemaFor[A$i].nullable :: $s"})
--- End diff --

Instead of keeping two parallel lists, shall we just keep a single 
`Seq[ScalaReflection.Schema]` or `Seq[(DataType, Boolean)]`?
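
For example, the codegen could emit one combined list (hypothetical output 
for a 2-arg udf):

```
// One Seq[ScalaReflection.Schema] carries both the DataType and the
// nullable flag per argument.
val inputSchemas: Option[Seq[ScalaReflection.Schema]] =
  Try(ScalaReflection.schemaFor[A1] :: ScalaReflection.schemaFor[A2] :: Nil).toOption
```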


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r209127319
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -2149,28 +2149,29 @@ class Analyzer(
 
   case p => p transformExpressionsUp {
 
-case udf @ ScalaUDF(func, _, inputs, _, _, _, _) =>
-  val parameterTypes = ScalaReflection.getParameterTypes(func)
-  assert(parameterTypes.length == inputs.length)
-
-  // TODO: skip null handling for not-nullable primitive inputs 
after we can completely
-  // trust the `nullable` information.
-  // (cls, expr) => cls.isPrimitive && expr.nullable
-  val needsNullCheck = (cls: Class[_], expr: Expression) =>
-cls.isPrimitive && !expr.isInstanceOf[KnownNotNull]
-  val inputsNullCheck = parameterTypes.zip(inputs)
-.filter { case (cls, expr) => needsNullCheck(cls, expr) }
-.map { case (_, expr) => IsNull(expr) }
-.reduceLeftOption[Expression]((e1, e2) => Or(e1, e2))
-  // Once we add an `If` check above the udf, it is safe to mark 
those checked inputs
-  // as not nullable (i.e., wrap them with `KnownNotNull`), 
because the null-returning
-  // branch of `If` will be called if any of these checked inputs 
is null. Thus we can
-  // prevent this rule from being applied repeatedly.
-  val newInputs = parameterTypes.zip(inputs).map{ case (cls, expr) 
=>
-if (needsNullCheck(cls, expr)) KnownNotNull(expr) else expr }
-  inputsNullCheck
-.map(If(_, Literal.create(null, udf.dataType), 
udf.copy(children = newInputs)))
-.getOrElse(udf)
+case udf@ScalaUDF(func, _, inputs, _, _, _, _, nullableTypes) =>
+  if (nullableTypes.isEmpty) {
--- End diff --

This is probably the weak point: unless there is nullability info, we don't 
do anything to the UDF plan. But that's probably wrong in some cases.


---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r209127367
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 ---
@@ -39,6 +39,7 @@ import org.apache.spark.sql.types.DataType
  * @param nullable  True if the UDF can return null value.
  * @param udfDeterministic  True if the UDF is deterministic. 
Deterministic UDF returns same result
  *  each time it is invoked with a particular 
input.
+ * @param nullableTypes which of the inputTypes are nullable (i.e. not 
primitive)
--- End diff --

The approach here is to capture, at registration time, whether the arg types 
are primitive or nullable. It's not a great way to record this, but it might 
be the least hacky option for now.
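
For reference, a few data points on what `ScalaReflection.schemaFor` reports 
(illustrative):

```
import org.apache.spark.sql.catalyst.ScalaReflection

ScalaReflection.schemaFor[Int].nullable               // false: JVM primitive
ScalaReflection.schemaFor[java.lang.Integer].nullable // true: boxed reference type
ScalaReflection.schemaFor[String].nullable            // true
```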


---
