[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r214567945

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +41,7 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[ScalaReflection.Schema]]) {
--- End diff --

(BTW it was PR https://github.com/apache/spark/pull/22259 that was merged)

We can add back accessors and constructors if that would make life easier for callers. But if this is protected, who are the callers of this code we're accommodating -- maybe some hacky but important integration? We'd have to rename `inputTypes` and then add back an accessor with the old type.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
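A minimal, self-contained sketch of the rename-plus-accessor idea discussed above, using stand-in types (`Schema`, `DataType`, and the field name `inputSchemas` are illustrative, not Spark's actual API):

```scala
// Stand-ins for Spark's DataType and ScalaReflection.Schema (illustrative only).
sealed trait DataType
case object IntegerType extends DataType
case class Schema(dataType: DataType, nullable: Boolean)

// Rename the case-class field, then add back an accessor with the old
// name and the old type so existing callers keep compiling.
case class UserDefinedFunction(f: AnyRef, inputSchemas: Option[Seq[Schema]]) {
  def inputTypes: Option[Seq[DataType]] = inputSchemas.map(_.map(_.dataType))
}

object Demo extends App {
  val udf = UserDefinedFunction((i: Int) => i, Some(Seq(Schema(IntegerType, nullable = false))))
  println(udf.inputTypes) // Some(List(IntegerType))
}
```

The accessor cannot share the name `inputTypes` with the field, since a case-class field already generates an accessor of that name with the new type, which is why the field itself has to be renamed first.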
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r214560630

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +41,7 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[ScalaReflection.Schema]]) {
--- End diff --

Ah, I see: it's a case class, so we would need to keep the `def inputTypes(): Option[Seq[DataType]]`.
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r214560519

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +41,7 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[ScalaReflection.Schema]]) {
--- End diff --

But the constructor is `protected[sql]`.
---

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r214560313

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +41,7 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[ScalaReflection.Schema]]) {
--- End diff --

This is a stable API. Are we able to make this change in 2.4 instead of 3.0?
---

Github user adriaanm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r213702881

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[Featur
     var outputData = dataset
     var numColsOutput = 0
     if (getRawPredictionCol != "") {
-      val predictRawUDF = udf { (features: Any) =>
--- End diff --

No problem! Happy to help with the 2.12 upgrade.
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r213696131

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[Featur
     var outputData = dataset
     var numColsOutput = 0
     if (getRawPredictionCol != "") {
-      val predictRawUDF = udf { (features: Any) =>
--- End diff --

I apologize, that's my mistake. In the end it isn't related to TypeTags for `Any`, and that is not a difference between 2.11 and 2.12. Thanks for your input; I think we are close.
---

Github user srowen closed the pull request at:

    https://github.com/apache/spark/pull/22063
---

Github user adriaanm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r213593596

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[Featur
     var outputData = dataset
     var numColsOutput = 0
     if (getRawPredictionCol != "") {
-      val predictRawUDF = udf { (features: Any) =>
--- End diff --

No idea, but in any case the new version seems nicer :-) Both 2.11 and 2.12 will happily generate a `typeTag` for `Any`, though, so that wouldn't immediately explain it. To see what was actually inferred, you could compile with `-Xprint:typer` (ideally after a full compile, and then just making this file recompile incrementally).
---

Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r213575267

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[Featur
     var outputData = dataset
     var numColsOutput = 0
     if (getRawPredictionCol != "") {
-      val predictRawUDF = udf { (features: Any) =>
--- End diff --

@adriaanm any more info?
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212800597

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3713,7 +3726,7 @@ object functions {
      | */
      |def udf[$typeTags](f: Function$x[$types]): UserDefinedFunction = {
      |  val ScalaReflection.Schema(dataType, nullable) = ScalaReflection.schemaFor[RT]
-     |  val inputTypes = Try($inputTypes).toOption
+     |  val inputTypes = Try($argSchema).toOption
--- End diff --

@cloud-fan might be worth another look now. After making this change, I realize that maybe the whole reason it started failing was that I had moved the schema inference outside the `Try()`. Now it's back inside, so maybe that makes the whole problem go back to being silent. Did you mean you preferred tackling the problem directly rather than suppressing the failure to infer a schema? I added `udfInternal` above for that, but maybe this isn't the best approach, as user UDFs could fail for the same reason. Maybe I need to back this whole thing out after all, now that I understand what's happening after your comments.
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212504250

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[DataType]],
+    nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

+1 to all your comments. I'm overhauling this whole PR and will force-push with a rebase once it seems to basically work.
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212499322

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[DataType]],
+    nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

Or `Option[Seq[ScalaReflection.Schema]]`.
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212499164

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -57,8 +59,21 @@ case class ScalaUDF(
     children: Seq[Expression],
     inputTypes: Seq[DataType],
     udfName: Option[String]) = {
-    this(
-      function, dataType, children, inputTypes, udfName, nullable = true, udfDeterministic = true)
+    this(function, dataType, children, inputTypes, udfName,
+      nullable = true, udfDeterministic = true, nullableTypes = Nil)
+  }
+
+  // Constructor from Spark 2.3
--- End diff --

By convention, everything under the catalyst package is private, so compatibility is not a concern here.
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212488387

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[Featur
     var outputData = dataset
     var numColsOutput = 0
     if (getRawPredictionCol != "") {
-      val predictRawUDF = udf { (features: Any) =>
--- End diff --

@skonto @lrytz this might be of interest. I don't think it's a Scala issue per se, but just checking whether that behavior change makes sense.
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212445066

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -57,8 +59,21 @@ case class ScalaUDF(
     children: Seq[Expression],
     inputTypes: Seq[DataType],
     udfName: Option[String]) = {
-    this(
-      function, dataType, children, inputTypes, udfName, nullable = true, udfDeterministic = true)
+    this(function, dataType, children, inputTypes, udfName,
+      nullable = true, udfDeterministic = true, nullableTypes = Nil)
+  }
+
+  // Constructor from Spark 2.3
--- End diff --

Yeah, messy. The constructor and class are public, so it felt worth erring on the side of compatibility.
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212393458

--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
@@ -85,9 +85,8 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
    * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
    * out of integer range or contains a fractional part.
    */
-  protected[recommendation] val checkedCast = udf { (n: Any) =>
+  protected[recommendation] val checkedCast = udf { n: AnyRef =>
--- End diff --

This doesn't even actually work, now that I dig further. Using `Number` doesn't work either. There are a number of UDFs in MLlib that fail now, unfortunately.
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212393299

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[DataType]],
+    nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

I was hoping to minimize the change in the signature, but yeah, it could be `Option[Seq[(DataType, Boolean)]]`?
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212393072

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[DataType]],
+    nullableTypes: Option[Seq[Boolean]] = None) {
+
+  // Constructor from Spark 2.3.0 for binary compatibility
--- End diff --

It was just for MiMa, but I could also suppress the warning.
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212393002

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -375,8 +375,11 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
     import org.apache.spark.sql.functions.{rand, udf}
     val c = Column(col)
     val r = rand(seed)
-    val f = udf { (stratum: Any, x: Double) =>
-      x < fractions.getOrElse(stratum.asInstanceOf[T], 0.0)
+    // Hack to get around the fact that type T is Any and we can't use a UDF whose arg
+    // is Any. Convert everything to a string rep.
--- End diff --

You are right that the change I made here is not bulletproof. Unfortunately there are several more problems like this: anywhere there's a UDF on `Row`, it now fails, and workarounds are ugly. I like your idea; let me work on that, because the alternative I've been working on is driving me nuts.
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212392363

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[Featur
     var outputData = dataset
     var numColsOutput = 0
     if (getRawPredictionCol != "") {
-      val predictRawUDF = udf { (features: Any) =>
--- End diff --

Thanks for your review @cloud-fan, I could really use your input here. That's a good find. It _may_ be that we want to explicitly support UDFs where a schema isn't available -- see below. But I agree I'd rather not; it gets kind of messy, though.
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212354787

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -375,8 +375,11 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
     import org.apache.spark.sql.functions.{rand, udf}
     val c = Column(col)
     val r = rand(seed)
-    val f = udf { (stratum: Any, x: Double) =>
-      x < fractions.getOrElse(stratum.asInstanceOf[T], 0.0)
+    // Hack to get around the fact that type T is Any and we can't use a UDF whose arg
+    // is Any. Convert everything to a string rep.
--- End diff --

I feel a UDF using `Any` as the input type is rare but a valid use case; sometimes they just want to accept any input type. How about we create a few `udfInternal` methods that take `Any` as inputs? e.g.

```
def udfInternal[R: TypeTag](f: Function1[Any, R]): UserDefinedFunction = {
  val ScalaReflection.Schema(dataType, nullable) = ScalaReflection.schemaFor[R]
  val udf = UserDefinedFunction(f, dataType, None)
  if (nullable) udf else udf.asNonNullable()
}
```
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212353000

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[DataType]],
+    nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

Again, can we combine it with the data types?
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212352751

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[DataType]],
+    nullableTypes: Option[Seq[Boolean]] = None) {
+
+  // Constructor from Spark 2.3.0 for binary compatibility
--- End diff --

Why add this? `UserDefinedFunction` is public, but its constructor is not.
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212349025

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[DataType]],
+    nullableTypes: Option[Seq[Boolean]] = None) {
+
+  // Constructor from Spark 2.3.0 for binary compatibility
--- End diff --

Ditto.
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212348950

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -375,8 +375,11 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
     import org.apache.spark.sql.functions.{rand, udf}
     val c = Column(col)
     val r = rand(seed)
-    val f = udf { (stratum: Any, x: Double) =>
-      x < fractions.getOrElse(stratum.asInstanceOf[T], 0.0)
+    // Hack to get around the fact that type T is Any and we can't use a UDF whose arg
+    // is Any. Convert everything to a string rep.
--- End diff --

I'm not sure this is safe to do. Maybe `a == b` is different from `a.toString == b.toString`.
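The concern above is easy to demonstrate in plain Scala, with no Spark involved: value equality and string-representation equality can disagree in both directions.

```scala
object Demo extends App {
  // Equal values whose string forms differ: scala.BigDecimal's == compares
  // numeric value, while toString preserves scale.
  val a: Any = BigDecimal("1.0")
  val b: Any = BigDecimal("1.00")
  println(a == b)                   // true
  println(a.toString == b.toString) // false: "1.0" vs "1.00"

  // Distinct values that share a string form, so string comparison would
  // wrongly treat strata of different types as the same stratum.
  val c: Any = 1
  val d: Any = "1"
  println(c == d)                   // false
  println(c.toString == d.toString) // true: both "1"
}
```

So comparing string representations of the stratum keys is neither sound nor complete as a stand-in for `==`.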
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212347965

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -57,8 +59,21 @@ case class ScalaUDF(
     children: Seq[Expression],
     inputTypes: Seq[DataType],
     udfName: Option[String]) = {
-    this(
-      function, dataType, children, inputTypes, udfName, nullable = true, udfDeterministic = true)
+    this(function, dataType, children, inputTypes, udfName,
+      nullable = true, udfDeterministic = true, nullableTypes = Nil)
+  }
+
+  // Constructor from Spark 2.3
--- End diff --

I'm not sure we want to keep doing this. cc @gatorsmile
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212346762

--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
@@ -85,9 +85,8 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
    * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
    * out of integer range or contains a fractional part.
    */
-  protected[recommendation] val checkedCast = udf { (n: Any) =>
+  protected[recommendation] val checkedCast = udf { n: AnyRef =>
--- End diff --

I'm fine with this workaround, but I think we should create an expression for this `checkedCast`. The current UDF doesn't work well with inputs of different types.
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r212345138

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -164,19 +164,15 @@ abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[Featur
     var outputData = dataset
     var numColsOutput = 0
     if (getRawPredictionCol != "") {
-      val predictRawUDF = udf { (features: Any) =>
--- End diff --

I looked into this, and now I understand why it worked before. Scala 2.11 can somehow generate a type tag for `Any`; Spark then gets the input schema from the type tag via `Try(ScalaReflection.schemaFor(typeTag[A1]).dataType :: Nil).toOption`. That fails, so the input schema becomes None and no type check is applied later. I think it makes more sense to specify the type and ask Spark to do the type check.
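The silent-failure path described above can be sketched outside Spark with only scala-reflect. The `schemaFor` below is a stand-in for `ScalaReflection.schemaFor` (the real one maps types to Catalyst `DataType`s and throws for unsupported types; this toy mapping is illustrative only):

```scala
import scala.reflect.runtime.universe._
import scala.util.Try

object Demo extends App {
  // Stand-in for Spark's ScalaReflection.schemaFor: resolves a schema for
  // supported types and throws for anything else, including Any.
  def schemaFor[T: TypeTag]: String = typeOf[T] match {
    case t if t =:= typeOf[Int]    => "IntegerType"
    case t if t =:= typeOf[String] => "StringType"
    case t => throw new UnsupportedOperationException(s"Schema for type $t is not supported")
  }

  // A TypeTag for Any itself resolves fine (in both 2.11 and 2.12)...
  println(typeTag[Any].tpe) // Any

  // ...but schema inference throws, and Try(...).toOption swallows the
  // failure into None, which is why no input type check was ever applied.
  println(Try(schemaFor[Int] :: Nil).toOption) // Some(List(IntegerType))
  println(Try(schemaFor[Any] :: Nil).toOption) // None
}
```

This is why moving the inference out of the `Try()` made the failure loud, and moving it back in makes it silent again.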
---

Github user TomaszGaweda commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r209713521

--- Diff: project/MimaExcludes.scala ---
@@ -36,6 +36,11 @@ object MimaExcludes {
   // Exclude rules for 2.4.x
   lazy val v24excludes = v23excludes ++ Seq(
+    // [SPARK-25044] Address translation of LMF closure primitive args to Object in Scala 2.1
--- End diff --

Nit: wrong Scala version.
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r209426781

--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -211,9 +211,7 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType,
   }

   protected def transformImpl(dataset: Dataset[_]): DataFrame = {
-    val predictUDF = udf { (features: Any) =>
--- End diff --

I'm really surprised that this worked before...
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r209136524

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -114,6 +114,7 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends
     val types = (1 to x).foldRight("RT")((i, s) => {s"A$i, $s"})
     val typeTags = (1 to x).map(i => s"A$i: TypeTag").foldLeft("RT: TypeTag")(_ + ", " + _)
     val inputTypes = (1 to x).foldRight("Nil")((i, s) => {s"ScalaReflection.schemaFor[A$i].dataType :: $s"})
+    val nullableTypes = (1 to x).foldRight("Nil")((i, s) => {s"ScalaReflection.schemaFor[A$i].nullable :: $s"})
--- End diff --

Yeah, that can be optimized. I'll fix the MiMa issue too by restoring a constructor.
---

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r209135836

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -114,6 +114,7 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends
     val types = (1 to x).foldRight("RT")((i, s) => {s"A$i, $s"})
     val typeTags = (1 to x).map(i => s"A$i: TypeTag").foldLeft("RT: TypeTag")(_ + ", " + _)
     val inputTypes = (1 to x).foldRight("Nil")((i, s) => {s"ScalaReflection.schemaFor[A$i].dataType :: $s"})
+    val nullableTypes = (1 to x).foldRight("Nil")((i, s) => {s"ScalaReflection.schemaFor[A$i].nullable :: $s"})
--- End diff --

Instead of having two lists, shall we just keep a `Seq[ScalaReflection.Schema]` or `Seq[(DataType, Boolean)]`?
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r209127319

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -2149,28 +2149,29 @@ class Analyzer(
       case p => p transformExpressionsUp {
-        case udf @ ScalaUDF(func, _, inputs, _, _, _, _) =>
-          val parameterTypes = ScalaReflection.getParameterTypes(func)
-          assert(parameterTypes.length == inputs.length)
-
-          // TODO: skip null handling for not-nullable primitive inputs after we can completely
-          // trust the `nullable` information.
-          // (cls, expr) => cls.isPrimitive && expr.nullable
-          val needsNullCheck = (cls: Class[_], expr: Expression) =>
-            cls.isPrimitive && !expr.isInstanceOf[KnownNotNull]
-          val inputsNullCheck = parameterTypes.zip(inputs)
-            .filter { case (cls, expr) => needsNullCheck(cls, expr) }
-            .map { case (_, expr) => IsNull(expr) }
-            .reduceLeftOption[Expression]((e1, e2) => Or(e1, e2))
-          // Once we add an `If` check above the udf, it is safe to mark those checked inputs
-          // as not nullable (i.e., wrap them with `KnownNotNull`), because the null-returning
-          // branch of `If` will be called if any of these checked inputs is null. Thus we can
-          // prevent this rule from being applied repeatedly.
-          val newInputs = parameterTypes.zip(inputs).map { case (cls, expr) =>
-            if (needsNullCheck(cls, expr)) KnownNotNull(expr) else expr }
-          inputsNullCheck
-            .map(If(_, Literal.create(null, udf.dataType), udf.copy(children = newInputs)))
-            .getOrElse(udf)
+        case udf @ ScalaUDF(func, _, inputs, _, _, _, _, nullableTypes) =>
+          if (nullableTypes.isEmpty) {
--- End diff --

This is probably the weak point: unless there is nullability info, don't do anything to the UDF plan. But that's probably wrong in some cases.
---

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22063#discussion_r209127367

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -39,6 +39,7 @@ import org.apache.spark.sql.types.DataType
  * @param nullable  True if the UDF can return null value.
  * @param udfDeterministic  True if the UDF is deterministic. Deterministic UDF returns same result
  *                          each time it is invoked with a particular input.
+ * @param nullableTypes  which of the inputTypes are nullable (i.e. not primitive)
--- End diff --

The approach here is to capture at registration time whether the arg types are primitive or nullable. Not a great way to record this, but it might be the least hacky option for now.