[GitHub] [spark] rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row

2019-10-21 Thread GitBox
rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#discussion_r337229142
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/SQLContextSuite.scala
 ##
 @@ -51,7 +51,7 @@ class SQLContextSuite extends SparkFunSuite with SharedSparkContext {
 
     // UDF should not be shared
     def myadd(a: Int, b: Int): Int = a + b
-    session1.udf.register[Int, Int, Int]("myadd", myadd)
+    session1.udf.register[Int, Int, Int]("myadd", myadd(_, _))
 
 Review comment:
   It would be nice to know what's going on here with Scala, but I don't think this is worth breaking existing code (even if we _can_ for 3.0). Unless we find out what the underlying problem is, I think we should use `registerAggregator`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row

2019-10-18 Thread GitBox
rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#discussion_r336709848
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/SQLContextSuite.scala
 ##
 @@ -51,7 +51,7 @@ class SQLContextSuite extends SparkFunSuite with SharedSparkContext {
 
     // UDF should not be shared
     def myadd(a: Int, b: Int): Int = a + b
-    session1.udf.register[Int, Int, Int]("myadd", myadd)
+    session1.udf.register[Int, Int, Int]("myadd", myadd(_, _))
 
 Review comment:
   So is the problem that this has 3 type parameters, so the method can't be automatically converted to `Function2[R, A1, A2]`? That doesn't make much sense to me since a method clearly isn't an `Aggregator[IN, BUF, OUT]`.
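The interaction being questioned can be reproduced in a self-contained sketch. Note that `Agg` and the two `register` overloads below are simplified stand-ins for Spark's `Aggregator` and `UDFRegistration.register`, not the real signatures:

```scala
// Simplified stand-ins, for illustration only: with both overloads in scope,
// passing the bare method name `myadd` asks the compiler to eta-expand the
// method while simultaneously resolving the overload. Writing the expansion
// explicitly, as the test change does, sidesteps that question.
object EtaExpansionSketch {
  trait Agg[IN, BUF, OUT]

  object udf {
    // Overload taking a plain function value
    def register[RT, A1, A2](name: String, f: (A1, A2) => RT): String = "function"
    // Overload taking an aggregator object
    def register[IN, BUF, OUT](name: String, agg: Agg[IN, BUF, OUT]): String = "aggregator"
  }

  def myadd(a: Int, b: Int): Int = a + b

  // `myadd(_, _)` is typed as (Int, Int) => Int from myadd's own parameter
  // types, independent of the expected type, so it matches the first overload.
  val chosen: String = udf.register[Int, Int, Int]("myadd", myadd(_, _))
}
```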





[GitHub] [spark] rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row

2019-10-18 Thread GitBox
rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#discussion_r336707125
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/SQLContextSuite.scala
 ##
 @@ -51,7 +51,7 @@ class SQLContextSuite extends SparkFunSuite with SharedSparkContext {
 
     // UDF should not be shared
     def myadd(a: Int, b: Int): Int = a + b
-    session1.udf.register[Int, Int, Int]("myadd", myadd)
+    session1.udf.register[Int, Int, Int]("myadd", myadd(_, _))
 
 Review comment:
   Not sure I follow why it is confusing Scala. Why would adding a new method that accepts Aggregator cause another overload to stop working? Aggregator isn't a Function2.





[GitHub] [spark] rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row

2019-10-18 Thread GitBox
rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#discussion_r336705027
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/SQLContextSuite.scala
 ##
 @@ -51,7 +51,7 @@ class SQLContextSuite extends SparkFunSuite with SharedSparkContext {
 
     // UDF should not be shared
     def myadd(a: Int, b: Int): Int = a + b
-    session1.udf.register[Int, Int, Int]("myadd", myadd)
+    session1.udf.register[Int, Int, Int]("myadd", myadd(_, _))
 
 Review comment:
   Why was this change needed? Is this a breaking change for existing code?





[GitHub] [spark] rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row

2019-10-18 Thread GitBox
rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#discussion_r336669002
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala
 ##
 @@ -450,3 +452,165 @@ case class ScalaUDAF(
 
   override def nodeName: String = udaf.getClass.getSimpleName
 }
+
+/**
+ * The internal wrapper used to hook a [[UserDefinedImperativeAggregator]] `udia` in the
+ * internal aggregation code path.
+ */
+case class ScalaUDIA[T](
+    children: Seq[Expression],
+    udia: UserDefinedImperativeAggregator[T],
+    mutableAggBufferOffset: Int = 0,
+    inputAggBufferOffset: Int = 0)
+  extends TypedImperativeAggregate[T]
+  with NonSQLExpression
+  with UserDefinedExpression
+  with ImplicitCastInputTypes
+  with Logging {
+
+  def dataType: DataType = udia.resultType
+
+  val inputTypes: Seq[DataType] = udia.inputSchema.map(_.dataType)
+
+  def nullable: Boolean = true
 
 Review comment:
   I think this is going to depend on the function implementation and should be delegated.
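Delegation along the lines suggested above might look like this sketch. The `resultNullable` method name is hypothetical, introduced here for illustration only; it is not part of the PR's API:

```scala
// Sketch of delegating `nullable` to the aggregator implementation instead of
// hard-coding `true` in the wrapper. `resultNullable` is a hypothetical name.
trait ImperativeAggSketch[T] {
  def resultNullable: Boolean = true // conservative default for implementors
}

case class UdiaWrapperSketch[T](udia: ImperativeAggSketch[T]) {
  // Delegate, so a function that never returns null can say so.
  def nullable: Boolean = udia.resultNullable
}

// An aggregator whose result is always defined can opt out:
object CountLike extends ImperativeAggSketch[Long] {
  override def resultNullable: Boolean = false
}
```

With this shape, existing aggregators keep the old conservative behavior via the default, while implementations that can guarantee a non-null result let the planner know.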





[GitHub] [spark] rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row

2019-10-18 Thread GitBox
rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#discussion_r336668483
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala
 ##
 @@ -450,3 +452,165 @@ case class ScalaUDAF(
 
   override def nodeName: String = udaf.getClass.getSimpleName
 }
+
+/**
+ * The internal wrapper used to hook a [[UserDefinedImperativeAggregator]] `udia` in the
+ * internal aggregation code path.
+ */
+case class ScalaUDIA[T](
+    children: Seq[Expression],
+    udia: UserDefinedImperativeAggregator[T],
+    mutableAggBufferOffset: Int = 0,
+    inputAggBufferOffset: Int = 0)
+  extends TypedImperativeAggregate[T]
+  with NonSQLExpression
+  with UserDefinedExpression
+  with ImplicitCastInputTypes
+  with Logging {
+
+  def dataType: DataType = udia.resultType
+
+  val inputTypes: Seq[DataType] = udia.inputSchema.map(_.dataType)
 
 Review comment:
   As you said, I think this is important for performance. It also affects the output type. One valid way to handle null input values is to make the output null. For example, `avg(1, null, 3)` is `null`. If input values are guaranteed to be non-null, then functions like this would guarantee a non-null result and wouldn't need to check for null values.
   
   I think it's important that this be a full struct type, not just the individual data types.
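Using a full struct type as suggested could be sketched with Spark's existing `StructType`/`StructField` API. The field name `"value"` below is illustrative, not from the PR:

```scala
import org.apache.spark.sql.types._

// Sketch: declare the input schema as a full StructType so each field carries
// its own nullability flag, instead of exposing only a bare Seq[DataType].
object InputSchemaSketch {
  val inputSchema: StructType = StructType(Seq(
    StructField("value", DoubleType, nullable = false)
  ))

  // The wrapper can still derive the flat type list it uses today...
  val inputTypes: Seq[DataType] = inputSchema.map(_.dataType)
  // ...while also exposing per-field nullability to the planner:
  val inputNullables: Seq[Boolean] = inputSchema.map(_.nullable)
}
```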





[GitHub] [spark] rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row

2019-09-30 Thread GitBox
rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#discussion_r329795140
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala
 ##
 @@ -450,3 +452,165 @@ case class ScalaUDAF(
 
   override def nodeName: String = udaf.getClass.getSimpleName
 }
+
+/**
+ * The internal wrapper used to hook a [[UserDefinedImperativeAggregator]] `udia` in the
+ * internal aggregation code path.
+ */
+case class ScalaUDIA[T](
+    children: Seq[Expression],
+    udia: UserDefinedImperativeAggregator[T],
+    mutableAggBufferOffset: Int = 0,
+    inputAggBufferOffset: Int = 0)
+  extends TypedImperativeAggregate[T]
+  with NonSQLExpression
+  with UserDefinedExpression
+  with ImplicitCastInputTypes
+  with Logging {
+
+  def dataType: DataType = udia.resultType
+
+  val inputTypes: Seq[DataType] = udia.inputSchema.map(_.dataType)
 
 Review comment:
   Are inputs assumed to be nullable?





[GitHub] [spark] rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row

2019-09-30 Thread GitBox
rdblue commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#discussion_r329795215
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala
 ##
 @@ -450,3 +452,165 @@ case class ScalaUDAF(
 
   override def nodeName: String = udaf.getClass.getSimpleName
 }
+
+/**
+ * The internal wrapper used to hook a [[UserDefinedImperativeAggregator]] `udia` in the
+ * internal aggregation code path.
+ */
+case class ScalaUDIA[T](
+    children: Seq[Expression],
+    udia: UserDefinedImperativeAggregator[T],
+    mutableAggBufferOffset: Int = 0,
+    inputAggBufferOffset: Int = 0)
+  extends TypedImperativeAggregate[T]
+  with NonSQLExpression
+  with UserDefinedExpression
+  with ImplicitCastInputTypes
+  with Logging {
+
+  def dataType: DataType = udia.resultType
+
+  val inputTypes: Seq[DataType] = udia.inputSchema.map(_.dataType)
+
+  def nullable: Boolean = true
 
 Review comment:
   Should this also be delegated to the UDIA?

