[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/9766 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r83159493 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -414,6 +418,84 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends // /** + * Register a Java UDF class using reflection, for use from pyspark + * + * @param name udf name + * @param className fully qualified class name of udf + * @param returnDataType return type of udf. If it is null, spark would try to infer + *via reflection. + */ + def registerJava(name: String, className: String, returnDataType: DataType): Unit = { + +try { + val clazz = Utils.classForName(className) + val udfInterfaces = clazz.getGenericInterfaces +.filter(_.isInstanceOf[ParameterizedType]) +.map(_.asInstanceOf[ParameterizedType]) +.filter(e => e.getRawType.isInstanceOf[Class[_]] && e.getRawType.asInstanceOf[Class[_]].getCanonicalName.startsWith("org.apache.spark.sql.api.java.UDF")) + if (udfInterfaces.length == 0) { +throw new IOException(s"UDF class ${className} doesn't implement any UDF interface") + } else if (udfInterfaces.length > 1) { +throw new IOException(s"It is invalid to implement multiple UDF interfaces, UDF class ${className}") + } else { +try { + val udf = clazz.newInstance() + val udfReturnType = udfInterfaces(0).getActualTypeArguments.last + var returnType = returnDataType + if (returnType == null) { +if (udfReturnType.isInstanceOf[Class[_]]) { + returnType = udfReturnType.asInstanceOf[Class[_]].getCanonicalName match { --- End diff -- Thanks for the hintï¼ fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r83159473 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -414,6 +418,84 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends // /** + * Register a Java UDF class using reflection, for use from pyspark + * + * @param name udf name + * @param className fully qualified class name of udf + * @param returnDataType return type of udf. If it is null, spark would try to infer + *via reflection. + */ + def registerJava(name: String, className: String, returnDataType: DataType): Unit = { --- End diff -- Fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r83159425 --- Diff: python/pyspark/sql/context.py --- @@ -202,6 +202,32 @@ def registerFunction(self, name, f, returnType=StringType()): """ self.sparkSession.catalog.registerFunction(name, f, returnType) +@ignore_unicode_prefix +@since(2.1) +def registerJavaFunction(self, name, javaClassName, returnType=None): +"""Register a java UDF so it can be used in SQL statements. + +In addition to a name and the function itself, the return type can be optionally specified. +When the return type is not given it would infer the returnType via reflection. --- End diff -- Fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r83134970 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -414,6 +418,84 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends // /** --- End diff -- I can turn it on, but it would make the function less readable, especially for the following statements where it beyond line length limitation. ``` case 14 => register(name, udf.asInstanceOf[UDF13[_, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r83057793 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -414,6 +418,84 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends // /** + * Register a Java UDF class using reflection, for use from pyspark + * + * @param name udf name + * @param className fully qualified class name of udf + * @param returnDataType return type of udf. If it is null, spark would try to infer + *via reflection. + */ + def registerJava(name: String, className: String, returnDataType: DataType): Unit = { + +try { + val clazz = Utils.classForName(className) + val udfInterfaces = clazz.getGenericInterfaces +.filter(_.isInstanceOf[ParameterizedType]) +.map(_.asInstanceOf[ParameterizedType]) +.filter(e => e.getRawType.isInstanceOf[Class[_]] && e.getRawType.asInstanceOf[Class[_]].getCanonicalName.startsWith("org.apache.spark.sql.api.java.UDF")) + if (udfInterfaces.length == 0) { +throw new IOException(s"UDF class ${className} doesn't implement any UDF interface") + } else if (udfInterfaces.length > 1) { +throw new IOException(s"It is invalid to implement multiple UDF interfaces, UDF class ${className}") + } else { +try { + val udf = clazz.newInstance() + val udfReturnType = udfInterfaces(0).getActualTypeArguments.last + var returnType = returnDataType + if (returnType == null) { +if (udfReturnType.isInstanceOf[Class[_]]) { + returnType = udfReturnType.asInstanceOf[Class[_]].getCanonicalName match { --- End diff -- Can we use [`JavaTypeInference`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala) here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r83056944 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -414,6 +418,84 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends // /** --- End diff -- It would be nice to turn style back on here since most of this function is not auto generated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r83056792 --- Diff: sql/core/src/main/java/org/apache/spark/sql/test/JavaStringLength.java --- @@ -0,0 +1,30 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.test; + +import org.apache.spark.sql.api.java.UDF1; + +/** + * It is used for register Java UDF from PySpark + */ +public class JavaStringLength implements UDF1 { --- End diff -- Could this be moved to `src/test`? It would be better to not distribute it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r83056607 --- Diff: python/pyspark/sql/context.py --- @@ -202,6 +202,32 @@ def registerFunction(self, name, f, returnType=StringType()): """ self.sparkSession.catalog.registerFunction(name, f, returnType) +@ignore_unicode_prefix +@since(2.1) +def registerJavaFunction(self, name, javaClassName, returnType=None): +"""Register a java UDF so it can be used in SQL statements. + +In addition to a name and the function itself, the return type can be optionally specified. +When the return type is not given it would infer the returnType via reflection. --- End diff -- nit: its a little odd to mix `return type` with `returnType`. Perhaps, "When the return type is not specified we attempt to infer it using reflection" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r83057132 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -414,6 +418,84 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends // /** + * Register a Java UDF class using reflection, for use from pyspark + * + * @param name udf name + * @param className fully qualified class name of udf + * @param returnDataType return type of udf. If it is null, spark would try to infer + *via reflection. + */ + def registerJava(name: String, className: String, returnDataType: DataType): Unit = { --- End diff -- Is it possible to make this non-public? I believe we do this in other cases for code only called from python. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82969210 --- Diff: python/pyspark/sql/context.py --- @@ -202,6 +202,10 @@ def registerFunction(self, name, f, returnType=StringType()): """ self.sparkSession.catalog.registerFunction(name, f, returnType) +def registerJavaFunction(self, name, javaClassName, returnType): --- End diff -- Fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82969338 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -17,9 +17,15 @@ package org.apache.spark.sql + +import java.io.IOException +import java.util.{List => JList, Map => JMap} + import scala.reflect.runtime.universe.TypeTag import scala.util.Try +import sun.reflect.generics.reflectiveObjects.ParameterizedTypeImpl --- End diff -- Fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82969355 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -412,6 +419,63 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends // /** + * Register a Java UDF class + * @param name + * @param className + * @param returnType + */ + def registerJava(name: String, className: String, returnType: DataType): Unit = { + +try { + // scalastyle:off classforname --- End diff -- Fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82876529 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -412,6 +419,63 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends // /** + * Register a Java UDF class + * @param name + * @param className + * @param returnType + */ + def registerJava(name: String, className: String, returnType: DataType): Unit = { --- End diff -- +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82876442 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -412,6 +419,63 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends // /** + * Register a Java UDF class + * @param name + * @param className + * @param returnType + */ + def registerJava(name: String, className: String, returnType: DataType): Unit = { + +try { + // scalastyle:off classforname --- End diff -- This style rule is here to prevent misuse. Is there a reason we aren't using our utility functions? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82876419 --- Diff: python/pyspark/sql/context.py --- @@ -202,6 +202,26 @@ def registerFunction(self, name, f, returnType=StringType()): """ self.sparkSession.catalog.registerFunction(name, f, returnType) +@ignore_unicode_prefix +@since(2.1) +def registerJavaFunction(self, name, javaClassName, returnType=StringType()): +"""Register a java UDF so it can be used in SQL statements. + +In addition to a name and the function itself, the return type can be optionally specified. +When the return type is not given it default to a string and conversion will automatically --- End diff -- Where does this conversion happen? Are we sure that it works (given there are no tests that I see). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82876509 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -17,9 +17,15 @@ package org.apache.spark.sql + +import java.io.IOException +import java.util.{List => JList, Map => JMap} + import scala.reflect.runtime.universe.TypeTag import scala.util.Try +import sun.reflect.generics.reflectiveObjects.ParameterizedTypeImpl --- End diff -- Is this JVM specific? What is this being used for? Is there another way? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82450939 --- Diff: python/pyspark/sql/context.py --- @@ -202,6 +202,10 @@ def registerFunction(self, name, f, returnType=StringType()): """ self.sparkSession.catalog.registerFunction(name, f, returnType) +def registerJavaFunction(self, name, javaClassName, returnType): --- End diff -- +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r81877809 --- Diff: python/pyspark/sql/context.py --- @@ -202,6 +202,10 @@ def registerFunction(self, name, f, returnType=StringType()): """ self.sparkSession.catalog.registerFunction(name, f, returnType) +def registerJavaFunction(self, name, javaClassName, returnType): --- End diff -- It would be good to have tests that call this from the Python side, also it probably needs a docstring and since annotation as well (see registerFunction since you take similar params). It might also make sense to have the same default return type of StringType as the current registerFunction but that is more a matter of taste. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r73413872 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -412,6 +419,63 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends // /** + * Register a Java UDF class + * @param name + * @param className + * @param returnType + */ + def registerJava(name: String, className: String, returnType: DataType): Unit = { --- End diff -- I don't know if we want to expose this API for general use - maybe make it private so that it can only be called from Python? And maybe update the scaladoc to something like "Register a Java UDF class using reflection - for use from Python". --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org