[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/20223#discussion_r160991236 --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java --- @@ -71,15 +71,16 @@ public void stop() { @Override public synchronized void disconnect() { if (!disposed) { --- End diff -- `if(!isDisposed())` ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/20223#discussion_r160989352 --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java --- @@ -91,10 +92,15 @@ LauncherConnection getConnection() { return connection; } - boolean isDisposed() { + synchronized boolean isDisposed() { --- End diff -- why do we need `synchronized` here?
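The `synchronized` question above comes down to visibility: if one thread writes a shared flag and another reads it without holding the same lock, the reader may see a stale value. Below is a minimal Python sketch of that pattern (class and method names are illustrative, not Spark's actual launcher API), using `threading.Lock` where the Java code uses `synchronized`:

```python
import threading

class AppHandle:
    """Sketch of a handle whose 'disposed' flag is read and written
    under the same lock, mirroring the synchronized-accessor pattern
    discussed in the review (names are illustrative, not Spark's API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._disposed = False

    def is_disposed(self):
        # Reading under the lock guarantees this thread observes the
        # latest write from any other thread, not a stale value.
        with self._lock:
            return self._disposed

    def disconnect(self):
        # Writing under the same lock makes the check-then-set atomic,
        # so two concurrent disconnects cannot both run the cleanup.
        with self._lock:
            if not self._disposed:
                self._disposed = True
                # ... close the underlying connection here ...

handle = AppHandle()
t = threading.Thread(target=handle.disconnect)
t.start()
t.join()
print(handle.is_disposed())  # True once disconnect has run
```

This is also why the companion comment suggests `if (!isDisposed())`: routing every read through the locked accessor keeps all accesses to the flag under one lock.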
[GitHub] spark pull request #20226: [SPARK-23034][SQL][UI] Display tablename for `Hiv...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20226#discussion_r161006284 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala --- @@ -62,6 +62,8 @@ case class HiveTableScanExec( override def conf: SQLConf = sparkSession.sessionState.conf + override def nodeName: String = s"${super.nodeName} (${relation.tableMeta.qualifiedName})" --- End diff -- Our `DataSourceScanExec` uses `unquotedString` in nodeName. We need to make these leaf nodes consistent. Could you check all the other `LeafExecNode`s?
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/20203 Thanks for working on this; I'm going to try this out and do a further review. Did you test for application failures and on the history server? cc @squito since he had some comments on the JIRA.
[GitHub] spark pull request #20227: [SPARK-23035][SQL] Fix warning: TEMPORARY TABLE ....
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20227#discussion_r161003528 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala --- @@ -33,6 +33,9 @@ class TableAlreadyExistsException(db: String, table: String) class TempTableAlreadyExistsException(table: String) extends AnalysisException(s"Temporary table '$table' already exists") +class TempViewAlreadyExistsException(table: String) + extends AnalysisException(s"Temporary view '$table' already exists") --- End diff -- We do not want to introduce a new exception type. In contrast, we planned to remove all these exception types because PySpark might output a confusing error message.
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/20206 cc @cloud-fan @gengliangwang for review
[GitHub] spark issue #20228: [SPARK-23036][SQL] Add withGlobalTempView for testing an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20228 **[Test build #85976 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85976/testReport)** for PR 20228 at commit [`fffd109`](https://github.com/apache/spark/commit/fffd109e8c084f9a4d63840bf761364f1ede5dc9).
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20237 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85974/ Test PASSed.
[GitHub] spark pull request #20227: [SPARK-23035][SQL] Fix warning: TEMPORARY TABLE ....
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20227#discussion_r161003157 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala --- @@ -33,6 +33,9 @@ class TableAlreadyExistsException(db: String, table: String) class TempTableAlreadyExistsException(table: String) extends AnalysisException(s"Temporary table '$table' already exists") --- End diff -- You just need to change `Temporary table` to `Temporary view`.
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20237 Merged build finished. Test PASSed.
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20237 **[Test build #85974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85974/testReport)** for PR 20237 at commit [`d2cfed3`](https://github.com/apache/spark/commit/d2cfed308d343fb55c5fd7c0d30bcbb987948632). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20228: [SPARK-23036][SQL] Add withGlobalTempView for testing an...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20228 ok to test
[GitHub] spark issue #20228: [SPARK-23036][SQL] Add withGlobalTempView for testing an...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20228 @xubo245 Is this only for tests? If so, please add the [TEST] tag to your title as well.
[GitHub] spark pull request #20194: [SPARK-22999][SQL]'show databases like command' c...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20194#discussion_r161000312 --- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 --- @@ -141,7 +141,7 @@ statement (LIKE? pattern=STRING)? #showTables | SHOW TABLE EXTENDED ((FROM | IN) db=identifier)? LIKE pattern=STRING partitionSpec? #showTable -| SHOW DATABASES (LIKE pattern=STRING)? #showDatabases +| SHOW DATABASES (LIKE? pattern=STRING)? #showDatabases --- End diff -- Impala is the only reference I can find: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/impala_show.html To be consistent with the other SHOW commands, shall we make LIKE optional?
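The grammar change above only affects parsing; the pattern itself is matched against database names afterwards. As a rough sketch of SHOW-style pattern semantics (an assumption based on Spark's documented SHOW TABLES behavior: `*` matches any character sequence, `|` separates alternatives, matching is case-insensitive), the filtering step looks something like:

```python
import re

def filter_by_pattern(names, pattern):
    # Translate each '|'-separated alternative into a regex:
    # escape everything, then turn the escaped '*' back into '.*'.
    regexes = [re.escape(p).replace(r'\*', '.*')
               for p in pattern.split('|')]
    combined = re.compile('(?:' + '|'.join(regexes) + ')$', re.IGNORECASE)
    return [n for n in names if combined.match(n)]

dbs = ['default', 'sales_db', 'sales_archive', 'hr']
print(filter_by_pattern(dbs, 'sales*'))      # ['sales_db', 'sales_archive']
print(filter_by_pattern(dbs, 'default|hr'))  # ['default', 'hr']
```

Whether `LIKE` is written or not, the same pattern string would flow into this filtering step, which is why making the keyword optional is purely a syntax-consistency question.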
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20219 `NullType` is not well supported by most data sources. We did not mention it in our doc: https://spark.apache.org/docs/latest/sql-programming-guide.html cc @cloud-fan @marmbrus @rxin @sameeragarwal Any comments about this support?
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20217 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85971/ Test PASSed.
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20217 Merged build finished. Test PASSed.
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20217 **[Test build #85971 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85971/testReport)** for PR 20217 at commit [`83776d6`](https://github.com/apache/spark/commit/83776d643c92a9354605320c20313a0380048388). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20237 @HyukjinKwon Thanks! I think this is good.
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20189 Merged build finished. Test FAILed.
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20189 @zsxwing the test you suggested fails on Jenkins while it passes locally. I am not sure about the reason, since I cannot reproduce it. I hit the same test error in my previous attempts; back then the problem was that running other test cases before this one triggered the failure. Could you please advise what to do? Should I revert to my previous test?
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20219 **[Test build #85975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85975/testReport)** for PR 20219 at commit [`27f0e14`](https://github.com/apache/spark/commit/27f0e143021a880787134cce7e672709449c828f).
[GitHub] spark issue #20233: [SPARK-23043][BUILD] Upgrade json4s to 3.5.3
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20233 **[Test build #4035 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4035/testReport)** for PR 20233 at commit [`e9cf606`](https://github.com/apache/spark/commit/e9cf6061a883abbee77ce50f5aa15416f7b86028).
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20189 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85967/ Test FAILed.
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20189 **[Test build #85967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85967/testReport)** for PR 20189 at commit [`c67e07b`](https://github.com/apache/spark/commit/c67e07b66726d57c2b7a3462b226439518d9ebed). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20237 **[Test build #85974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85974/testReport)** for PR 20237 at commit [`d2cfed3`](https://github.com/apache/spark/commit/d2cfed308d343fb55c5fd7c0d30bcbb987948632).
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20237 Hey @gatorsmile, @ueshin, @BryanCutler and @icexelloss. Let's fix this by clarifying it to avoid potential confusion for now and clear up SPARK-22216's subtasks.
[GitHub] spark pull request #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/20237 [SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch within scalar Pandas UDF

## What changes were proposed in this pull request?

This PR proposes to add a note saying that the length of a scalar Pandas UDF's `Series` is not that of the whole input column, but of each batch. This is fine for a group map UDF, because its usage differs from our typical UDFs, but scalar UDFs might cause confusion when compared with normal UDFs. For example, consider:

```python
from pyspark.sql.functions import pandas_udf, col, lit
df = spark.range(1)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
```
```
+----------+
|(text, id)|
+----------+
|         1|
+----------+
```
```python
from pyspark.sql.functions import udf, col, lit
df = spark.range(1)
f = udf(lambda x, y: len(x) + y, "long")
df.select(f(lit('text'), col('id'))).show()
```
```
+----------+
|(text, id)|
+----------+
|         4|
+----------+
```

## How was this patch tested?

Manually built the doc and checked the output.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-22980

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20237.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20237

commit d2cfed308d343fb55c5fd7c0d30bcbb987948632
Author: hyukjinkwon
Date: 2018-01-11T15:31:05Z

    Clarify the length of each series is of each batch within scalar Pandas UDF
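The difference between the two results can be sketched without Spark at all. In a scalar Pandas UDF the function receives a whole batch of values at once, so `len(x)` is the batch length; in an ordinary UDF it receives one value per row. A plain-Python sketch (lists stand in for `pd.Series`; function names are illustrative):

```python
def row_udf(xs, ys):
    # Ordinary UDF semantics: the lambda is applied once per row,
    # so len(x) is the length of each individual string value.
    return [len(x) + y for x, y in zip(xs, ys)]

def pandas_scalar_udf(xs, ys):
    # Scalar Pandas UDF semantics: len() sees the whole batch, and
    # the scalar result broadcasts across the batch.
    n = len(xs)
    return [n + y for y in ys]

xs, ys = ['text'], [0]            # one row: the string 'text', id 0
print(row_udf(xs, ys))            # [4]  -> len('text') + 0
print(pandas_scalar_udf(xs, ys))  # [1]  -> batch length 1 + 0
```

This reproduces the 4-versus-1 discrepancy in the PR description: with a single-row batch, `len(x)` is 1 for the Pandas UDF but 4 (the length of `'text'`) for the row-at-a-time UDF.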
[GitHub] spark pull request #20219: [SPARK-23025][SQL] Support Null type in scala ref...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20219#discussion_r160990065 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala --- @@ -356,4 +356,13 @@ class ScalaReflectionSuite extends SparkFunSuite { assert(deserializerFor[Int].isInstanceOf[AssertNotNull]) assert(!deserializerFor[String].isInstanceOf[AssertNotNull]) } + + test("SPARK-23025: schemaFor shuold support Null type") { --- End diff -- should?
[GitHub] spark issue #20236: [SPARK-23044] Error handling for jira assignment
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20236 **[Test build #85973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85973/testReport)** for PR 20236 at commit [`8c4cc61`](https://github.com/apache/spark/commit/8c4cc61c2a06b310480fafb3b28067a6f961816a).
[GitHub] spark issue #20236: [SPARK-23044] Error handling for jira assignment
Github user squito commented on the issue: https://github.com/apache/spark/pull/20236 @jerryshao this should fix it, but I don't have anything to merge to test this out -- I would appreciate it if someone could try it before we merge.
[GitHub] spark pull request #20236: [SPARK-23044] Error handling for jira assignment
GitHub user squito opened a pull request: https://github.com/apache/spark/pull/20236 [SPARK-23044] Error handling for jira assignment

## What changes were proposed in this pull request?

In case the selected user isn't a contributor yet, or any other unexpected error occurs, just don't assign the JIRA.

## How was this patch tested?

Couldn't really test the error case; just some testing of similar-ish code in the Python shell. Haven't run a merge yet.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark SPARK-23044

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20236.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20236

commit 8c4cc61c2a06b310480fafb3b28067a6f961816a
Author: Imran Rashid
Date: 2018-01-11T15:42:16Z

    [SPARK-23044] Error handling for jira assignment
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20214 **[Test build #85972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85972/testReport)** for PR 20214 at commit [`66b06c3`](https://github.com/apache/spark/commit/66b06c37bd2a9ed8461f9572eb3f2081853a8d73).
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20217 **[Test build #85971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85971/testReport)** for PR 20217 at commit [`83776d6`](https://github.com/apache/spark/commit/83776d643c92a9354605320c20313a0380048388).
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20232 Merged build finished. Test FAILed.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160985677 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- After checking against our Scala API, I found that the name is not changed there. Let me update it.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160985290 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or --- End diff -- SGTM.
[GitHub] spark pull request #20220: Update PageRank.scala
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20220
[GitHub] spark pull request #20203: [SPARK-22577] [core] executor page blacklist stat...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/20203#discussion_r160985168 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -223,6 +228,15 @@ private[spark] class AppStatusListener( updateNodeBlackList(event.hostId, false) } + def updateBlackListStatusForStage(executorId: String, stageId: Int, stageAttemptId: Int): Unit = { +Option(liveStages.get((stageId, stageAttemptId))).foreach { stage => + val now = System.nanoTime() + val esummary = stage.executorSummary(executorId) + esummary.isBlacklisted = true + maybeUpdate(esummary, now) +} + } + --- End diff -- @vanzin you are more familiar with the new history server. I am wondering why updateBlacklistStatus is only done with liveUpdate? Doesn't that mean it won't show up in the history server for a finished app?
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160984479 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or --- End diff -- We can do it as a separate PR.
[GitHub] spark issue #20220: Update PageRank.scala
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20220 Merged to master
[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20080 Ping @wangyum
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160983255 --- Diff: python/pyspark/sql/context.py --- @@ -203,18 +203,46 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType()) >>> sqlContext.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" +return self.sparkSession.catalog.registerFunction(name, f, returnType) + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): --- End diff -- I am not sure about the difference between `spark.udf.registerUDF`, `sqlContext.udf.registerUDF`, and `sqlContext.registerUDF`. There seem to be too many ways to do the same thing... But if we do need to keep multiple methods, I would lean towards having comprehensive docs in one of them and having the docs for the rest be something like """ Same as :meth:... """
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20226 LGTM too
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160981094 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): --- End diff -- I am fine with it as well.
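The `hasattr(f, 'asNondeterministic')` check being reviewed is a duck-typing test: a wrapped `UserDefinedFunction` exposes `asNondeterministic`, while a plain Python function or lambda does not. A standalone sketch of how that routing works (the `WrappedUDF` class and return value are illustrative stand-ins, not PySpark's actual implementation):

```python
class WrappedUDF:
    """Illustrative stand-in for a wrapped UserDefinedFunction."""
    def __init__(self, func):
        self.func = func
        self.deterministic = True

    def asNondeterministic(self):
        # Marks the UDF as nondeterministic; the presence of this
        # method is what the hasattr check keys on.
        self.deterministic = False
        return self

def register_function(name, f):
    # Reject wrapped UDFs: they should go through registerUDF instead.
    if hasattr(f, 'asNondeterministic'):
        raise ValueError(
            "Please use registerUDF for a UserDefinedFunction; "
            "registerFunction expects a plain Python function.")
    return (name, f)  # stand-in for the real registration step

register_function("strlen", len)               # plain function: accepted
try:
    register_function("udf", WrappedUDF(len))  # wrapped UDF: rejected
except ValueError as e:
    print("rejected:", e)
```

The trade-off of duck typing over `isinstance` is that any object with an `asNondeterministic` attribute is treated as a UDF, which keeps the check working across the wrapped and native variants mentioned in the comment.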
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160980765 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- I also think changing the property (name) of the input udf object in `registerUDF` is unintuitive.
[GitHub] spark pull request #20235: [Spark-22887][ML][TESTS][WIP] ML test for Structu...
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20235#discussion_r160980529 --- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala --- @@ -34,86 +35,122 @@ class FPGrowthSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul } test("FPGrowth fit and transform with different data types") { -Array(IntegerType, StringType, ShortType, LongType, ByteType).foreach { dt => - val data = dataset.withColumn("items", col("items").cast(ArrayType(dt))) - val model = new FPGrowth().setMinSupport(0.5).fit(data) - val generatedRules = model.setMinConfidence(0.5).associationRules - val expectedRules = spark.createDataFrame(Seq( -(Array("2"), Array("1"), 1.0), -(Array("1"), Array("2"), 0.75) - )).toDF("antecedent", "consequent", "confidence") -.withColumn("antecedent", col("antecedent").cast(ArrayType(dt))) -.withColumn("consequent", col("consequent").cast(ArrayType(dt))) - assert(expectedRules.sort("antecedent").rdd.collect().sameElements( -generatedRules.sort("antecedent").rdd.collect())) - - val transformed = model.transform(data) - val expectedTransformed = spark.createDataFrame(Seq( -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "3"), Array(2)) - )).toDF("id", "items", "prediction") -.withColumn("items", col("items").cast(ArrayType(dt))) -.withColumn("prediction", col("prediction").cast(ArrayType(dt))) - assert(expectedTransformed.collect().toSet.equals( -transformed.collect().toSet)) + class DataTypeWithEncoder[A](val a: DataType) + (implicit val encoder: Encoder[(Int, Array[A], Array[A])]) --- End diff -- Done, thx.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160980191 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or --- End diff -- Yeah. I am fine with the doc here. But similar to https://github.com/apache/spark/pull/20217#discussion_r160979747 we should probably standardize the description. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20235: [Spark-22887][ML][TESTS][WIP] ML test for Structu...
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20235#discussion_r160979218 --- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala --- @@ -34,86 +35,122 @@ class FPGrowthSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul } test("FPGrowth fit and transform with different data types") { -Array(IntegerType, StringType, ShortType, LongType, ByteType).foreach { dt => - val data = dataset.withColumn("items", col("items").cast(ArrayType(dt))) - val model = new FPGrowth().setMinSupport(0.5).fit(data) - val generatedRules = model.setMinConfidence(0.5).associationRules - val expectedRules = spark.createDataFrame(Seq( -(Array("2"), Array("1"), 1.0), -(Array("1"), Array("2"), 0.75) - )).toDF("antecedent", "consequent", "confidence") -.withColumn("antecedent", col("antecedent").cast(ArrayType(dt))) -.withColumn("consequent", col("consequent").cast(ArrayType(dt))) - assert(expectedRules.sort("antecedent").rdd.collect().sameElements( -generatedRules.sort("antecedent").rdd.collect())) - - val transformed = model.transform(data) - val expectedTransformed = spark.createDataFrame(Seq( -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "3"), Array(2)) - )).toDF("id", "items", "prediction") -.withColumn("items", col("items").cast(ArrayType(dt))) -.withColumn("prediction", col("prediction").cast(ArrayType(dt))) - assert(expectedTransformed.collect().toSet.equals( -transformed.collect().toSet)) + class DataTypeWithEncoder[A](val a: DataType) + (implicit val encoder: Encoder[(Int, Array[A], Array[A])]) --- End diff -- In DataTypeWithEncoder I would suggest to rename the val "a" to "dataType". --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160979747 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or + scalar vectorized. Grouped vectorized UDFs are not supported. --- End diff -- Ok. Maybe as a follow up we can standardize the language to describe the `udf` and `pandas_udf` objects. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20072 **[Test build #85970 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85970/testReport)** for PR 20072 at commit [`6fe8589`](https://github.com/apache/spark/commit/6fe85892ca6c53b3aa965951dc7ef51df9d068e2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...
Github user ho3rexqj commented on the issue: https://github.com/apache/spark/pull/20183 Updated the above, thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20158: [PySpark] Fix typo in comments in PySpark's udf() defini...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20158 @rednaxelafx, can you fix the one in `pandas_udf` too? I'll just merge this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20204 **[Test build #85969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85969/testReport)** for PR 20204 at commit [`e8e7112`](https://github.com/apache/spark/commit/e8e71128951f64d6f70c0bac1729e33629bf7dd1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20235: [Spark-22887][ML][TESTS][WIP] ML test for Structu...
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20235#discussion_r160969767 --- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala --- @@ -34,86 +35,122 @@ class FPGrowthSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul } test("FPGrowth fit and transform with different data types") { -Array(IntegerType, StringType, ShortType, LongType, ByteType).foreach { dt => - val data = dataset.withColumn("items", col("items").cast(ArrayType(dt))) - val model = new FPGrowth().setMinSupport(0.5).fit(data) - val generatedRules = model.setMinConfidence(0.5).associationRules - val expectedRules = spark.createDataFrame(Seq( -(Array("2"), Array("1"), 1.0), -(Array("1"), Array("2"), 0.75) - )).toDF("antecedent", "consequent", "confidence") -.withColumn("antecedent", col("antecedent").cast(ArrayType(dt))) -.withColumn("consequent", col("consequent").cast(ArrayType(dt))) - assert(expectedRules.sort("antecedent").rdd.collect().sameElements( -generatedRules.sort("antecedent").rdd.collect())) - - val transformed = model.transform(data) - val expectedTransformed = spark.createDataFrame(Seq( -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "3"), Array(2)) - )).toDF("id", "items", "prediction") -.withColumn("items", col("items").cast(ArrayType(dt))) -.withColumn("prediction", col("prediction").cast(ArrayType(dt))) - assert(expectedTransformed.collect().toSet.equals( -transformed.collect().toSet)) + class DataTypeWithEncoder[A](val a: DataType) + (implicit val encoder: Encoder[(Int, Array[A], Array[A])]) --- End diff -- This class is needed for two purposes: 1. to connect data types with their corresponding DataType. Note: this information is already available in AtomicType as InternalType, but it's not accessible. Using it from this test doesn't justify making it public. 2. 
to get the proper encoder to the testTransformer method. As the datatypes are put into an array, dt is inferred to be their parent type, and implicit search is able to find the encoders only for concrete types. For a similar reason, we need to use the type of the final encoder. If we have only the encoder for A, implicit search will not be able to construct Array[A], as we have implicit encoders for Array[Int], Array[Short]... but not for a generic A that has an encoder.
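The pattern described above — bundling a data type together with the encoder that generic resolution cannot find on its own — has a rough dynamic-language analog: carry the codec alongside the type tag explicitly. A hedged Python sketch (names are invented for illustration, not the Scala test API):

```python
class DataTypeWithEncoder:
    """Pairs a concrete type tag with the encoder for values of that type."""
    def __init__(self, data_type, encoder):
        self.data_type = data_type
        self.encoder = encoder

# One entry per concrete element type, mirroring the per-type implicits
# (Array[Int], Array[Short], ...) that exist only for concrete types.
cases = [
    DataTypeWithEncoder(int, lambda xs: [int(x) for x in xs]),
    DataTypeWithEncoder(str, lambda xs: [str(x) for x in xs]),
]

for case in cases:
    encoded = case.encoder(["1", "2", "3"])
    # Each case carries exactly the encoder matching its own type tag.
    assert all(isinstance(x, case.data_type) for x in encoded)
print("all cases encoded with their paired encoder")
```

Because each loop iteration gets its encoder from the same object as its type, there is no generic-over-all-types lookup to fail — which is the essence of the Scala workaround.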
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20153 **[Test build #85968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85968/testReport)** for PR 20153 at commit [`b8a700d`](https://github.com/apache/spark/commit/b8a700d87d3708bae34054a00ad5d489280e5852).
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20153 I think `ColumnarBatchScan` is fine, `SupportsScanColumnarBatch` also has an `enableBatchRead` to fall back.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160960075 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- Because I thought it registers it to SQL statement with the given name and the defined UDF instance (`f`) shouldn't change its name itself at the register time. Up to my knowledge, we happened to have `registerUDF` to avoid such problem in `returnType`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
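The naming concern raised here — that registering under `name` should not rename the UDF object itself — can be illustrated with a tiny registry sketch in plain Python (hypothetical names, not the PySpark implementation):

```python
def make_udf(f, name):
    """Build a minimal UDF record carrying its own name (illustrative)."""
    return {"func": f, "name": name}

registry = {}

def register_udf(name, udf):
    # Register under the SQL-visible name, but leave the UDF's own
    # `name` field untouched: registration is a catalog operation,
    # not a mutation of the function object.
    registry[name] = udf
    return udf

u = make_udf(lambda s: len(s), name="strlen")
register_udf("stringLength", u)

assert "stringLength" in registry
assert u["name"] == "strlen"   # the UDF object keeps its original name
print("registered as 'stringLength'; udf still named", u["name"])
```

Keeping registration side-effect-free on the object is what makes it safe to register the same UDF under several SQL names.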
[GitHub] spark issue #20235: [Spark-22887][ML][TESTS][WIP] ML test for StructuredStre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20235 Can one of the admins verify this patch?
[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20151 Will merge this one if there isn't any objection. I believe this doesn't affect the existing code path anyway.
[GitHub] spark pull request #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test sui...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20218
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160958943 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- Why? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20234: [SPARK-19732] [Follow-up] Document behavior chang...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20234
[GitHub] spark issue #20231: [SPARK-23000][TEST-HADOOP2.6] Fix Flaky test suite DataS...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20231 I saw the test history. `DataSourceWithHiveMetastoreCatalogSuite` can still pass.
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20234 Merged to master and branch-2.3.
[GitHub] spark pull request #20235: [Spark-22887][ML][TESTS][WIP] ML test for Structu...
GitHub user smurakozi opened a pull request: https://github.com/apache/spark/pull/20235 [Spark-22887][ML][TESTS][WIP] ML test for StructuredStreaming: spark.ml.fpm ## What changes were proposed in this pull request? Converting FPGrowth tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882. Note: this is a WIP, test with Array[Byte] is not yet working due to some datatype issues (Array[Byte] vs Binary). ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/smurakozi/spark SPARK-22887 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20235.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20235 commit 331129556003bcf6e4bab6559e80e46ac0858706 Author: Sandor Murakozi Date: 2018-01-05T12:41:53Z test 'FPGrowthModel setMinConfidence should affect rules generation and transform' is converted to use testTransformer commit 93aff2c999eee4a88f7f4a3c32d6c7b601a918ac Author: Sandor Murakozi Date: 2018-01-08T13:14:38Z Test 'FPGrowth fit and transform with different data types' works with streaming, except for Byte commit 8b0b00070a21bd47537a7c3ad580e2af38a481bd Author: Sandor Murakozi Date: 2018-01-11T11:28:46Z All tests use testTransformer. Test with Array[Byte] is missing. commit af61845ab6acfa82c4411bce3ab4a20afebd0aa3 Author: Sandor Murakozi Date: 2018-01-11T11:49:27Z Unintentional changes in 93aff2c999 are reverted
[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20153#discussion_r160958116 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import java.util.List; + +import org.apache.spark.annotation.InterfaceStability; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.vectorized.ColumnarBatch; + +/** + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this + * interface to output {@link ColumnarBatch} and make the scan faster. + */ +@InterfaceStability.Evolving +public interface SupportsScanColumnarBatch extends DataSourceV2Reader { + @Override + default ListcreateReadTasks() { +throw new IllegalStateException( + "createReadTasks should not be called with SupportsScanColumnarBatch."); + } + + /** + * Similar to {@link DataSourceV2Reader#createReadTasks()}, but returns columnar data in batches. + */ + List createBatchReadTasks(); + + /** + * A safety door for columnar batch reader. 
It's possible that the implementation can only support + * some certain columns with certain types. Users can overwrite this method and + * {@link #createReadTasks()} to fallback to normal read path under some conditions. */ + default boolean enableBatchRead() { --- End diff -- Yea you can interpret it in this way (read data from columnar storage or row storage), but we can also interpret it as reading a batch of records at a time or one record at a time.
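The mix-in pattern under discussion — a default method that lets an implementation opt out of the batch path — maps naturally onto default/overridable methods. A rough Python analog (an illustrative toy interface, not Spark's actual `SupportsScanColumnarBatch`):

```python
class DataSourceReader:
    """Base reader: produces row-oriented read tasks."""
    def create_read_tasks(self):
        return ["row-task"]

class SupportsScanColumnarBatch(DataSourceReader):
    # Mix-in: batch scan by default, with an escape hatch back to rows.
    def enable_batch_read(self):
        return True

    def create_batch_read_tasks(self):
        return ["batch-task"]

    def scan(self):
        if self.enable_batch_read():
            return self.create_batch_read_tasks()
        return self.create_read_tasks()  # fall back to the normal read path

class PartialBatchReader(SupportsScanColumnarBatch):
    # A source that only supports batches for some schemas opts out here.
    def enable_batch_read(self):
        return False

print(SupportsScanColumnarBatch().scan())  # ['batch-task']
print(PartialBatchReader().scan())         # ['row-task']
```

The design point in the thread is exactly this escape hatch: the mix-in advertises the batch capability, while `enable_batch_read` lets each implementation decide per instance whether the batch path is actually usable.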
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20217 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85964/ Test PASSed.
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20217 Merged build finished. Test PASSed.
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20217 **[Test build #85964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85964/testReport)** for PR 20217 at commit [`065a21b`](https://github.com/apache/spark/commit/065a21b879bde3e3de2c506d12e8dec3ed48caab). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20223#discussion_r160957301 --- Diff: core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java --- @@ -137,7 +139,9 @@ public void testInProcessLauncher() throws Exception { // Here DAGScheduler is stopped, while SparkContext.clearActiveContext may not be called yet. // Wait for a reasonable amount of time to avoid creating two active SparkContext in JVM. --- End diff -- why is `500 ms` not a reasonable waiting time anymore?
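The flakiness behind this timeout question is the usual problem with fixed sleeps in tests: any constant is sometimes too short and always wasteful when the condition is already true. The robust idiom is "poll until the condition holds, up to a deadline" — sketched here in Python (not the actual test code):

```python
import time

def wait_for(condition, timeout=10.0, interval=0.05):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one last check at the deadline

# Hypothetical stand-in for "SparkContext.clearActiveContext was called".
state = {"context_cleared": False}
state["context_cleared"] = True

assert wait_for(lambda: state["context_cleared"], timeout=1.0)
print("condition observed without relying on a fixed 500 ms sleep")
```

A generous `timeout` then bounds only the worst case; the common case returns as soon as the condition is observed.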
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20189 **[Test build #85967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85967/testReport)** for PR 20189 at commit [`c67e07b`](https://github.com/apache/spark/commit/c67e07b66726d57c2b7a3462b226439518d9ebed).
[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20223#discussion_r160956968 --- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java --- @@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws IOException { } @Override - public void close() throws IOException { + public synchronized void close() throws IOException { if (!closed) { - synchronized (this) { --- End diff -- what's wrong with this? It's a classic "double-checked locking" in Java.
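For readers unfamiliar with the idiom being debated: double-checked locking reads the flag once without the lock (a cheap fast path) and re-checks it under the lock before acting. A minimal Python rendering of the `close()` shape in question (illustrative only; Python's GIL changes the memory-visibility story relative to Java, where the unlocked read needs `volatile` to be safe):

```python
import threading

class Connection:
    def __init__(self):
        self._closed = False
        self._lock = threading.Lock()
        self.close_count = 0  # instrumentation: how many times close ran

    def close(self):
        if not self._closed:            # first check, no lock held
            with self._lock:
                if not self._closed:    # second check, under the lock
                    self._closed = True
                    self.close_count += 1

conn = Connection()
threads = [threading.Thread(target=conn.close) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert conn.close_count == 1
print("close ran exactly once across", len(threads), "threads")
```

The second check is what makes the pattern correct: two threads may both pass the first check, but only one can win the lock while the flag is still unset.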
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20234 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85965/ Test PASSed.
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20234 Merged build finished. Test PASSed.
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20234 **[Test build #85965 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85965/testReport)** for PR 20234 at commit [`a2475ea`](https://github.com/apache/spark/commit/a2475ea5b86acee2380884db0756a833016b69a0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20199: [Spark-22967][TESTS]Fix VersionSuite's unit tests...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20199
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20222 **[Test build #85966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85966/testReport)** for PR 20222 at commit [`5eded03`](https://github.com/apache/spark/commit/5eded033a0b352e7a799c7890131d8075475c8ff).
[GitHub] spark issue #20199: [Spark-22967][TESTS]Fix VersionSuite's unit tests by cha...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20199 Merged to master and branch-2.3.
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20218 Thanks! Merged to master/2.3. Hopefully, this can fix the test failure.
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20222 retest this please
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20234 **[Test build #85965 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85965/testReport)** for PR 20234 at commit [`a2475ea`](https://github.com/apache/spark/commit/a2475ea5b86acee2380884db0756a833016b69a0).
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160953629 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- Hm .. actually I think we should not change the name of the returned UDF? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20223#discussion_r160952457 --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java --- @@ -91,10 +92,15 @@ LauncherConnection getConnection() { return connection; } - boolean isDisposed() { + synchronized boolean isDisposed() { return disposed; --- End diff -- can we simply mark `disposed` as `transient`?
[GitHub] spark pull request #20234: [SPARK-19732] [Follow-up] Document behavior chang...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20234#discussion_r160952123 --- Diff: docs/sql-programming-guide.md --- @@ -1788,12 +1788,10 @@ options. Note that, for DecimalType(38,0)*, the table above intentionally does not cover all other combinations of scales and precisions because currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 1.1 is inferred as double type. - In PySpark, now we need Pandas 0.19.2 or upper if you want to use Pandas related functionalities, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc. - In PySpark, the behavior of timestamp values for Pandas related functionalities was changed to respect session timezone. If you want to use the old behavior, you need to set a configuration `spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See [SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details. - - - Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489). - - - Since Spark 2.3, when all inputs are binary, `functions.concat()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.concatBinaryAsString` to `true`. - - - Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.eltOutputAsString` to `true`. + - In PySpark, `na.fill()` or `fillna` also accepts boolean and replaces NAs with booleans. 
In prior Spark versions, PySpark just ignores it and returns the original Dataset/DataFrame. --- End diff -- Sounds good to me.
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20217 **[Test build #85964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85964/testReport)** for PR 20217 at commit [`065a21b`](https://github.com/apache/spark/commit/065a21b879bde3e3de2c506d12e8dec3ed48caab).
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160950622 --- Diff: python/pyspark/sql/context.py --- @@ -578,6 +606,9 @@ def __init__(self, sqlContext): def register(self, name, f, returnType=StringType()): return self.sqlContext.registerFunction(name, f, returnType) +def registerUDF(self, name, f): --- End diff -- The examples in these docs are different. Thus, I prefer to keep it untouched.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160950521 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or + scalar vectorized. Grouped vectorized UDFs are not supported. 
+:return: a wrapped :class:`UserDefinedFunction` + +>>> from pyspark.sql.types import IntegerType +>>> from pyspark.sql.functions import udf +>>> slen = udf(lambda s: len(s), IntegerType()) +>>> _ = spark.udf.registerUDF("slen", slen) +>>> spark.sql("SELECT slen('test')").collect() +[Row(slen(test)=4)] >>> import random >>> from pyspark.sql.functions import udf ->>> from pyspark.sql.types import IntegerType, StringType +>>> from pyspark.sql.types import IntegerType >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic() ->>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType()) +>>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf) >>> spark.sql("SELECT random_udf()").collect() # doctest: +SKIP -[Row(random_udf()=u'82')] +[Row(random_udf()=82)] >>> spark.range(1).select(newRandom_udf()).collect() # doctest: +SKIP -[Row(random_udf()=u'62')] +[Row(random_udf()=62)] + +>>> from pyspark.sql.functions import pandas_udf, PandasUDFType +>>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP +... def add_one(x): +... return x + 1 +... +>>> _ = spark.udf.registerUDF("add_one", add_one) # doctest: +SKIP +>>> spark.sql("SELECT add_one(id) FROM range(10)").collect() # doctest: +SKIP --- End diff -- Done.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160950460 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or + scalar vectorized. Grouped vectorized UDFs are not supported. --- End diff -- I think the following examples are self-descriptive. I can emphasize it in the description too.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160950226 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or --- End diff -- > :param udf: A function object returned by :meth:`pyspark.sql.functions.pandas_udf` This is actually not accurate. We have another way to define the scalar vectorized UDF. ``` >>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP ... def add_one(x): ... return x + 1 ... ```
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160949917 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): --- End diff -- Nope, to be clear, I am fine with it as is.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160949520 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- Actually, `name of the UDF in SQL statement` is wrong. The name is also used as the name of the returned UDF.
[GitHub] spark pull request #20234: [SPARK-19732] [Follow-up] Document behavior chang...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20234#discussion_r160949121 --- Diff: docs/sql-programming-guide.md --- @@ -1788,12 +1788,10 @@ options. Note that, for DecimalType(38,0)*, the table above intentionally does not cover all other combinations of scales and precisions because currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 1.1 is inferred as double type. - In PySpark, now we need Pandas 0.19.2 or upper if you want to use Pandas related functionalities, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc. - In PySpark, the behavior of timestamp values for Pandas related functionalities was changed to respect session timezone. If you want to use the old behavior, you need to set a configuration `spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See [SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details. - - - Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489). - - - Since Spark 2.3, when all inputs are binary, `functions.concat()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.concatBinaryAsString` to `true`. - - - Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.eltOutputAsString` to `true`. + - In PySpark, `na.fill()` or `fillna` also accepts boolean and replaces NAs with booleans. 
In prior Spark versions, PySpark just ignores it and returns the original Dataset/DataFrame. --- End diff -- Shall we say `null` instead of `NA`? I actually think `null` is more correct.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160949342 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): --- End diff -- The current way is simple. Any hole?
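The `hasattr` check being questioned above can be sketched in plain Python. `looks_like_udf` and `FakeUdf` are hypothetical names used only for illustration; the real check lives inline in `registerFunction`:

```python
def looks_like_udf(f):
    # mirrors the diff: anything exposing asNondeterministic is treated
    # as an already-constructed (wrapped/native) UserDefinedFunction
    return hasattr(f, 'asNondeterministic')

# a plain function or lambda does not trip the check, so it passes
# through to registerFunction's normal path
print(looks_like_udf(len))          # False

# the potential "hole": any unrelated object that happens to define an
# attribute with that name is also classified as a UDF and rejected
class FakeUdf:
    def asNondeterministic(self):
        return self

print(looks_like_udf(FakeUdf()))    # True
```

In practice that false positive is unlikely for user code, which is presumably why the thread settles on keeping the simple duck-typed check.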
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160948952 --- Diff: python/pyspark/sql/context.py --- @@ -203,18 +203,46 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType()) >>> sqlContext.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" +return self.sparkSession.catalog.registerFunction(name, f, returnType) + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): --- End diff -- The test cases are actually different. In this file, all the examples are using sqlContext.
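The delegation pattern in this diff, where `SQLContext` methods forward to the session catalog, can be sketched in plain Python. `Catalog`, `SQLContext`, and the tuple return value here are simplified stand-ins, not the real PySpark classes:

```python
class Catalog:
    """Stand-in for SparkSession.catalog, which owns the real logic."""
    def registerFunction(self, name, f, returnType=None):
        # the real method builds a UserDefinedFunction and registers it
        # with the JVM; here we just record the call
        return ("registered", name)

class SQLContext:
    """SQLContext keeps a thin compatibility API: its register methods
    do no work of their own and simply forward to the catalog."""
    def __init__(self):
        self.catalog = Catalog()

    def registerFunction(self, name, f, returnType=None):
        return self.catalog.registerFunction(name, f, returnType)

ctx = SQLContext()
print(ctx.registerFunction("slen", len))   # -> ('registered', 'slen')
```

Keeping `SQLContext` as a pass-through is what lets the doctests in `context.py` stay phrased in terms of `sqlContext` while the behavior is defined once on the catalog.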
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20234 Merged build finished. Test PASSed.
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20234 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85963/ Test PASSed.
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20234 **[Test build #85963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85963/testReport)** for PR 20234 at commit [`ff30553`](https://github.com/apache/spark/commit/ff30553092a7bfe8d9aac3fc1f89b99ff679a2aa). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20218 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85959/ Test PASSed.
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20218 Merged build finished. Test PASSed.