[GitHub] spark pull request #20183: [SPARK-22986][Core] Fix/cache broadcast values
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r160881323

--- Diff: core/src/test/scala/org/apache/spark/broadcast/BroadcastSuite.scala ---
@@ -153,6 +153,40 @@ class BroadcastSuite extends SparkFunSuite with LocalSparkContext with Encryptio
     assert(broadcast.value.sum === 10)
   }

+  test("One broadcast value instance per executor") {
+    val conf = new SparkConf()
+      .setMaster("local[10]")
--- End diff --

nit: normally `local[4]` is sufficient.
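As an aside, a sketch of the test with the suggested `local[4]` is below. The body is illustrative, not the PR's actual assertion, and it assumes the `sc` field managed by the suite's `LocalSparkContext` mixin:

```scala
test("One broadcast value instance per executor") {
  val conf = new SparkConf()
    .setMaster("local[4]") // four local cores are enough to run tasks concurrently
    .setAppName("One broadcast value instance per executor")
  sc = new SparkContext(conf)
  val broadcast = sc.broadcast(Seq(1, 2, 3, 4))
  // Each task records the identity of the broadcast value it observes; with
  // caching in place, all tasks in the single local executor share one instance.
  val identities = sc.parallelize(1 to 8, 8)
    .map(_ => System.identityHashCode(broadcast.value))
    .collect().toSet
  assert(identities.size === 1)
}
```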
[GitHub] spark issue #20183: [SPARK-22986][Core] Fix/cache broadcast values
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20183 Please also update the PR title and description. Thanks!
[GitHub] spark issue #20230: [SPARK-23038][TEST] Update docker/spark-test (JDK/OS)
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20230 **[Test build #85953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85953/testReport)** for PR 20230 at commit [`cc3321c`](https://github.com/apache/spark/commit/cc3321c20fd0dc2ef75a8740b5c0292beef98beb).
[GitHub] spark pull request #20230: [SPARK-23038][TEST] Update docker/spark-test (JDK...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20230#discussion_r160880463

--- Diff: external/docker/spark-test/base/Dockerfile ---
@@ -15,14 +15,14 @@
 # limitations under the License.
 #

-FROM ubuntu:precise
+FROM ubuntu:xenial

 # Upgrade package index
-# install a few other useful packages plus Open Jdk 7
+# install a few other useful packages plus Open Jdk 8
 # Remove unneeded /var/lib/apt/lists/* after install to reduce the
 # docker image size (by ~30MB)
 RUN apt-get update && \
-    apt-get install -y less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server && \
+    apt-get install -y less openjdk-8-jre-headless iproute2 vim-tiny sudo openssh-server && \
--- End diff --

This is required to use the [ip](https://github.com/apache/spark/blob/master/external/docker/spark-test/master/default_cmd#L20) command in `xenial`.
[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20153#discussion_r160880016

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala ---
@@ -37,40 +35,58 @@ import org.apache.spark.sql.types.StructType
  */
 case class DataSourceV2ScanExec(
     fullOutput: Seq[AttributeReference],
-    @transient reader: DataSourceV2Reader) extends LeafExecNode with DataSourceReaderHolder {
+    @transient reader: DataSourceV2Reader)
+  extends LeafExecNode with DataSourceReaderHolder with ColumnarBatchScan {

   override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2ScanExec]

-  override def references: AttributeSet = AttributeSet.empty
-
-  override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
-
-  override protected def doExecute(): RDD[InternalRow] = {
-    val readTasks: java.util.List[ReadTask[UnsafeRow]] = reader match {
-      case r: SupportsScanUnsafeRow => r.createUnsafeRowReadTasks()
-      case _ =>
-        reader.createReadTasks().asScala.map {
-          new RowToUnsafeRowReadTask(_, reader.readSchema()): ReadTask[UnsafeRow]
-        }.asJava
-    }
+  override def producedAttributes: AttributeSet = AttributeSet(fullOutput)
+
+  private lazy val inputRDD: RDD[InternalRow] = reader match {
+    case r: SupportsScanColumnarBatch if r.enableBatchRead() =>
+      assert(!reader.isInstanceOf[ContinuousReader],
+        "continuous stream reader does not support columnar read yet.")
+      new DataSourceRDD(sparkContext, r.createBatchReadTasks()).asInstanceOf[RDD[InternalRow]]
+
+    case _ =>
+      val readTasks: java.util.List[ReadTask[UnsafeRow]] = reader match {
+        case r: SupportsScanUnsafeRow => r.createUnsafeRowReadTasks()
+        case _ =>
+          reader.createReadTasks().asScala.map {
+            new RowToUnsafeRowReadTask(_, reader.readSchema()): ReadTask[UnsafeRow]
+          }.asJava
+      }
+
+      reader match {
--- End diff --

This looks a bit messy; can we move `readTasks` out as a lazy val? Then we may have:
```
private lazy val readTasks = ..

private lazy val inputRDD: RDD[InternalRow] = reader match {
  case r: SupportsScanColumnarBatch if r.enableBatchRead() => ..
  case _: ContinuousReader => ..
  case _ => ..
}
```
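Filled in from the quoted hunk, the suggested shape could look like the sketch below. The names come from the diff itself; the `ContinuousReader` branch is a placeholder, since its body is not shown in this hunk:

```scala
private lazy val readTasks: java.util.List[ReadTask[UnsafeRow]] = reader match {
  case r: SupportsScanUnsafeRow => r.createUnsafeRowReadTasks()
  case _ =>
    reader.createReadTasks().asScala.map {
      new RowToUnsafeRowReadTask(_, reader.readSchema()): ReadTask[UnsafeRow]
    }.asJava
}

private lazy val inputRDD: RDD[InternalRow] = reader match {
  case r: SupportsScanColumnarBatch if r.enableBatchRead() =>
    assert(!reader.isInstanceOf[ContinuousReader],
      "continuous stream reader does not support columnar read yet.")
    new DataSourceRDD(sparkContext, r.createBatchReadTasks()).asInstanceOf[RDD[InternalRow]]
  case _: ContinuousReader =>
    // placeholder: the continuous-mode RDD would be built from `readTasks` here
    throw new UnsupportedOperationException("sketch only")
  case _ =>
    new DataSourceRDD(sparkContext, readTasks).asInstanceOf[RDD[InternalRow]]
}
```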
[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20153#discussion_r160877601

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.util.List;
+
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+
+/**
+ * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this
+ * interface to output {@link ColumnarBatch} and make the scan faster.
+ */
+@InterfaceStability.Evolving
+public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
+  @Override
+  default List<ReadTask<Row>> createReadTasks() {
+    throw new IllegalStateException(
+      "createReadTasks should not be called with SupportsScanColumnarBatch.");
--- End diff --

`createReadTasks not supported by default within SupportsScanColumnarBatch.`, since we allow users to fall back to the normal read path.
[GitHub] spark pull request #20230: [SPARK-23038][TEST] Update docker/spark-test (JDK...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/20230

[SPARK-23038][TEST] Update docker/spark-test (JDK/OS)

## What changes were proposed in this pull request?

This PR aims to update the following in `docker/spark-test`.
- JDK7 -> JDK8: Spark 2.2+ supports JDK8 only.
- Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial): the end of life of `precise` was April 28, 2017.

## How was this patch tested?

Manual.

* Master
```
$ cd external/docker
$ ./build
$ export SPARK_HOME=...
$ docker run -v $SPARK_HOME:/opt/spark spark-test-master
CONTAINER_IP=172.17.0.3
...
18/01/11 06:50:25 INFO MasterWebUI: Bound MasterWebUI to 172.17.0.3, and started at http://172.17.0.3:8080
18/01/11 06:50:25 INFO Utils: Successfully started service on port 6066.
18/01/11 06:50:25 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
18/01/11 06:50:25 INFO Master: I have been elected leader! New state: ALIVE
```

* Slave
```
$ docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://172.17.0.3:7077
CONTAINER_IP=172.17.0.4
...
18/01/11 06:51:54 INFO Worker: Successfully registered with master spark://172.17.0.3:7077
```

After the slave starts, the master will show
```
18/01/11 06:51:54 INFO Master: Registering worker 172.17.0.4: with 4 cores, 1024.0 MB RAM
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-23038

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20230.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20230

commit cc3321c20fd0dc2ef75a8740b5c0292beef98beb
Author: Dongjoon Hyun
Date: 2018-01-11T07:18:48Z

    [SPARK-23038][TEST] Update docker/spark-test (JDK/OS)
[GitHub] spark pull request #20199: [Spark-22967][TESTS]Fix VersionSuite's unit tests...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20199#discussion_r160879315

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -842,6 +842,7 @@ class VersionsSuite extends SparkFunSuite with Logging {
   }

   test(s"$version: SPARK-17920: Insert into/overwrite avro table") {
+    assume(!(Utils.isWindows && version == "0.12"))
--- End diff --

Nice. Yea, let's merge this one after adding a comment saying it's skipped because it fails under this condition on Windows. Then, see if we can fix the root cause, and re-enable this test in the PR if everything goes well.
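For illustration, the requested comment might read like the sketch below; the wording is an assumption, not the PR's final text:

```scala
test(s"$version: SPARK-17920: Insert into/overwrite avro table") {
  // This test currently fails on Windows against Hive 0.12, so skip it there
  // until the root cause is fixed and the test can be re-enabled.
  assume(!(Utils.isWindows && version == "0.12"))
  // ... original test body unchanged ...
}
```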
[GitHub] spark pull request #20229: [SPARK-23037][ML] Update RFormula to use VectorSi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20229#discussion_r160878813

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -199,6 +199,7 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
     val parsedFormula = RFormulaParser.parse($(formula))
     val resolvedFormula = parsedFormula.resolve(dataset.schema)
     val encoderStages = ArrayBuffer[PipelineStage]()
+    val oneHotEncodeStages = ArrayBuffer[(String, String)]()
--- End diff --

`oneHotEncodeColumns`
[GitHub] spark pull request #20229: [SPARK-23037][ML] Update RFormula to use VectorSi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20229#discussion_r160878716

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -228,22 +229,33 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
     // Then we handle one-hot encoding and interactions between terms.
     var keepReferenceCategory = false
     val encodedTerms = resolvedFormula.terms.map {
-      case Seq(term) if dataset.schema(term).dataType == StringType =>
-        val encodedCol = tmpColumn("onehot")
-        var encoder = new OneHotEncoder()
-          .setInputCol(indexed(term))
-          .setOutputCol(encodedCol)
-        // Formula w/o intercept, one of the categories in the first category feature is
-        // being used as reference category, we will not drop any category for that feature.
-        if (!hasIntercept && !keepReferenceCategory) {
-          encoder = encoder.setDropLast(false)
-          keepReferenceCategory = true
-        }
-        encoderStages += encoder
-        prefixesToRewrite(encodedCol + "_") = term + "_"
-        encodedCol
       case Seq(term) =>
-        term
+        dataset.schema(term).dataType match {
+          case _: StringType =>
+            val encodedCol = tmpColumn("onehot")
+            // Formula w/o intercept, one of the categories in the first category feature is
+            // being used as reference category, we will not drop any category for that feature.
+            if (!hasIntercept && !keepReferenceCategory) {
+              keepReferenceCategory = true
+              encoderStages += new OneHotEncoderEstimator(uid)
--- End diff --

Oh, I see. There is one `OneHotEncoderEstimator` for `dropLast=false`.
[GitHub] spark issue #20229: [SPARK-23037][ML] Update RFormula to use VectorSizeHint ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20229 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85950/ Test PASSed.
[GitHub] spark issue #20229: [SPARK-23037][ML] Update RFormula to use VectorSizeHint ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20229 Merged build finished. Test PASSed.
[GitHub] spark issue #20229: [SPARK-23037][ML] Update RFormula to use VectorSizeHint ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20229 **[Test build #85950 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85950/testReport)** for PR 20229 at commit [`7bad275`](https://github.com/apache/spark/commit/7bad275cd995d22d70a42a5b9932073bc3a5d1e5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20229: [SPARK-23037][ML] Update RFormula to use VectorSi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20229#discussion_r160877856

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -228,22 +229,33 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
     // Then we handle one-hot encoding and interactions between terms.
     var keepReferenceCategory = false
     val encodedTerms = resolvedFormula.terms.map {
-      case Seq(term) if dataset.schema(term).dataType == StringType =>
-        val encodedCol = tmpColumn("onehot")
-        var encoder = new OneHotEncoder()
-          .setInputCol(indexed(term))
-          .setOutputCol(encodedCol)
-        // Formula w/o intercept, one of the categories in the first category feature is
-        // being used as reference category, we will not drop any category for that feature.
-        if (!hasIntercept && !keepReferenceCategory) {
-          encoder = encoder.setDropLast(false)
-          keepReferenceCategory = true
-        }
-        encoderStages += encoder
-        prefixesToRewrite(encodedCol + "_") = term + "_"
-        encodedCol
       case Seq(term) =>
-        term
+        dataset.schema(term).dataType match {
+          case _: StringType =>
+            val encodedCol = tmpColumn("onehot")
+            // Formula w/o intercept, one of the categories in the first category feature is
+            // being used as reference category, we will not drop any category for that feature.
+            if (!hasIntercept && !keepReferenceCategory) {
+              keepReferenceCategory = true
+              encoderStages += new OneHotEncoderEstimator(uid)
--- End diff --

As `OneHotEncoderEstimator` can handle multiple columns now, can't we just have one `OneHotEncoderEstimator` to fit all terms at once?
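For context, the multi-column API takes parallel arrays of input and output columns, and `dropLast` is a single flag that applies to every input column at once, which is presumably why the reference-category term still needs its own estimator. A minimal sketch with illustrative column names:

```scala
import org.apache.spark.ml.feature.OneHotEncoderEstimator

// One estimator encodes several indexed string columns in a single stage.
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("term1_index", "term2_index"))
  .setOutputCols(Array("term1_onehot", "term2_onehot"))
  .setDropLast(true) // one flag for all of this estimator's input columns
```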
[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19872 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85947/ Test PASSed.
[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19872 Merged build finished. Test PASSed.
[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19872 **[Test build #85947 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85947/testReport)** for PR 19872 at commit [`cb36227`](https://github.com/apache/spark/commit/cb362274711c1b26ed19e87aa15bc8c64668eae6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20229: [SPARK-23037][ML] Update RFormula to use VectorSizeHint ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20229 Actually, I think this can be separated into two different PRs: one for OneHotEncoderEstimator and one for VectorSizeHint.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160877051

--- Diff: python/pyspark/sql/context.py ---
@@ -578,6 +606,9 @@ def __init__(self, sqlContext):
     def register(self, name, f, returnType=StringType()):
         return self.sqlContext.registerFunction(name, f, returnType)

+    def registerUDF(self, name, f):
--- End diff --

Yup, +1 like https://github.com/apache/spark/pull/20217/files/f25669a4b6c2298359df1b9083037468652cd141#r160861434, but how about checking and doing this in batch? Seems we should fix the doctests too.
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20222 **[Test build #85952 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85952/testReport)** for PR 20222 at commit [`5eded03`](https://github.com/apache/spark/commit/5eded033a0b352e7a799c7890131d8075475c8ff).
[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20072 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85949/ Test FAILed.
[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20072 Merged build finished. Test FAILed.
[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20072 **[Test build #85949 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85949/testReport)** for PR 20072 at commit [`5230081`](https://github.com/apache/spark/commit/523008192cb1476a68cf3808f46574598fcc6d2d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160876638

--- Diff: python/pyspark/sql/context.py ---
@@ -203,18 +203,46 @@ def registerFunction(self, name, f, returnType=StringType()):
         >>> _ = sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
         >>> sqlContext.sql("SELECT stringLengthInt('test')").collect()
         [Row(stringLengthInt(test)=4)]
+        """
+        return self.sparkSession.catalog.registerFunction(name, f, returnType)
+
+    @ignore_unicode_prefix
+    @since(2.3)
+    def registerUDF(self, name, f):
--- End diff --

And also, it seems we need to fix the examples in this case too: `spark` -> `sqlContext`.
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20222 Thanks for your review!
[GitHub] spark issue #20229: [SPARK-23037][ML] Update RFormula to use VectorSizeHint ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20229 we need to get it to kick off R tests - could you touch one of the files under R/? also please update PR to include [SPARKR]
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160873545

--- Diff: python/pyspark/sql/context.py ---
@@ -203,18 +203,46 @@ def registerFunction(self, name, f, returnType=StringType()):
         >>> _ = sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
         >>> sqlContext.sql("SELECT stringLengthInt('test')").collect()
         [Row(stringLengthInt(test)=4)]
+        """
+        return self.sparkSession.catalog.registerFunction(name, f, returnType)
+
+    @ignore_unicode_prefix
+    @since(2.3)
+    def registerUDF(self, name, f):
+        """Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL
+        statement.
+
+        :param name: name of the UDF
+        :param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or
+            scalar vectorized. Grouped vectorized UDFs are not supported.
+        :return: a wrapped :class:`UserDefinedFunction`
+
+        >>> from pyspark.sql.types import IntegerType
+        >>> from pyspark.sql.functions import udf
+        >>> slen = udf(lambda s: len(s), IntegerType())
+        >>> _ = sqlContext.udf.registerUDF("slen", slen)
+        >>> sqlContext.sql("SELECT slen('test')").collect()
+        [Row(slen(test)=4)]

         >>> import random
         >>> from pyspark.sql.functions import udf
-        >>> from pyspark.sql.types import IntegerType, StringType
+        >>> from pyspark.sql.types import IntegerType
         >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
-        >>> newRandom_udf = sqlContext.registerFunction("random_udf", random_udf, StringType())
+        >>> newRandom_udf = sqlContext.registerUDF("random_udf", random_udf)
         >>> sqlContext.sql("SELECT random_udf()").collect()  # doctest: +SKIP
-        [Row(random_udf()=u'82')]
+        [Row(random_udf()=82)]
         >>> sqlContext.range(1).select(newRandom_udf()).collect()  # doctest: +SKIP
-        [Row(random_udf()=u'62')]
+        [Row(random_udf()=62)]
+
+        >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
+        >>> @pandas_udf("integer", PandasUDFType.SCALAR)  # doctest: +SKIP
+        ... def add_one(x):
+        ...     return x + 1
+        ...
+        >>> _ = sqlContext.udf.registerUDF("add_one", add_one)  # doctest: +SKIP
+        >>> sqlContext.sql("SELECT add_one(id) FROM range(10)").collect()  # doctest: +SKIP
--- End diff --

ditto.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160872774

--- Diff: python/pyspark/sql/catalog.py ---
@@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()):
         >>> _ = spark.udf.register("stringLengthInt", len, IntegerType())
         >>> spark.sql("SELECT stringLengthInt('test')").collect()
         [Row(stringLengthInt(test)=4)]
+        """
+
+        # This is to check whether the input function is a wrapped/native UserDefinedFunction
+        if hasattr(f, 'asNondeterministic'):
+            raise ValueError("Please use registerUDF for registering UDF. The expected function of "
+                             "registerFunction is a Python function (including lambda function)")
+        udf = UserDefinedFunction(f, returnType=returnType, name=name,
+                                  evalType=PythonEvalType.SQL_BATCHED_UDF)
+        self._jsparkSession.udf().registerPython(name, udf._judf)
+        return udf._wrapped()
+
+    @ignore_unicode_prefix
+    @since(2.3)
+    def registerUDF(self, name, f):
+        """Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL
+        statement.
+
+        :param name: name of the UDF
+        :param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or
+            scalar vectorized. Grouped vectorized UDFs are not supported.
+        :return: a wrapped :class:`UserDefinedFunction`
+
+        >>> from pyspark.sql.types import IntegerType
+        >>> from pyspark.sql.functions import udf
+        >>> slen = udf(lambda s: len(s), IntegerType())
+        >>> _ = spark.udf.registerUDF("slen", slen)
+        >>> spark.sql("SELECT slen('test')").collect()
+        [Row(slen(test)=4)]

         >>> import random
         >>> from pyspark.sql.functions import udf
-        >>> from pyspark.sql.types import IntegerType, StringType
+        >>> from pyspark.sql.types import IntegerType
         >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
-        >>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType())
+        >>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf)
         >>> spark.sql("SELECT random_udf()").collect()  # doctest: +SKIP
-        [Row(random_udf()=u'82')]
+        [Row(random_udf()=82)]
         >>> spark.range(1).select(newRandom_udf()).collect()  # doctest: +SKIP
-        [Row(random_udf()=u'62')]
+        [Row(random_udf()=62)]
+
+        >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
+        >>> @pandas_udf("integer", PandasUDFType.SCALAR)  # doctest: +SKIP
+        ... def add_one(x):
+        ...     return x + 1
+        ...
+        >>> _ = spark.udf.registerUDF("add_one", add_one)  # doctest: +SKIP
+        >>> spark.sql("SELECT add_one(id) FROM range(10)").collect()  # doctest: +SKIP
--- End diff --

Although this will be skipped, we should show the result, like `[Row(add_one(id)=1), Row(add_one(id)=2), ...]`.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160873927

--- Diff: python/pyspark/sql/context.py ---
@@ -578,6 +606,9 @@ def __init__(self, sqlContext):
     def register(self, name, f, returnType=StringType()):
         return self.sqlContext.registerFunction(name, f, returnType)

+    def registerUDF(self, name, f):
--- End diff --

Maybe we can add a doc here via `registerUDF.__doc__ = SQLContext.registerUDF.__doc__`, similar to the doc for `register`. (We should do it for `registerJavaFunction` and `registerJavaUDAF`, too?)
[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20192#discussion_r160874070

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/executor/Dockerfile ---
@@ -1,35 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-FROM spark-base
-
-# Before building the docker image, first build and make a Spark distribution following
-# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
-# If this docker file is being used in the context of building your images from a Spark
-# distribution, the docker build command should be invoked from the top level directory
-# of the Spark distribution. E.g.:
-# docker build -t spark-executor:latest -f kubernetes/dockerfiles/executor/Dockerfile .
-
-COPY examples /opt/spark/examples
-
-CMD SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && \
-    env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt && \
-    readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt && \
-    if ! [ -z ${SPARK_MOUNTED_CLASSPATH}+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && \
-    if ! [ -z ${SPARK_EXECUTOR_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXECUTOR_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && \
--- End diff --

To clarify, by that I mean we no longer have the ability to customize different classpaths for the executor and the driver. For reference, see `spark.driver.extraClassPath` vs `spark.executor.extraClassPath`.
[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20192#discussion_r160874300

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh ---
@@ -0,0 +1,97 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# echo commands to the terminal output
+set -ex
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+uidentry=$(getent passwd $myuid)
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+    if [ -w /etc/passwd ] ; then
+        echo "$myuid:x:$myuid:$mygid:anonymous uid:$SPARK_HOME:/bin/false" >> /etc/passwd
+    else
+        echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+    fi
+fi
+
+SPARK_K8S_CMD="$1"
+if [ -z "$SPARK_K8S_CMD" ]; then
+  echo "No command to execute has been provided." 1>&2
--- End diff --

We can revisit this when we have proper client support. Overriding the entrypoint won't do if I want everything else set (e.g. SPARK_CLASSPATH).
[GitHub] spark issue #20195: [SPARK-22972][SQL] Couldn't find corresponding Hive SerD...
Github user xubo245 commented on the issue: https://github.com/apache/spark/pull/20195 ok @dongjoon-hyun
[GitHub] spark pull request #20195: [SPARK-22972][SQL] Couldn't find corresponding Hi...
Github user xubo245 closed the pull request at: https://github.com/apache/spark/pull/20195
[GitHub] spark issue #20229: Update RFormula to use VectorSizeHint & OneHotEncoderEst...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20229 **[Test build #85950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85950/testReport)** for PR 20229 at commit [`7bad275`](https://github.com/apache/spark/commit/7bad275cd995d22d70a42a5b9932073bc3a5d1e5).
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20214 **[Test build #85951 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85951/testReport)** for PR 20214 at commit [`9cf9954`](https://github.com/apache/spark/commit/9cf995461990e46c007405344481cd802a0d6501).
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20206 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85946/ Test PASSed.
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20206 Merged build finished. Test PASSed.
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20206 **[Test build #85946 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85946/testReport)** for PR 20206 at commit [`8c91ff9`](https://github.com/apache/spark/commit/8c91ff9909fecaa36a79397e36edf64d83adac6b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20229: Update RFormula to use VectorSizeHint & OneHotEnc...
GitHub user MrBago opened a pull request: https://github.com/apache/spark/pull/20229

Update RFormula to use VectorSizeHint & OneHotEncoderEstimator.

## What changes were proposed in this pull request?

RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid using the deprecated OneHotEncoder & to ensure the model produced can be used in streaming.

## How was this patch tested?

Unit tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MrBago/spark rFormula

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20229.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20229

commit 7bad275cd995d22d70a42a5b9932073bc3a5d1e5
Author: Bago Amirbekian
Date: 2018-01-11T05:45:27Z

    Update RFormula to use VectorSizeHint & OneHotEncoderEstimator.
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20087 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85944/ Test PASSed.
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20087 Merged build finished. Test PASSed.
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20087 **[Test build #85944 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85944/testReport)** for PR 20087 at commit [`4b89b44`](https://github.com/apache/spark/commit/4b89b44b2b4a9a1ffee01da13a3e55cdd3aa260d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20214 Merged build finished. Test FAILed.
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20214 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85948/ Test FAILed.
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20214 **[Test build #85948 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85948/testReport)** for PR 20214 at commit [`cbccb1b`](https://github.com/apache/spark/commit/cbccb1b3a4f4220730c8e6260dda0e552c6c044b).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20194: [SPARK-22999][SQL]'show databases like command' c...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/20194#discussion_r160869162

--- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
@@ -141,7 +141,7 @@ statement
         (LIKE? pattern=STRING)?                                        #showTables
     | SHOW TABLE EXTENDED ((FROM | IN) db=identifier)?
         LIKE pattern=STRING partitionSpec?                             #showTable
-    | SHOW DATABASES (LIKE pattern=STRING)?                            #showDatabases
+    | SHOW DATABASES (LIKE? pattern=STRING)?                           #showDatabases
--- End diff --

No, I just saw that `LIKE` can be omitted in `SHOW TABLES`, so I think it can also be omitted in `SHOW DATABASES`. With `LIKE` optional, the command is more convenient to use.
[GitHub] spark issue #20195: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20195 @xubo245. Since it's merged, could you close your PR now? For PRs against old branches like this, we need to close them manually.
[GitHub] spark issue #20100: [SPARK-22913][SQL] Improved Hive Partition Pruning
Github user ameent commented on the issue: https://github.com/apache/spark/pull/20100 @srowen can you help find someone to review this?
[GitHub] spark issue #20100: [SPARK-22913][SQL] Improved Hive Partition Pruning
Github user ameent commented on the issue: https://github.com/apache/spark/pull/20100 Any updates on this?
[GitHub] spark pull request #20194: [SPARK-22999][SQL]'show databases like command' c...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20194#discussion_r160867737

--- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
@@ -141,7 +141,7 @@ statement
         (LIKE? pattern=STRING)?                                        #showTables
     | SHOW TABLE EXTENDED ((FROM | IN) db=identifier)?
         LIKE pattern=STRING partitionSpec?                             #showTable
-    | SHOW DATABASES (LIKE pattern=STRING)?                            #showDatabases
+    | SHOW DATABASES (LIKE? pattern=STRING)?                           #showDatabases
--- End diff --

@guoxiaolongzte. MySQL also doesn't work like that. Do you have any reference for that syntax?
[GitHub] spark pull request #20227: [SPARK-23035] Fix warning: TEMPORARY TABLE ... US...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20227#discussion_r160867160

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -814,7 +814,7 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
     withTempView("tab1") {
       sql(
         """
-          |CREATE TEMPORARY TABLE tab1
--- End diff --

For this one, I think we should keep the test coverage because we still support this legacy syntax. Could you revert this kind of change?
[GitHub] spark pull request #20227: [SPARK-23035] Fix warning: TEMPORARY TABLE ... US...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20227#discussion_r160867069

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala ---
@@ -33,6 +33,9 @@ class TableAlreadyExistsException(db: String, table: String)
 class TempTableAlreadyExistsException(table: String)
   extends AnalysisException(s"Temporary table '$table' already exists")

+class TempViewAlreadyExistsException(table: String)
+  extends AnalysisException(s"Temporary view '$table' already exists")
--- End diff --

+1 for adding this.
[GitHub] spark issue #20228: [SPARK-23036] Add withGlobalTempView for testing and cor...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20228 BTW, please put '[SQL]' into your title. Then, your PR will be listed under the SQL category here: https://spark-prs.appspot.com/open-prs
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20226 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85945/ Test PASSed.
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20226 Merged build finished. Test PASSed.
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20226 **[Test build #85945 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85945/testReport)** for PR 20226 at commit [`6271804`](https://github.com/apache/spark/commit/62718040b1403dd796315bded9e47983f0ba9d6f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20228: [SPARK-23036] Add withGlobalTempView for testing ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20228#discussion_r160866670

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/GlobalTempViewSuite.scala ---
@@ -140,8 +140,8 @@ class GlobalTempViewSuite extends QueryTest with SharedSQLContext {
       assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == Seq("v1", "v2"))
     } finally {
-      spark.catalog.dropTempView("v1")
-      spark.catalog.dropGlobalTempView("v2")
+      spark.catalog.dropGlobalTempView("v1")
+      spark.catalog.dropTempView("v2")
--- End diff --

Hi, @xubo245. Could you split the bug fix and the new improvement into two separate PRs? Then, your PR will get reviews more easily.
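As background for the PR title, a `withGlobalTempView` helper would presumably mirror the existing `withTempView` in `SQLTestUtils`. A minimal sketch (an assumption; the PR's actual helper may differ):

```scala
// Runs `f`, then drops the named global temp views even if `f` fails.
protected def withGlobalTempView(viewNames: String*)(f: => Unit): Unit = {
  try f finally {
    viewNames.foreach(spark.catalog.dropGlobalTempView)
  }
}
```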
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160864450

--- Diff: python/pyspark/sql/context.py ---
@@ -203,18 +203,46 @@ def registerFunction(self, name, f, returnType=StringType()):
         >>> _ = sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
         >>> sqlContext.sql("SELECT stringLengthInt('test')").collect()
         [Row(stringLengthInt(test)=4)]
+        """
+        return self.sparkSession.catalog.registerFunction(name, f, returnType)
+
+    @ignore_unicode_prefix
+    @since(2.3)
+    def registerUDF(self, name, f):
--- End diff --

Ok. Sounds good to me.
[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20211#discussion_r160864143

--- Diff: python/pyspark/sql/group.py ---
@@ -233,6 +233,27 @@ def apply(self, udf):
         | 2| 1.1094003924504583|
         +---+---+

+        Notes on grouping column:
--- End diff --

Yeah. To be honest, I don't think there is a behavior that is both simple and works well with all use cases. It's probably a matter of leaning towards simpler behavior that doesn't work well in some cases, or towards somewhat "magic" behavior. I don't think there is an obvious answer here.

Another option is to always prepend grouping columns; if users want to return grouping columns in the UDF output, they can do a `drop` after `groupby apply`:
```
@pandas_udf('id int, v double', GROUP_MAP)
def foo(pdf):
    return pdf.assign(v=pdf.v+1)

df.groupby('id').apply(foo).drop(df.id)
```
I don't think it's too annoying to add a `drop` after `apply`, and it works well with the linear regression case. This is also a pretty straightforward, non-magical behavior. What do you all think?
[GitHub] spark pull request #20151: [SPARK-22959][PYTHON] Configuration to select the...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20151#discussion_r160864049

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -34,17 +34,39 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String

   import PythonWorkerFactory._

-  // Because forking processes from Java is expensive, we prefer to launch a single Python daemon
-  // (pyspark/daemon.py) and tell it to fork new workers for our tasks. This daemon currently
-  // only works on UNIX-based systems now because it uses signals for child management, so we can
-  // also fall back to launching workers (pyspark/worker.py) directly.
+  // Because forking processes from Java is expensive, we prefer to launch a single Python daemon,
+  // pyspark/daemon.py (by default) and tell it to fork new workers for our tasks. This daemon
+  // currently only works on UNIX-based systems now because it uses signals for child management,
+  // so we can also fall back to launching workers, pyspark/worker.py (by default) directly.
   val useDaemon = {
     val useDaemonEnabled = SparkEnv.get.conf.getBoolean("spark.python.use.daemon", true)

     // This flag is ignored on Windows as it's unable to fork.
     !System.getProperty("os.name").startsWith("Windows") && useDaemonEnabled
   }

+  // WARN: Both configurations, 'spark.python.daemon.module' and 'spark.python.worker.module' are
+  // for very advanced users and they are experimental. This should be considered
+  // as expert-only option, and shouldn't be used before knowing what it means exactly.
+
+  // This configuration indicates the module to run the daemon to execute its Python workers.
+  val daemonModule = SparkEnv.get.conf.getOption("spark.python.daemon.module").map { value =>
--- End diff --

Hm, actually we could also check if it's an empty string. But I wrote "shouldn't be used before knowing what it means exactly." above, so I think it's fine.
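For illustration, the extra check being floated might look like the sketch below; this is an assumption layered on the quoted code, which currently only maps over the `Option`:

```scala
// Reject "spark.python.daemon.module" values that are set but blank.
val daemonModule = SparkEnv.get.conf.getOption("spark.python.daemon.module").map { value =>
  require(value.trim.nonEmpty,
    "spark.python.daemon.module must be a non-empty module name when it is set.")
  value
}.getOrElse("pyspark.daemon")
```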
[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20072 **[Test build #85949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85949/testReport)** for PR 20072 at commit [`5230081`](https://github.com/apache/spark/commit/523008192cb1476a68cf3808f46574598fcc6d2d).
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160863048

--- Diff: python/pyspark/sql/tests.py ---
@@ -4085,33 +4091,50 @@ def test_vectorized_udf_timestamps_respect_session_timezone(self):
             finally:
                 self.spark.conf.set("spark.sql.session.timeZone", orig_tz)

-    def test_nondeterministic_udf(self):
+    def test_nondeterministic_vectorized_udf(self):
         # Test that nondeterministic UDFs are evaluated only once in chained UDF evaluations
         from pyspark.sql.functions import udf, pandas_udf, col

         @pandas_udf('double')
         def plus_ten(v):
             return v + 10
-        random_udf = self.random_udf
+        random_udf = self.nondeterministic_vectorized_udf

         df = self.spark.range(10).withColumn('rand', random_udf(col('id')))
         result1 = df.withColumn('plus_ten(rand)', plus_ten(df['rand'])).toPandas()

         self.assertEqual(random_udf.deterministic, False)
         self.assertTrue(result1['plus_ten(rand)'].equals(result1['rand'] + 10))

-    def test_nondeterministic_udf_in_aggregate(self):
+    def test_nondeterministic_vectorized_udf_in_aggregate(self):
         from pyspark.sql.functions import pandas_udf, sum

         df = self.spark.range(10)
-        random_udf = self.random_udf
+        random_udf = self.nondeterministic_vectorized_udf

         with QuietTest(self.sc):
             with self.assertRaisesRegexp(AnalysisException, 'nondeterministic'):
                 df.groupby(df.id).agg(sum(random_udf(df.id))).collect()
             with self.assertRaisesRegexp(AnalysisException, 'nondeterministic'):
                 df.agg(sum(random_udf(df.id))).collect()

+    def test_register_vectorized_udf_basic(self):
+        from pyspark.rdd import PythonEvalType
+        from pyspark.sql.functions import pandas_udf, col, expr
+        df = self.spark.range(10).select(
+            col('id').cast('int').alias('a'),
+            col('id').cast('int').alias('b'))
+        originalAdd = pandas_udf(lambda x, y: x + y, IntegerType())
--- End diff --

`originalAdd` -> `original_add`
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160863023

--- Diff: python/pyspark/sql/tests.py ---
@@ -4085,33 +4091,50 @@ def test_vectorized_udf_timestamps_respect_session_timezone(self):
(same hunk as above, continuing)
+    def test_register_vectorized_udf_basic(self):
+        from pyspark.rdd import PythonEvalType
+        from pyspark.sql.functions import pandas_udf, col, expr
+        df = self.spark.range(10).select(
+            col('id').cast('int').alias('a'),
+            col('id').cast('int').alias('b'))
+        originalAdd = pandas_udf(lambda x, y: x + y, IntegerType())
+        self.assertEqual(originalAdd.deterministic, True)
+        self.assertEqual(originalAdd.evalType, PythonEvalType.SQL_PANDAS_SCALAR_UDF)
+        newAdd = self.spark.catalog.registerUDF("add1", originalAdd)
--- End diff --

`newAdd` -> `new_add`
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160862588 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): --- End diff -- BTW, wouldn't it break compatibility? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20225 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20225 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85939/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20211#discussion_r160862182 --- Diff: python/pyspark/sql/tests.py --- @@ -3995,23 +3995,49 @@ def test_coerce(self): self.assertFramesEqual(expected, result) def test_complex_groupby(self): +import pandas as pd from pyspark.sql.functions import pandas_udf, col, PandasUDFType df = self.data +pdf = df.toPandas() @pandas_udf( -'id long, v int, norm double', +'v int, v2 double', PandasUDFType.GROUP_MAP ) -def normalize(pdf): +def foo(pdf): v = pdf.v -return pdf.assign(norm=(v - v.mean()) / v.std()) - -result = df.groupby(col('id') % 2 == 0).apply(normalize).sort('id', 'v').toPandas() -pdf = df.toPandas() -expected = pdf.groupby(pdf['id'] % 2 == 0).apply(normalize.func) -expected = expected.sort_values(['id', 'v']).reset_index(drop=True) -expected = expected.assign(norm=expected.norm.astype('float64')) -self.assertFramesEqual(expected, result) +return pd.DataFrame({'v': v + 1, 'v2': v - v.mean()})[:] --- End diff -- This is just to simplify the test - pandas has very complicated behavior when it comes to the index of the return value of `groupby apply`. If you are interested, take a look at http://nbviewer.jupyter.org/gist/mbirdi/05f8a83d340476e5f03a --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
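To make that index quirk concrete, here is a small standalone pandas sketch (exact behavior varies across pandas versions, so treat the comments as indicative rather than definitive): returning the group object unchanged keeps the original flat index, while returning a freshly built DataFrame makes pandas prepend the group key as a MultiIndex level, which is why the test above forces a copy with `[:]`.

```python
import pandas as pd

pdf = pd.DataFrame({'id': [1, 1, 2, 2], 'v': [1.0, 2.0, 3.0, 4.0]})

# Returning the group unchanged: pandas treats this like a transform,
# so the result keeps the original flat index 0..3.
r1 = pdf.groupby('id').apply(lambda g: g)

# Returning a new DataFrame: pandas prepends the group key, so the
# result has a MultiIndex of (id, original row label).
r2 = pdf.groupby('id').apply(
    lambda g: pd.DataFrame({'v2': g.v - g.v.mean()}))

print(r1.index)  # flat index: 0, 1, 2, 3
print(r2.index)  # MultiIndex over (id, row label)
```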
[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20225 **[Test build #85939 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85939/testReport)** for PR 20225 at commit [`1bf613f`](https://github.com/apache/spark/commit/1bf613f2162ad07289d99c7ef3cbd0a7e2b73558).
* This patch **fails from timeout after a configured wait of `250m`**.
* This patch merges cleanly.
* This patch adds no public classes.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160862123 --- Diff: python/pyspark/sql/context.py --- @@ -203,18 +203,46 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType()) >>> sqlContext.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" +return self.sparkSession.catalog.registerFunction(name, f, returnType) + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): --- End diff -- I think that's ideally right, but let's keep it consistent with the convention used here. Let's discuss it separately and do it in one batch. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20204

> @icexelloss, for #20204 (comment), yes, that's the way I usually use too. My worry, though, is whether this is a proper official way to do it, because I have been thinking this approach is rather meant to be internal.

Sounds good to me. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160861750 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): --- End diff -- Yea, but the wrapped UDF is also callable .. unfortunately :(. I suggested this way for this reason: in `udf.py`

```diff
+        wrapper._unwrapped = lambda: self
         return wrapper
```

and then

```
if hasattr(f, "_unwrapped"):
    f = f._unwrapped()
if isinstance(f, UserDefinedFunction):
    ...
else:
    ...
```

but it was not a strong opinion. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
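For illustration, here is a self-contained sketch of that dispatch idea (the class and function below are simplified stand-ins, not the actual pyspark implementation): the wrapper carries an `_unwrapped` breadcrumb back to the UDF object, so a registration routine can tell a wrapped UDF apart from a plain Python function even though both are callable.

```python
import functools

class SimpleUDF(object):
    """Simplified stand-in for pyspark's UserDefinedFunction."""
    def __init__(self, func, returnType):
        self.func = func
        self.returnType = returnType

    def _wrapped(self):
        @functools.wraps(self.func)
        def wrapper(*args):
            return self.func(*args)
        # Breadcrumb back to the UDF object: this is what hasattr-based
        # dispatch keys on, since the wrapper itself is a plain callable.
        wrapper._unwrapped = lambda: self
        return wrapper

def register(name, f):
    if hasattr(f, '_unwrapped'):
        f = f._unwrapped()
    if isinstance(f, SimpleUDF):
        return 'registering %s from an existing UDF object' % name
    return 'wrapping plain Python function as %s' % name

wrapped = SimpleUDF(lambda x: x + 1, 'int')._wrapped()
print(register('plus_one', wrapped))          # takes the _unwrapped() path
print(register('plus_two', lambda x: x + 2))  # plain-function path
```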
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/20226 @dongjoon-hyun : I have updated the PR description --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160861434 --- Diff: python/pyspark/sql/context.py --- @@ -203,18 +203,46 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType()) >>> sqlContext.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" +return self.sparkSession.catalog.registerFunction(name, f, returnType) + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): --- End diff -- I prefer not to duplicate the docstring. Maybe we can put the docstring in the user-facing API (I think it's the SQLContext one?) and reference it from the other one. Or, maybe we can do something like this if we want both docstrings:

```
@ignore_unicode_prefix
@since(2.3)
def registerUDF(self, name, f):
    return self.sparkSession.catalog.registerUDF(name, f)

registerUDF.__doc__ = pyspark.sql.catalog.Catalog.registerUDF.__doc__
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
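As a standalone illustration of that docstring-forwarding idea (the class and method names below are hypothetical stand-ins), assigning `__doc__` inside the class body reuses one canonical docstring without duplicating its text:

```python
class Catalog(object):
    def registerUDF(self, name, f):
        """Registers a UDF so it can be used in SQL statements.

        :param name: name of the UDF in SQL statements
        :param f: a UDF object returned by ``udf`` or ``pandas_udf``
        """
        pass

class SQLContext(object):
    def __init__(self, catalog):
        self._catalog = catalog

    def registerUDF(self, name, f):
        return self._catalog.registerUDF(name, f)

    # Point the delegating method at the canonical docstring instead of
    # copying its text; help() and doc generators both pick this up.
    registerUDF.__doc__ = Catalog.registerUDF.__doc__
```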
[GitHub] spark pull request #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20204#discussion_r160860487 --- Diff: python/test_coverage/sitecustomize.py --- @@ -0,0 +1,19 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import coverage +coverage.process_startup() --- End diff -- Yup, will add some comments. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
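For readers unfamiliar with the mechanism those two lines rely on: `coverage.process_startup()` silently does nothing unless the `COVERAGE_PROCESS_START` environment variable points at a coverage configuration file, and the interpreter imports `sitecustomize` automatically at startup when its directory is on `PYTHONPATH`, which is what lets coverage follow the Python worker processes Spark forks. A rough sketch of wiring this up (the script name and paths are illustrative assumptions, not the PR's actual run scripts):

```python
import os
import subprocess

env = dict(os.environ)
# coverage.process_startup() only activates when this variable points
# at a coverage configuration file; otherwise it is a no-op.
env['COVERAGE_PROCESS_START'] = '/path/to/.coveragerc'  # illustrative path
# Put the directory containing the two-line sitecustomize.py first on
# PYTHONPATH so every spawned interpreter imports it at startup.
env['PYTHONPATH'] = '/path/to/test_coverage' + os.pathsep + env.get('PYTHONPATH', '')

# Any Python entry point run with this environment (and its forked
# children) now records coverage data.  Hypothetical test command:
subprocess.check_call(['python', 'run_my_tests.py'], env=env)
```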
[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20211#discussion_r160860320 --- Diff: python/pyspark/sql/group.py --- @@ -233,6 +233,27 @@ def apply(self, udf): | 2| 1.1094003924504583| +---+---+ +Notes on grouping column: --- End diff -- Yup, I saw this use case as described in the JIRA, and I get that the specific case can be simplified; however, I am not sure it's straightforward for end users. For example, if I use `pandas_udf`, I would simply expect the return schema to match what is described in `returnType`. `pandas_udf` already needs some background, and I think we should keep it as simple as we can. Making a guarantee about grouping columns might be convenient in some cases, but it could also feel like magic happening inside. I would prefer to let the UDF specify the grouping columns, to keep this more straightforward. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
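As a concrete picture of the alternative preferred here, where the UDF spells out its grouping columns rather than Spark prepending them, a minimal sketch in the grouped-map style used in this thread's diffs (assumes a DataFrame `df` with a long column `id` and a double column `v`; `GROUP_MAP` is the enum name in the 2.3-era code under review):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# The returnType below lists every output column explicitly, grouping
# column included, so the result schema is exactly what is declared
# and nothing is prepended implicitly.
@pandas_udf('id long, v_centered double', PandasUDFType.GROUP_MAP)
def center(pdf):
    return pd.DataFrame({'id': pdf.id, 'v_centered': pdf.v - pdf.v.mean()})

df.groupby('id').apply(center).show()
```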
[GitHub] spark pull request #19885: [SPARK-22587] Spark job fails if fs.defaultFS and...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19885 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160859938 --- Diff: python/pyspark/sql/tests.py --- @@ -4147,6 +4170,21 @@ def test_simple(self): expected = df.toPandas().groupby('id').apply(foo_udf.func).reset_index(drop=True) self.assertFramesEqual(expected, result) +def test_registerGroupMapUDF(self): --- End diff -- nit: test_register_group_map_udf --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19885 Let me merge to master and branch-2.3. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160858810 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): --- End diff -- It might be better to check that `f` is (1) callable and (2) not a UDF object. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160858570 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or + scalar vectorized. Grouped vectorized UDFs are not supported. +:return: a wrapped :class:`UserDefinedFunction` + +>>> from pyspark.sql.types import IntegerType +>>> from pyspark.sql.functions import udf +>>> slen = udf(lambda s: len(s), IntegerType()) +>>> _ = spark.udf.registerUDF("slen", slen) +>>> spark.sql("SELECT slen('test')").collect() +[Row(slen(test)=4)] >>> import random >>> from pyspark.sql.functions import udf ->>> from pyspark.sql.types import IntegerType, StringType +>>> from pyspark.sql.types import IntegerType >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic() ->>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType()) +>>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf) >>> spark.sql("SELECT random_udf()").collect() # doctest: +SKIP -[Row(random_udf()=u'82')] +[Row(random_udf()=82)] >>> spark.range(1).select(newRandom_udf()).collect() # doctest: +SKIP -[Row(random_udf()=u'62')] +[Row(random_udf()=62)] + +>>> from pyspark.sql.functions import pandas_udf, PandasUDFType +>>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP +... def add_one(x): +... return x + 1 +... +>>> _ = spark.udf.registerUDF("add_one", add_one) # doctest: +SKIP +>>> spark.sql("SELECT add_one(id) FROM range(10)").collect() # doctest: +SKIP """ # This is to check whether the input function is a wrapped/native UserDefinedFunction if hasattr(f, 'asNondeterministic'): -udf = UserDefinedFunction(f.func, returnType=returnType, name=name, - evalType=PythonEvalType.SQL_BATCHED_UDF, +if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, + PythonEvalType.SQL_PANDAS_SCALAR_UDF]: +raise ValueError( +"Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF") +udf = UserDefinedFunction(f.func, returnType=f.returnType, name=name, + evalType=f.evalType, deterministic=f.deterministic) else: -udf = UserDefinedFunction(f, returnType=returnType, name=name, - evalType=PythonEvalType.SQL_BATCHED_UDF) +raise TypeError("Please use registerFunction for registering a Python function " --- End diff -- +1. I think the user might not necessarily pass a python function here. 
Maybe an error message like "Invalid UDF: f must be an object returned by `udf` or `pandas_udf`". --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
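A minimal sketch of the validation being discussed (the helper name and message wording are made up; the eval-type constants are the ones quoted in the diffs above):

```python
from pyspark.rdd import PythonEvalType

def _check_registrable_udf(f):
    # Reject plain callables and anything else that is not a UDF object.
    if not hasattr(f, 'evalType'):
        raise TypeError(
            "Invalid UDF: f must be an object returned by `udf` or `pandas_udf`")
    # Only row-at-a-time and scalar vectorized UDFs may be registered for SQL.
    if f.evalType not in (PythonEvalType.SQL_BATCHED_UDF,
                          PythonEvalType.SQL_PANDAS_SCALAR_UDF):
        raise ValueError(
            "Invalid UDF: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF")
```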
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160858423 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or + scalar vectorized. Grouped vectorized UDFs are not supported. --- End diff -- I would probably say something like "The UDF can be either returned by `udf` or `pandas_udf` with type `SCALAR`" to be more specific --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20224 Do we always need to turn this on? It seems like this is debug info for developers? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20222 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85935/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20222 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160858034 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- Maybe something like "name of the UDF in SQL statement" to be more specific. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20222 **[Test build #85935 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85935/testReport)** for PR 20222 at commit [`9107f9f`](https://github.com/apache/spark/commit/9107f9f2d71610888fc4c0beac70a4a87c8348d9).
* This patch **fails from timeout after a configured wait of `250m`**.
* This patch merges cleanly.
* This patch adds no public classes.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20228: [SPARK-23036] Add withGlobalTempView for testing and cor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20228 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160857927 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or --- End diff -- I don't have a strong opinion on how to describe them, but I think it's easier for users to understand if the description is consistent. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160857817 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or --- End diff -- We should probably standardize how we describe the wrapped function object returned by `udf` and `pandas_udf` in param docstrings. Currently it's described differently: https://github.com/apache/spark/blob/master/python/pyspark/sql/group.py#L215 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20228: [SPARK-23036] Add withGlobalTempView for testing ...
GitHub user xubo245 opened a pull request: https://github.com/apache/spark/pull/20228 [SPARK-23036] Add withGlobalTempView for testing and correct some improper view-related method usage

## What changes were proposed in this pull request?
Add withGlobalTempView for creating global temp views, analogous to withTempView and withView, and correct some improper usage.

## How was this patch tested?
No new tests.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/xubo245/spark DropTempView Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20228.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20228 commit fffd109e8c084f9a4d63840bf761364f1ede5dc9 Author: xubo245 <601450868@...> Date: 2018-01-11T03:25:17Z [SPARK-23036] Add withGlobalTempView for testing and correct some improper view-related method usage Add withGlobalTempView for creating global temp views, analogous to withTempView and withView, and correct some improper usage. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
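The helper added by this PR is a Scala test utility; as a rough illustration of the create/run/always-drop pattern it proposes, here is a hypothetical PySpark analogue (the helper name is invented, and `spark` is assumed to be an active SparkSession):

```python
from contextlib import contextmanager

@contextmanager
def with_global_temp_view(spark, df, name):
    # Create the global temp view, hand control to the test body, and
    # guarantee the view is dropped even if the body raises.
    df.createGlobalTempView(name)
    try:
        yield
    finally:
        spark.catalog.dropGlobalTempView(name)

# Usage sketch:
# with with_global_temp_view(spark, spark.range(3), 'v'):
#     spark.sql('SELECT * FROM global_temp.v').show()
```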
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20226 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20226 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85943/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20226 **[Test build #85943 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85943/testReport)** for PR 20226 at commit [`6271804`](https://github.com/apache/spark/commit/62718040b1403dd796315bded9e47983f0ba9d6f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20223: [SPARK-23020][core] Fix races in launcher code, test.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20223 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20223: [SPARK-23020][core] Fix races in launcher code, test.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20223 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85937/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20223: [SPARK-23020][core] Fix races in launcher code, test.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20223 **[Test build #85937 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85937/testReport)** for PR 20223 at commit [`2018e29`](https://github.com/apache/spark/commit/2018e295f516d51e8c1661d3b77c31a029fdb006).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20214 **[Test build #85948 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85948/testReport)** for PR 20214 at commit [`cbccb1b`](https://github.com/apache/spark/commit/cbccb1b3a4f4220730c8e6260dda0e552c6c044b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20214 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20214 Is `org.apache.spark.sql.streaming.StreamingOuterJoinSuite` flaky? (This PR does not seem related to that test.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19872 **[Test build #85947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85947/testReport)** for PR 19872 at commit [`cb36227`](https://github.com/apache/spark/commit/cb362274711c1b26ed19e87aa15bc8c64668eae6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org