[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20210 Hi, @HyukjinKwon . It seems that branch-2.3 doesn't have this. Could you merge to branch-2.3, too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Merged build finished. Test PASSed.
[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20056 I have the same question as @jiangxb1987: what is the situation where you'd use this metric? The JIRA doesn't say either. It seems like existing metrics mostly cover this?
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85926/ Test PASSed.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85926 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85926/testReport)** for PR 20096 at commit [`f94b53e`](https://github.com/apache/spark/commit/f94b53e3ab7e37fdcb9f34cf7d1313a4905fa341).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19885 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85931/ Test PASSed.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19885 Merged build finished. Test PASSed.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19885 **[Test build #85931 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85931/testReport)** for PR 19885 at commit [`296a19f`](https://github.com/apache/spark/commit/296a19fc5b1881959c7cf52b3c6e33eb7fa12b57).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20223: [SPARK-23020][core] Fix races in launcher code, test.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20223 **[Test build #85929 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85929/testReport)** for PR 20223 at commit [`5139f60`](https://github.com/apache/spark/commit/5139f605904996667d8d97941172bbe9d366a579).
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19885 **[Test build #85931 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85931/testReport)** for PR 19885 at commit [`296a19f`](https://github.com/apache/spark/commit/296a19fc5b1881959c7cf52b3c6e33eb7fa12b57).
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20174 **[Test build #85930 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85930/testReport)** for PR 20174 at commit [`1d2771c`](https://github.com/apache/spark/commit/1d2771c99a6901061b2703ea34ab4ecda342af1c).
[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20192 If there's no more feedback I'll merge this later today.
[GitHub] spark issue #19041: [SPARK-21097][CORE] Add option to recover cached data
Github user brad-kaiser commented on the issue: https://github.com/apache/spark/pull/19041 Lol, no worries. Thanks.
[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
GitHub user vanzin opened a pull request: https://github.com/apache/spark/pull/20223 [SPARK-23020][core] Fix races in launcher code, test. The race in the code is because the handle might update its state to the wrong state if the connection handling thread is still processing incoming data; so the handle needs to wait for the connection to finish up before checking the final state. The race in the test is because when waiting for a handle to reach a final state, the waitFor() method needs to wait until all handle state is updated (which also includes waiting for the connection thread above to finish). Otherwise, waitFor() may return too early, which would cause a bunch of different races (like the listener not being yet notified of the state change, or being in the middle of being notified, or the handle not being properly disposed and causing postChecks() to assert). On top of that I found, by code inspection, a couple of potential races that could make a handle end up in the wrong state when being killed. Tested by running the existing unit tests a lot (and not seeing the errors I was seeing before). You can merge this pull request into a Git repository by running: $ git pull https://github.com/vanzin/spark SPARK-23020 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20223.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20223 commit 5139f605904996667d8d97941172bbe9d366a579 Author: Marcelo Vanzin Date: 2018-01-10T17:56:17Z [SPARK-23020][core] Fix races in launcher code, test.
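The synchronization pattern described in this PR (only trust a handle's final state after the connection-handling thread has finished) can be sketched in plain Python. This is a toy model with hypothetical names (`Handle`, `wait_for_final_state`), not the actual launcher code:

```python
import threading

class Handle:
    """Toy model of the ordering fix described above: the final state must
    only be read after the connection-handling thread has finished
    processing incoming data."""

    def __init__(self):
        self._state = "RUNNING"
        # Stand-in for the connection-handling thread that may still be
        # processing incoming state updates.
        self._conn = threading.Thread(target=self._process_incoming)
        self._conn.start()

    def _process_incoming(self):
        # The last state update arrives on the connection thread.
        self._state = "FINISHED"

    def wait_for_final_state(self):
        # The fix: join the connection thread before trusting the state.
        # Reading self._state without this join could observe the stale
        # intermediate "RUNNING" state.
        self._conn.join()
        return self._state

assert Handle().wait_for_final_state() == "FINISHED"
```

Without the `join()`, the assertion would fail intermittently, which is exactly the flaky-test shape the PR description reports.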
[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20013 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85924/ Test PASSed.
[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20013 Merged build finished. Test PASSed.
[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20013 **[Test build #85924 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85924/testReport)** for PR 20013 at commit [`34d02b2`](https://github.com/apache/spark/commit/34d02b2c05f42a7992de85b7d0c447bba738d2a8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19041: [SPARK-21097][CORE] Add option to recover cached data
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19041

> Is there anything else you need for this PR?

An extra day on my work week...
[GitHub] spark issue #19041: [SPARK-21097][CORE] Add option to recover cached data
Github user brad-kaiser commented on the issue: https://github.com/apache/spark/pull/19041 Hey @vanzin just wanted to check in on this. Is there anything else you need for this PR? Thanks!
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20218 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85925/ Test PASSed.
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20218 Merged build finished. Test PASSed.
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20218 **[Test build #85925 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85925/testReport)** for PR 20218 at commit [`7924e28`](https://github.com/apache/spark/commit/7924e28d26623a0ba0a7a67cb6994e9ee0220677).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20222 Merged build finished. Test FAILed.
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20222 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85923/ Test FAILed.
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20222 **[Test build #85923 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85923/testReport)** for PR 20222 at commit [`9107f9f`](https://github.com/apache/spark/commit/9107f9f2d71610888fc4c0beac70a4a87c8348d9).
* This patch **fails from timeout after a configured wait of `250m`**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/19872 @ueshin and @HyukjinKwon Thanks much for the review. I have addressed the latest comment. @ueshin I think a UDAF that supports partial aggregation can be built on top of this. The questions you asked about UDAF interfaces are very good. I haven't thought them through. I think it will probably end up being another API for defining pandas UDFs (with `update`, `merge` and `finalize` methods, for instance) and another physical plan. I think we can leave that for the future. Is there any specific issue with regard to UDAFs that support partial aggregation that you want to address here?
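The `update`/`merge`/`finalize` shape floated in the comment above can be illustrated with a plain-Python mean aggregator. This is a hypothetical sketch of what such an interface might look like, not an actual PySpark API:

```python
class PartialMean:
    """Hypothetical update/merge/finalize shape for a partial-aggregation
    UDAF, as floated in the discussion above; not a real PySpark API."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        # Fold one input row into this partition's partial state.
        self.count += 1
        self.total += value

    def merge(self, other):
        # Combine partial states computed independently on two partitions.
        self.count += other.count
        self.total += other.total

    def finalize(self):
        # Produce the final aggregate value from the merged state.
        return self.total / self.count if self.count else None

# Two "partitions" aggregated independently, then merged on one node.
left, right = PartialMean(), PartialMean()
for v in [1.0, 2.0]:
    left.update(v)
for v in [3.0, 4.0]:
    right.update(v)
left.merge(right)
assert left.finalize() == 2.5
```

Partial aggregation works because `merge` is associative over the partial states, so partitions can be combined in any order before `finalize` runs once.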
[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19872 **[Test build #85928 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85928/testReport)** for PR 19872 at commit [`cb36227`](https://github.com/apache/spark/commit/cb362274711c1b26ed19e87aa15bc8c64668eae6).
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r160779041 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala --- @@ -15,12 +15,30 @@ * limitations under the License. */ -package org.apache.spark.sql.execution.python +package org.apache.spark.sql.catalyst.expressions -import org.apache.spark.api.python.PythonFunction -import org.apache.spark.sql.catalyst.expressions.{Expression, NonSQLExpression, Unevaluable, UserDefinedExpression} +import org.apache.spark.api.python.{PythonEvalType, PythonFunction} +import org.apache.spark.sql.catalyst.util.toPrettySQL import org.apache.spark.sql.types.DataType +/** + * Helper functions for PythonUDF + */ +object PythonUDF { + def isScalarPythonUDF(e: Expression): Boolean = { +e.isInstanceOf[PythonUDF] && + Set( --- End diff -- Fixed.
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r160779007 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala --- @@ -0,0 +1,152 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.python + +import java.io.File + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.{SparkEnv, TaskContext} +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning} +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode} +import org.apache.spark.sql.types.{DataType, StructField, StructType} +import org.apache.spark.util.Utils + +/** + * Physical node for aggregation with group aggregate Pandas UDF. 
+ * + * This plan works by sending the necessary (projected) input grouped data as Arrow record batches + * to the python worker, the python worker invokes the UDF and sends the results to the executor, + * finally the executor evaluates any post-aggregation expressions and join the result with the + * grouped key. + */ +case class AggregateInPandasExec( +groupingExpressions: Seq[NamedExpression], +udfExpressions: Seq[PythonUDF], +resultExpressions: Seq[NamedExpression], +child: SparkPlan) + extends UnaryExecNode { + + override val output: Seq[Attribute] = resultExpressions.map(_.toAttribute) + + override def outputPartitioning: Partitioning = child.outputPartitioning + + override def producedAttributes: AttributeSet = AttributeSet(output) + + override def requiredChildDistribution: Seq[Distribution] = { +if (groupingExpressions.isEmpty) { + AllTuples :: Nil +} else { + ClusteredDistribution(groupingExpressions) :: Nil +} + } + + private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = { +udf.children match { + case Seq(u: PythonUDF) => +val (chained, children) = collectFunctions(u) +(ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children) + case children => +// There should not be any other UDFs, or the children can't be evaluated directly. 
+assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty)) +(ChainedPythonFunctions(Seq(udf.func)), udf.children) +} + } + + override def requiredChildOrdering: Seq[Seq[SortOrder]] = +Seq(groupingExpressions.map(SortOrder(_, Ascending))) + + override protected def doExecute(): RDD[InternalRow] = { +val inputRDD = child.execute() + +val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536) +val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true) +val sessionLocalTimeZone = conf.sessionLocalTimeZone +val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone + +val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip + +val allInputs = new ArrayBuffer[Expression] +val dataTypes = new ArrayBuffer[DataType] +val argOffsets = inputs.map { input => + input.map { e => +if (allInputs.exists(_.semanticEquals(e))) { + allInputs.indexWhere(_.semanticEquals(e)) +} else { + allInputs += e + dataTypes += e.dataType + allInputs.length - 1 +} + }.toArray +}.toArray + +val schema = StructType(dataTypes.zipWithIndex.map { case (dt, i) => + StructField(s"_$i", dt) +}) + +val input =
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r160778894 --- Diff: python/pyspark/sql/tests.py --- @@ -4052,6 +4045,386 @@ def test_unsupported_types(self): df.groupby('id').apply(f).collect() +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed") +class GroupbyAggTests(ReusedSQLTestCase): + +@property +def data(self): +from pyspark.sql.functions import array, explode, col, lit +return self.spark.range(10).toDF('id') \ +.withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \ +.withColumn("v", explode(col('vs'))) \ +.drop('vs') \ +.withColumn('w', lit(1.0)) + +@property +def plus_one(self): +from pyspark.sql.functions import udf + +@udf('double') +def plus_one(v): +assert isinstance(v, (int, float)) +return v + 1 +return plus_one + +@property +def plus_two(self): +import pandas as pd +from pyspark.sql.functions import pandas_udf, PandasUDFType + +@pandas_udf('double', PandasUDFType.SCALAR) +def plus_two(v): +assert isinstance(v, pd.Series) +return v + 2 +return plus_two + +@property +def mean_udf(self): +from pyspark.sql.functions import pandas_udf, PandasUDFType + +@pandas_udf('double', PandasUDFType.GROUP_AGG) +def mean_udf(v): +return v.mean() +return mean_udf + +@property +def sum_udf(self): +from pyspark.sql.functions import pandas_udf, PandasUDFType + +@pandas_udf('double', PandasUDFType.GROUP_AGG) +def sum_udf(v): +return v.sum() +return sum_udf + +@property +def weighted_mean_udf(self): +import numpy as np +from pyspark.sql.functions import pandas_udf, PandasUDFType + +@pandas_udf('double', PandasUDFType.GROUP_AGG) +def weighted_mean_udf(v, w): +return np.average(v, weights=w) +return weighted_mean_udf + +def test_basic(self): +from pyspark.sql.functions import col, lit, sum, mean + +df = self.data +weighted_mean_udf = self.weighted_mean_udf + +result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id') +expected1 = 
df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id') +self.assertPandasEqual(expected1.toPandas(), result1.toPandas()) + +result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\ +.sort(df.id + 1) +expected2 = df.groupby((col('id') + 1))\ +.agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1) +self.assertPandasEqual(expected2.toPandas(), result2.toPandas()) + +result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id') +expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id') +self.assertPandasEqual(expected3.toPandas(), result3.toPandas()) + +result4 = df.groupby((col('id') + 1).alias('id'))\ +.agg(weighted_mean_udf(df.v, df.w))\ +.sort('id') +expected4 = df.groupby((col('id') + 1).alias('id'))\ +.agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\ +.sort('id') +self.assertPandasEqual(expected4.toPandas(), result4.toPandas()) + +def test_array(self): +from pyspark.sql.types import ArrayType, DoubleType +from pyspark.sql.functions import pandas_udf, PandasUDFType + +with QuietTest(self.sc): +with self.assertRaisesRegexp(NotImplementedError, 'not supported'): +@pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG) +def mean_and_std_udf(v): +return [v.mean(), v.std()] + +def test_struct(self): +from pyspark.sql.functions import pandas_udf, PandasUDFType + +with QuietTest(self.sc): +with self.assertRaisesRegexp(NotImplementedError, 'not supported'): +@pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG) +def mean_and_std_udf(v): +return (v.mean(), v.std()) + +def test_alias(self): +from pyspark.sql.functions import mean + +df = self.data +mean_udf = self.mean_udf +
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r160778794 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -153,11 +153,20 @@ trait CheckAnalysis extends PredicateHelper { s"of type ${condition.dataType.simpleString} is not a boolean.") case Aggregate(groupingExprs, aggregateExprs, child) => +def isAggregateExpression(expr: Expression) = { + expr.isInstanceOf[AggregateExpression] || +PythonUDF.isGroupAggPandasUDF(expr) --- End diff -- Done.
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r160778766 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -271,9 +272,14 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { case PhysicalAggregation( namedGroupingExpressions, aggregateExpressions, rewrittenResultExpressions, child) => +require( + !aggregateExpressions.exists(PythonUDF.isGroupAggPandasUDF), + "Streaming aggregation doesn't support group aggregate pandas UDF" +) --- End diff -- Done
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160774506 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochCoordinator.scala --- @@ -39,6 +39,15 @@ private[continuous] sealed trait EpochCoordinatorMessage extends Serializable */ private[sql] case object IncrementAndGetEpoch extends EpochCoordinatorMessage +/** --- End diff -- looks good to me
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20222 Merged build finished. Test FAILed.
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20222 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85922/ Test FAILed.
[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160772523 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala --- @@ -198,6 +198,20 @@ class ReplaceOperatorSuite extends PlanTest { comparePlans(optimized, correctAnswer) } + test("add one grouping key if necessary when replace Deduplicate with Aggregate") { +val input = LocalRelation() +val query = Deduplicate(Seq.empty, input) // dropDuplicates() +val optimized = Optimize.execute(query.analyze) + +val correctAnswer = --- End diff -- nit: this can be all in one line
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20222 **[Test build #85922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85922/testReport)** for PR 20222 at commit [`3e1d6d6`](https://github.com/apache/spark/commit/3e1d6d6ab58411491f9620ee7e568d002759ea58).
* This patch **fails SparkR unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160770992 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala --- @@ -666,4 +665,16 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext { assert(exchangePlans.length == 1) } } + + Seq(true, false).foreach { codegen => +test("SPARK-22951: dropDuplicates on empty data frames should produce correct aggregate" + + s" results when codegen enabled: $codegen") { + withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, codegen.toString)) { +assert(Seq.empty[Int].toDF("a").count() == 0) +assert(Seq.empty[Int].toDF("a").agg(count("*")).count() == 1) +assert(spark.emptyDataFrame.dropDuplicates().count() == 0) + assert(spark.emptyDataFrame.dropDuplicates().agg(count("*")).count() == 1) --- End diff -- @liufengdb Maybe also add assertions to confirm that explicit global aggregations (by providing zero grouping keys) still return one row? For example:

```scala
val emptyAgg = Map.empty[String, String]

checkAnswer(
  spark.emptyDataFrame.agg(emptyAgg),
  Seq(Row())
)

checkAnswer(
  spark.emptyDataFrame.groupBy().agg(emptyAgg),
  Seq(Row())
)
```
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20222 Wait, dev/release-tag.sh does this automatically though. I just want to make sure we are not missing things (like R/DESCRIPTION). I think maybe we should run a subset of release-tag.
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user felixcheungu commented on the issue: https://github.com/apache/spark/pull/20222 @felixcheung
[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20153#discussion_r160764573 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import java.util.List; + +import org.apache.spark.annotation.InterfaceStability; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.vectorized.ColumnarBatch; + +/** + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this + * interface to output {@link ColumnarBatch} and make the scan faster. + */ +@InterfaceStability.Evolving +public interface SupportsScanColumnarBatch extends DataSourceV2Reader { + @Override + default List<ReadTask<Row>> createReadTasks() { +throw new IllegalStateException( + "createReadTasks should not be called with SupportsScanColumnarBatch."); + } + + /** + * Similar to {@link DataSourceV2Reader#createReadTasks()}, but returns columnar data in batches. + */ + List<ReadTask<ColumnarBatch>> createBatchReadTasks(); + + /** + * A safety door for columnar batch reader.
It's possible that the implementation can only support + * some certain columns with certain types. Users can overwrite this method and + * {@link #createReadTasks()} to fallback to normal read path under some conditions. + */ + default boolean enableBatchRead() { --- End diff -- If it controls batch mode or non-batch mode, I agree. IIUC, this value is used to show whether we enable to read data from column-oriented storage (e.g. `ColumnarVector`) or row-oriented storage (e.g. `UnsafeRow`). I feel that it is not a batch mode. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20174 I see, thanks for your answer @liancheng --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/20174 @mgaido91 We can't, because we do not know whether there are any input rows. For example:

```scala
val df1 = spark.range(10).select()
val df2 = spark.range(10).filter($"id" < 0).select()
val df3 = df1.dropDuplicates()
val df4 = df2.dropDuplicates()
```

`df1` has zero columns and ten rows while `df2` has zero columns and zero rows. Therefore, `df3` should return one row containing zero columns while `df4` should return zero rows. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
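The zero-column distinction can be modeled outside Spark with plain tuples; this is an illustrative sketch, not Spark code — a zero-column row is just the empty tuple, so ten such rows deduplicate to one row, while an empty input stays empty:

```python
# Model rows as tuples; a zero-column row is the empty tuple ().
df1_rows = [() for _ in range(10)]   # ten rows, zero columns
df2_rows = []                        # zero rows, zero columns

def drop_duplicates(rows):
    # Deduplicate while preserving first-seen order.
    seen, out = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

assert drop_duplicates(df1_rows) == [()]  # one zero-column row survives
assert drop_duplicates(df2_rows) == []    # empty input stays empty
```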
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20174 @mgaido91 Your proposal and the current approach are both one-line changes. Since the issue is actually related to the hash aggregate implementation, I think it is reasonable to include it in the `ReplaceDeduplicateWithAggregate` transformation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20174 **[Test build #85927 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85927/testReport)** for PR 20174 at commit [`8418536`](https://github.com/apache/spark/commit/8418536f29ee1375018d731c580caa9e3d418bd0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160757582

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---

```diff
@@ -1221,7 +1221,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
           Alias(new First(attr).toAggregateExpression(), attr.name)(attr.exprId)
         }
       }
-      Aggregate(keys, aggCols, child)
+      // SPARK-22951: the implementation of aggregate operator treats the cases with and without
+      // grouping keys differently, when there are no input rows. For the aggregation after
+      // `dropDuplicates()` on an empty data frame, a grouping key is added here to make sure the
+      // aggregate operator can work correctly (returning an empty iterator).
```

--- End diff --

ok, I like this one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
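The asymmetry behind this fix — a global aggregate emits one row even on empty input, while a grouped aggregate emits none — can be sketched generically; this is an illustrative model, not the Spark operator:

```python
def aggregate(rows, keys, agg):
    # Global aggregate (no grouping keys): exactly one output row,
    # even when the input is empty.
    if not keys:
        return [agg(rows)]
    # Grouped aggregate: one output row per distinct key, so empty
    # input yields an empty result (an empty iterator).
    groups = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        groups.setdefault(key, []).append(row)
    return [agg(group) for group in groups.values()]

assert aggregate([], keys=[], agg=len) == [0]    # one row on empty input
assert aggregate([], keys=["k"], agg=len) == []  # no rows on empty input
```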
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85926 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85926/testReport)** for PR 20096 at commit [`f94b53e`](https://github.com/apache/spark/commit/f94b53e3ab7e37fdcb9f34cf7d1313a4905fa341). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20096 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20218 **[Test build #85925 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85925/testReport)** for PR 20218 at commit [`7924e28`](https://github.com/apache/spark/commit/7924e28d26623a0ba0a7a67cb6994e9ee0220677). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20218 Since this is a flakiness issue, I retriggered it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20218 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20221: [SPARK-23019][Core] Wait until SparkContext.stop(...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20221 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18991: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18991 @gatorsmile . I don't have any numbers for PPD=false. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20219 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20219 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85921/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20219 **[Test build #85921 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85921/testReport)** for PR 20219 at commit [`3d04d4d`](https://github.com/apache/spark/commit/3d04d4dfd5b3d320ec494ec838dd299f654d6c98). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20221: [SPARK-23019][Core] Wait until SparkContext.stop() finis...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20221 Merging to master / 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20221: [SPARK-23019][Core] Wait until SparkContext.stop() finis...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20221 Oh, wait, yes, this should fix SPARK-23019 (since that's caused by the context being still alive). I'll keep investigating SPARK-23020 separately. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20153#discussion_r160747322

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java (excerpt; the quoted context is the same new file) ---

```java
  /**
   * A safety door for columnar batch reader. It's possible that the implementation can only
   * support some certain columns with certain types. Users can overwrite this method and
   * {@link #createReadTasks()} to fallback to normal read path under some conditions.
   */
  default boolean enableBatchRead() {
```

--- End diff --

This name is more general. It looks fine to me. In the future, if we support another batch read mode, we can add an extra function to further identify the batch mode. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
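The fallback contract under discussion can be sketched outside Spark; the names below mirror the Java interface but the Python is purely illustrative, not Spark code:

```python
# Illustrative sketch of the enableBatchRead() fallback contract
# (hypothetical Python names mirroring the Java interface; not Spark code).
class ColumnarReader:
    def enable_batch_read(self):
        # A reader that can only vectorize certain column types would
        # return False here to fall back to the row-based path.
        return True

    def create_batch_read_tasks(self):
        return ["columnar-batch-task"]

    def create_read_tasks(self):
        raise RuntimeError(
            "createReadTasks should not be called when batch read is enabled")

def plan_scan(reader):
    # The planner consults the flag and picks exactly one read path.
    if reader.enable_batch_read():
        return reader.create_batch_read_tasks()
    return reader.create_read_tasks()

assert plan_scan(ColumnarReader()) == ["columnar-batch-task"]
```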
[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20153#discussion_r160746791 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import java.util.List; + +import org.apache.spark.annotation.InterfaceStability; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.vectorized.ColumnarBatch; + +/** + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this + * interface to output {@link ColumnarBatch} and make the scan faster. --- End diff -- We need to explain the precedence of `SupportsScanColumnarBatch ` and `SupportsScanUnsafeRow` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20221: [SPARK-23019][Core] Wait until SparkContext.stop() finis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20221 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85919/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20221: [SPARK-23019][Core] Wait until SparkContext.stop() finis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20221 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20221: [SPARK-23019][Core] Wait until SparkContext.stop() finis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20221 **[Test build #85919 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85919/testReport)** for PR 20221 at commit [`16b985c`](https://github.com/apache/spark/commit/16b985c1018c0f5f54ef1062bdd7960cc4a87b39). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20013 **[Test build #85924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85924/testReport)** for PR 20013 at commit [`34d02b2`](https://github.com/apache/spark/commit/34d02b2c05f42a7992de85b7d0c447bba738d2a8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20222 Oh I see, they're just not listed on https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ . I believe I can fix that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20222: [SPARK-23028] Bump master branch version to 2.4.0...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/20222#discussion_r160741170 --- Diff: project/MimaExcludes.scala --- @@ -34,6 +34,10 @@ import com.typesafe.tools.mima.core.ProblemFilters._ */ object MimaExcludes { + // Exclude rules for 2.4.x --- End diff -- We might want to make my proposed MiMa changes in a separate patch so the same set of reviewed changes can go to both master and branch-2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20222: [SPARK-23028] Bump master branch version to 2.4.0...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/20222#discussion_r160740904 --- Diff: project/MimaExcludes.scala --- @@ -34,6 +34,10 @@ import com.typesafe.tools.mima.core.ProblemFilters._ */ object MimaExcludes { + // Exclude rules for 2.4.x --- End diff -- This reminds me that we should probably bump `previousSparkVersion` in `MimaBuild.scala` to be 2.2.0: https://github.com/apache/spark/blame/f340b6b3066033d40b7e163fd5fb68e9820adfb1/project/MimaBuild.scala#L91. I think this should happen for both master and branch-2.3. See #15061 for an example of a similar change when 2.1.x was being prepared (it looks like we missed this step for 2.2.0). We may also need to un-exclude any new subprojects / artifacts that were added in 2.2.0 since they now need to be backwards-compatibility-tested for 2.3.x. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/20222 We should already be set up for 2.3.x builds in AMPLab Jenkins. For example: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.3-test-maven-hadoop-2.6/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20210 Thanks @ueshin and @HyukjinKwon ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20213 Thanks @ueshin and @HyukjinKwon ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20072 cc @CodingCat --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18991: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18991 What is the performance number when turning it on, compared with the off mode? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20219 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85920/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20219 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20219 **[Test build #85920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85920/testReport)** for PR 20219 at commit [`8e821a6`](https://github.com/apache/spark/commit/8e821a62135d63b937851372176ab053125cc9f5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user attilapiros commented on the issue: https://github.com/apache/spark/pull/20203 cc @tgravescs --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20151 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20151 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85918/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20151 **[Test build #85918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85918/testReport)** for PR 20151 at commit [`fc65803`](https://github.com/apache/spark/commit/fc658034639c1aa56ff5b9a44624cad05377fe51). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20151 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85917/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20151 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20151 **[Test build #85917 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85917/testReport)** for PR 20151 at commit [`ea5b987`](https://github.com/apache/spark/commit/ea5b987d59f415045a2a890d9e4cf30198d82717). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20222 **[Test build #85923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85923/testReport)** for PR 20222 at commit [`9107f9f`](https://github.com/apache/spark/commit/9107f9f2d71610888fc4c0beac70a4a87c8348d9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20221: [SPARK-23019][Core] Wait until SparkContext.stop(...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/20221#discussion_r160719518 --- Diff: core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java --- @@ -133,6 +134,10 @@ public void testInProcessLauncher() throws Exception { p.put(e.getKey(), e.getValue()); } System.setProperties(p); + // Here DAGScheduler is stopped, while SparkContext.clearActiveContext may not be called yet. + // Wait for a reasonable amount of time to avoid creating two active SparkContext in JVM. + // See SPARK-23019 and SparkContext.stop() for details. + TimeUnit.MILLISECONDS.sleep(500); --- End diff -- Yep, either way should work. Using `TimeUnit` is just following `BaseSuite.java`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
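The fixed 500 ms sleep discussed above is a workaround for a race; tests that wait for a condition often use a bounded poll instead, returning as soon as the condition holds. A generic sketch (illustrative, not the suite's code):

```python
import time

def wait_until(condition, timeout=0.5, interval=0.01):
    # Poll instead of sleeping a fixed amount: return as soon as the
    # condition holds, or give up once the timeout elapses.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()

assert wait_until(lambda: True) is True
assert wait_until(lambda: False, timeout=0.05) is False
```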
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20222 `R/pkg/DESCRIPTION` needs an update too. I'll update the release process. @JoshRosen or @shaneknapp looks like we also need to have set up the branch-2.3 tests in Jenkins. Right now it's not tested. I might have access to do it. Is it safe to copy and paste existing jobs and change the branch reference? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20208 Also, ping @rxin , too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18991: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18991 Yes, @cloud-fan . I added the same test coverage for ORC in Apache Spark. Sorry, @gatorsmile . I always turned on PPD, so there is no perf number for PPD=false. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20222 @srowen Yeah. I am still in travel mode. : ) Please help do it. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20195: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20195 Thanks! Merged to 2.2 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20199: [Spark-22967][TESTS]Fix VersionSuite's unit tests...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/20199#discussion_r160709123

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---

```diff
@@ -842,6 +842,7 @@ class VersionsSuite extends SparkFunSuite with Logging {
   }
   test(s"$version: SPARK-17920: Insert into/overwrite avro table") {
+    assume(!(Utils.isWindows && version == "0.12"))
```

--- End diff --

@HyukjinKwon Ok, will comment. I have found several issues that describe exactly the same problem: - https://issues.apache.org/jira/browse/SPARK-12216 - https://issues.apache.org/jira/browse/SPARK-8333 - https://issues.apache.org/jira/browse/SPARK-18979 It seems this is a common issue on Windows that hasn't been resolved yet, and I didn't find an accurate cause for the error in those issues. Now I suspect this problem may be related to `URLClassLoader`; I have reproduced the same issue in an experiment with `URLClassLoader`, though I'm not sure this is the real cause yet. Working on this. We can merge this PR to master after I add the comment, and open a new PR for that issue (if fixed) or follow up under one of those issues. WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r160708666 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala --- @@ -15,12 +15,30 @@ * limitations under the License. */ -package org.apache.spark.sql.execution.python +package org.apache.spark.sql.catalyst.expressions -import org.apache.spark.api.python.PythonFunction -import org.apache.spark.sql.catalyst.expressions.{Expression, NonSQLExpression, Unevaluable, UserDefinedExpression} +import org.apache.spark.api.python.{PythonEvalType, PythonFunction} +import org.apache.spark.sql.catalyst.util.toPrettySQL import org.apache.spark.sql.types.DataType +/** + * Helper functions for PythonUDF + */ +object PythonUDF { + def isScalarPythonUDF(e: Expression): Boolean = { +e.isInstanceOf[PythonUDF] && + Set( --- End diff -- Aha, good call. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r160708246 --- Diff: python/pyspark/sql/tests.py --- @@ -511,7 +517,6 @@ def test_udf_with_order_by_and_limit(self): my_copy = udf(lambda x: x, IntegerType()) df = self.spark.range(10).orderBy("id") res = df.select(df.id, my_copy(df.id).alias("copy")).limit(1) -res.explain(True) --- End diff -- Yes, I think this is removed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20222 Yeah ideally that's part of the branching process. I think it could be documented in the "Preparing Spark for Release" section of `release-process.md` in `spark-website`. Open up a separate PR for that if you like, or I can do it if you're not set up for it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19802: [SPARK-22594][CORE] Handling spark-submit and master ver...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19802 I see, yeah that's a fine point. Don't rethrow then. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19802: [SPARK-22594][CORE] Handling spark-submit and master ver...
Github user Jiri-Kremser commented on the issue: https://github.com/apache/spark/pull/19802

> I think it's probably OK if you also re-throw that exception for now. You're just giving more info then.

Hmm, in the original code the exception wasn't re-thrown. Do you really think it is a good idea? `InvalidClassException` is a checked exception, so the signature of `processOneWayMessage()` would need to be changed as well. This is the original catch-them-all code :)

```java
} catch (Exception e) {
  logger.error("Error while invoking RpcHandler#receive() for one-way message.", e);
}
```

I think I addressed all your other inline comments, thanks for the review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
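The trade-off being discussed — logging a failure with extra context versus re-throwing a checked exception and changing the method signature — can be sketched generically; the names below are illustrative, not Spark's:

```python
import logging

logger = logging.getLogger("rpc")

def process_one_way_message(handler, message):
    # Mirrors the catch-them-all pattern quoted above: the failure is
    # logged with context rather than re-thrown, so callers' signatures
    # stay unchanged.
    try:
        handler(message)
        return True
    except Exception:
        logger.exception("Error while invoking handler for one-way message.")
        return False

def bad_handler(msg):
    raise ValueError("stream corrupted: " + msg)

assert process_one_way_message(bad_handler, "m1") is False
assert process_one_way_message(lambda m: None, "m2") is True
```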
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20222 **[Test build #85922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85922/testReport)** for PR 20222 at commit [`3e1d6d6`](https://github.com/apache/spark/commit/3e1d6d6ab58411491f9620ee7e568d002759ea58). ---
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20222 cc @yhuai @sameeragarwal @JoshRosen @rxin ---
[GitHub] spark pull request #20222: [SPARK-23028] Bump master branch version to 2.4.0...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/20222

[SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT

## What changes were proposed in this pull request?

This patch bumps the master branch version to `2.4.0-SNAPSHOT`.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark bump24

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20222.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20222

commit 932d93e1bbeecd9f8bec0b4d80e42f7e244787e6
Author: gatorsmile
Date: 2018-01-10T14:57:34Z

    bump

commit 3e1d6d6ab58411491f9620ee7e568d002759ea58
Author: gatorsmile
Date: 2018-01-10T14:59:37Z

    bump

---
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20219 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85916/ Test FAILed. ---
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20219 Merged build finished. Test FAILed. ---