[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...

2018-01-10 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20210
  
Hi, @HyukjinKwon .
It seems that branch-2.3 doesn't have this.
Could you merge to branch-2.3, too?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20096
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...

2018-01-10 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/20056
  
I have the same question as @jiangxb1987 , what is the situation where 
you'd use this metric?  the jira doesn't say either.  seems like existing 
metrics mostly cover this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20096
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85926/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20096
  
**[Test build #85926 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85926/testReport)**
 for PR 20096 at commit 
[`f94b53e`](https://github.com/apache/spark/commit/f94b53e3ab7e37fdcb9f34cf7d1313a4905fa341).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19885
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85931/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19885
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19885
  
**[Test build #85931 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85931/testReport)**
 for PR 19885 at commit 
[`296a19f`](https://github.com/apache/spark/commit/296a19fc5b1881959c7cf52b3c6e33eb7fa12b57).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20223: [SPARK-23020][core] Fix races in launcher code, test.

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20223
  
**[Test build #85929 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85929/testReport)**
 for PR 20223 at commit 
[`5139f60`](https://github.com/apache/spark/commit/5139f605904996667d8d97941172bbe9d366a579).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19885
  
**[Test build #85931 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85931/testReport)**
 for PR 19885 at commit 
[`296a19f`](https://github.com/apache/spark/commit/296a19fc5b1881959c7cf52b3c6e33eb7fa12b57).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20174
  
**[Test build #85930 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85930/testReport)**
 for PR 20174 at commit 
[`1d2771c`](https://github.com/apache/spark/commit/1d2771c99a6901061b2703ea34ab4ecda342af1c).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...

2018-01-10 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/20192
  
If there's no more feedback I'll merge this later today.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19041: [SPARK-21097][CORE] Add option to recover cached data

2018-01-10 Thread brad-kaiser
Github user brad-kaiser commented on the issue:

https://github.com/apache/spark/pull/19041
  
Lol, no worries. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-10 Thread vanzin
GitHub user vanzin opened a pull request:

https://github.com/apache/spark/pull/20223

[SPARK-23020][core] Fix races in launcher code, test.

The race in the code is because the handle might update
its state to the wrong state if the connection handling
thread is still processing incoming data; so the handle
needs to wait for the connection to finish up before
checking the final state.

The race in the test is because when waiting for a handle
to reach a final state, the waitFor() method needs to wait
until all handle state is updated (which also includes
waiting for the connection thread above to finish).
Otherwise, waitFor() may return too early, which would cause
a bunch of different races (like the listener not being yet
notified of the state change, or being in the middle of
being notified, or the handle not being properly disposed
and causing postChecks() to assert).

On top of that I found, by code inspection, a couple of
potential races that could make a handle end up in the
wrong state when being killed.

Tested by running the existing unit tests a lot (and not
seeing the errors I was seeing before).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vanzin/spark SPARK-23020

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20223.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20223


commit 5139f605904996667d8d97941172bbe9d366a579
Author: Marcelo Vanzin 
Date:   2018-01-10T17:56:17Z

[SPARK-23020][core] Fix races in launcher code, test.

The race in the code is because the handle might update
its state to the wrong state if the connection handling
thread is still processing incoming data; so the handle
needs to wait for the connection to finish up before
checking the final state.

The race in the test is because when waiting for a handle
to reach a final state, the waitFor() method needs to wait
until all handle state is updated (which also includes
waiting for the connection thread above to finish).
Otherwise, waitFor() may return too early, which would cause
a bunch of different races (like the listener not being yet
notified of the state change, or being in the middle of
being notified, or the handle not being properly disposed
and causing postChecks() to assert).

On top of that I found, by code inspection, a couple of
potential races that could make a handle end up in the
wrong state when being killed.

Tested by running the existing unit tests a lot (and not
seeing the errors I was seeing before).




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20013
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85924/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20013
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20013
  
**[Test build #85924 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85924/testReport)**
 for PR 20013 at commit 
[`34d02b2`](https://github.com/apache/spark/commit/34d02b2c05f42a7992de85b7d0c447bba738d2a8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19041: [SPARK-21097][CORE] Add option to recover cached data

2018-01-10 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/19041
  
> Is there anything else you need for this PR?

An extra day on my work week...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19041: [SPARK-21097][CORE] Add option to recover cached data

2018-01-10 Thread brad-kaiser
Github user brad-kaiser commented on the issue:

https://github.com/apache/spark/pull/19041
  
Hey @vanzin just wanted to check in on this. Is there anything else you 
need for this PR? Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20218
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85925/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20218
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20218
  
**[Test build #85925 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85925/testReport)**
 for PR 20218 at commit 
[`7924e28`](https://github.com/apache/spark/commit/7924e28d26623a0ba0a7a67cb6994e9ee0220677).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20222
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20222
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85923/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20222
  
**[Test build #85923 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85923/testReport)**
 for PR 20222 at commit 
[`9107f9f`](https://github.com/apache/spark/commit/9107f9f2d71610888fc4c0beac70a4a87c8348d9).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

2018-01-10 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/19872
  
@ueshin and @HyukjinKwon Thanks much for the review. I have addressed the 
latest comment.

@ueshin I think UDAF that supports partial aggregation can be build on top 
of this. The questions you asked about UDAF interfaces are very good. I haven't 
thought them through. I think it will probably end up to be another API for 
define pandas UDF (has `update` `merge` `finalize` methods, for instance) and 
another physical plan. I think we can leave that for the future. Is there 
anything specific issue with regard to UDAF that supports partial aggregation 
that you want to address here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19872
  
**[Test build #85928 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85928/testReport)**
 for PR 19872 at commit 
[`cb36227`](https://github.com/apache/spark/commit/cb362274711c1b26ed19e87aa15bc8c64668eae6).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

2018-01-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r160779041
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala
 ---
@@ -15,12 +15,30 @@
  * limitations under the License.
  */
 
-package org.apache.spark.sql.execution.python
+package org.apache.spark.sql.catalyst.expressions
 
-import org.apache.spark.api.python.PythonFunction
-import org.apache.spark.sql.catalyst.expressions.{Expression, 
NonSQLExpression, Unevaluable, UserDefinedExpression}
+import org.apache.spark.api.python.{PythonEvalType, PythonFunction}
+import org.apache.spark.sql.catalyst.util.toPrettySQL
 import org.apache.spark.sql.types.DataType
 
+/**
+ * Helper functions for PythonUDF
+ */
+object PythonUDF {
+  def isScalarPythonUDF(e: Expression): Boolean = {
+e.isInstanceOf[PythonUDF] &&
+  Set(
--- End diff --

Fixed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

2018-01-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r160779007
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala
 ---
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import java.io.File
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.{SparkEnv, TaskContext}
+import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, 
ClusteredDistribution, Distribution, Partitioning}
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, 
UnaryExecNode}
+import org.apache.spark.sql.types.{DataType, StructField, StructType}
+import org.apache.spark.util.Utils
+
+/**
+ * Physical node for aggregation with group aggregate Pandas UDF.
+ *
+ * This plan works by sending the necessary (projected) input grouped data 
as Arrow record batches
+ * to the python worker, the python worker invokes the UDF and sends the 
results to the executor,
+ * finally the executor evaluates any post-aggregation expressions and 
join the result with the
+ * grouped key.
+ */
+case class AggregateInPandasExec(
+groupingExpressions: Seq[NamedExpression],
+udfExpressions: Seq[PythonUDF],
+resultExpressions: Seq[NamedExpression],
+child: SparkPlan)
+  extends UnaryExecNode {
+
+  override val output: Seq[Attribute] = 
resultExpressions.map(_.toAttribute)
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override def producedAttributes: AttributeSet = AttributeSet(output)
+
+  override def requiredChildDistribution: Seq[Distribution] = {
+if (groupingExpressions.isEmpty) {
+  AllTuples :: Nil
+} else {
+  ClusteredDistribution(groupingExpressions) :: Nil
+}
+  }
+
+  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, 
Seq[Expression]) = {
+udf.children match {
+  case Seq(u: PythonUDF) =>
+val (chained, children) = collectFunctions(u)
+(ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
+  case children =>
+// There should not be any other UDFs, or the children can't be 
evaluated directly.
+assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
+(ChainedPythonFunctions(Seq(udf.func)), udf.children)
+}
+  }
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
+Seq(groupingExpressions.map(SortOrder(_, Ascending)))
+
+  override protected def doExecute(): RDD[InternalRow] = {
+val inputRDD = child.execute()
+
+val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
+val reuseWorker = 
inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
+val sessionLocalTimeZone = conf.sessionLocalTimeZone
+val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
+
+val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip
+
+val allInputs = new ArrayBuffer[Expression]
+val dataTypes = new ArrayBuffer[DataType]
+val argOffsets = inputs.map { input =>
+  input.map { e =>
+if (allInputs.exists(_.semanticEquals(e))) {
+  allInputs.indexWhere(_.semanticEquals(e))
+} else {
+  allInputs += e
+  dataTypes += e.dataType
+  allInputs.length - 1
+}
+  }.toArray
+}.toArray
+
+val schema = StructType(dataTypes.zipWithIndex.map { case (dt, i) =>
+  StructField(s"_$i", dt)
+})
+
+val input = 

[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

2018-01-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r160778894
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -4052,6 +4045,386 @@ def test_unsupported_types(self):
 df.groupby('id').apply(f).collect()
 
 
+@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
installed")
+class GroupbyAggTests(ReusedSQLTestCase):
+
+@property
+def data(self):
+from pyspark.sql.functions import array, explode, col, lit
+return self.spark.range(10).toDF('id') \
+.withColumn("vs", array([lit(i * 1.0) + col('id') for i in 
range(20, 30)])) \
+.withColumn("v", explode(col('vs'))) \
+.drop('vs') \
+.withColumn('w', lit(1.0))
+
+@property
+def plus_one(self):
+from pyspark.sql.functions import udf
+
+@udf('double')
+def plus_one(v):
+assert isinstance(v, (int, float))
+return v + 1
+return plus_one
+
+@property
+def plus_two(self):
+import pandas as pd
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+@pandas_udf('double', PandasUDFType.SCALAR)
+def plus_two(v):
+assert isinstance(v, pd.Series)
+return v + 2
+return plus_two
+
+@property
+def mean_udf(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+@pandas_udf('double', PandasUDFType.GROUP_AGG)
+def mean_udf(v):
+return v.mean()
+return mean_udf
+
+@property
+def sum_udf(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+@pandas_udf('double', PandasUDFType.GROUP_AGG)
+def sum_udf(v):
+return v.sum()
+return sum_udf
+
+@property
+def weighted_mean_udf(self):
+import numpy as np
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+@pandas_udf('double', PandasUDFType.GROUP_AGG)
+def weighted_mean_udf(v, w):
+return np.average(v, weights=w)
+return weighted_mean_udf
+
+def test_basic(self):
+from pyspark.sql.functions import col, lit, sum, mean
+
+df = self.data
+weighted_mean_udf = self.weighted_mean_udf
+
+result1 = df.groupby('id').agg(weighted_mean_udf(df.v, 
lit(1.0))).sort('id')
+expected1 = 
df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
+self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+
+result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, 
lit(1.0)))\
+.sort(df.id + 1)
+expected2 = df.groupby((col('id') + 1))\
+.agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id 
+ 1)
+self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
+
+result3 = df.groupby('id').agg(weighted_mean_udf(df.v, 
df.w)).sort('id')
+expected3 = 
df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
+self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
+
+result4 = df.groupby((col('id') + 1).alias('id'))\
+.agg(weighted_mean_udf(df.v, df.w))\
+.sort('id')
+expected4 = df.groupby((col('id') + 1).alias('id'))\
+.agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
+.sort('id')
+self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
+
+def test_array(self):
+from pyspark.sql.types import ArrayType, DoubleType
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+with QuietTest(self.sc):
+with self.assertRaisesRegexp(NotImplementedError, 'not 
supported'):
+@pandas_udf(ArrayType(DoubleType()), 
PandasUDFType.GROUP_AGG)
+def mean_and_std_udf(v):
+return [v.mean(), v.std()]
+
+def test_struct(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+with QuietTest(self.sc):
+with self.assertRaisesRegexp(NotImplementedError, 'not 
supported'):
+@pandas_udf('mean double, std double', 
PandasUDFType.GROUP_AGG)
+def mean_and_std_udf(v):
+return (v.mean(), v.std())
+
+def test_alias(self):
+from pyspark.sql.functions import mean
+
+df = self.data
+mean_udf = self.mean_udf
+
 

[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

2018-01-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r160778794
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -153,11 +153,20 @@ trait CheckAnalysis extends PredicateHelper {
 s"of type ${condition.dataType.simpleString} is not a 
boolean.")
 
   case Aggregate(groupingExprs, aggregateExprs, child) =>
+def isAggregateExpression(expr: Expression) = {
+  expr.isInstanceOf[AggregateExpression] ||
+PythonUDF.isGroupAggPandasUDF(expr)
--- End diff --

Done.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

2018-01-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r160778766
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -271,9 +272,14 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
   case PhysicalAggregation(
 namedGroupingExpressions, aggregateExpressions, 
rewrittenResultExpressions, child) =>
 
+require(
+  !aggregateExpressions.exists(PythonUDF.isGroupAggPandasUDF),
+  "Streaming aggregation doesn't support group aggregate pandas 
UDF"
+)
--- End diff --

Done


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...

2018-01-10 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/20096#discussion_r160774506
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochCoordinator.scala
 ---
@@ -39,6 +39,15 @@ private[continuous] sealed trait EpochCoordinatorMessage 
extends Serializable
  */
 private[sql] case object IncrementAndGetEpoch extends 
EpochCoordinatorMessage
 
+/**
--- End diff --

looks good to me


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20222
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20222
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85922/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...

2018-01-10 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160772523
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala
 ---
@@ -198,6 +198,20 @@ class ReplaceOperatorSuite extends PlanTest {
 comparePlans(optimized, correctAnswer)
   }
 
+  test("add one grouping key if necessary when replace Deduplicate with 
Aggregate") {
+val input = LocalRelation()
+val query = Deduplicate(Seq.empty, input) // dropDuplicates()
+val optimized = Optimize.execute(query.analyze)
+
+val correctAnswer =
--- End diff --

nit: this can be all in one line


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20222
  
**[Test build #85922 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85922/testReport)**
 for PR 20222 at commit 
[`3e1d6d6`](https://github.com/apache/spark/commit/3e1d6d6ab58411491f9620ee7e568d002759ea58).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...

2018-01-10 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160770992
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -666,4 +665,16 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
   assert(exchangePlans.length == 1)
 }
   }
+
+  Seq(true, false).foreach { codegen =>
+test("SPARK-22951: dropDuplicates on empty data frames should produce 
correct aggregate" +
+  s" results when codegen enabled: $codegen") {
+  withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, 
codegen.toString)) {
+assert(Seq.empty[Int].toDF("a").count() == 0)
+assert(Seq.empty[Int].toDF("a").agg(count("*")).count() == 1)
+assert(spark.emptyDataFrame.dropDuplicates().count() == 0)
+
assert(spark.emptyDataFrame.dropDuplicates().agg(count("*")).count() == 1)
--- End diff --

@liufengdb Maybe also add assertions to confirm that explicit global 
aggregations (by providing zero grouping keys) still return one row? For 
example:

```scala
val emptyAgg = Map.empty[String, String]

checkAnswer(
  spark.emptyDataFrame.agg(emptyAgg),
  Seq(Row())
)

checkAnswer(
  spark.emptyDataFrame.groupBy().agg(emptyAgg),
  Seq(Row())
)
```



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/20222
  
Wait

dev/release-tag.sh does this automatically though.

I just want to make sure we are not missing things (like R/DESCRIPTION)
I think maybe we should run a subset of release-tag





---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread felixcheungu
Github user felixcheungu commented on the issue:

https://github.com/apache/spark/pull/20222
  
@felixcheung 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

2018-01-10 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20153#discussion_r160764573
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java
 ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.util.List;
+
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+
+/**
+ * A mix-in interface for {@link DataSourceV2Reader}. Data source readers 
can implement this
+ * interface to output {@link ColumnarBatch} and make the scan faster.
+ */
+@InterfaceStability.Evolving
+public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
+  @Override
+  default List createReadTasks() {
+throw new IllegalStateException(
+  "createReadTasks should not be called with 
SupportsScanColumnarBatch.");
+  }
+
+  /**
+   * Similar to {@link DataSourceV2Reader#createReadTasks()}, but returns 
columnar data in batches.
+   */
+  List createBatchReadTasks();
+
+  /**
+   * A safety door for columnar batch reader. It's possible that the 
implementation can only support
+   * some certain columns with certain types. Users can overwrite this 
method and
+   * {@link #createReadTasks()} to fallback to normal read path under some 
conditions.
+   */
+  default boolean enableBatchRead() {
--- End diff --

If it controls batch mode or non-batch mode, I agree.  

IIUC, this value is used to show whether we enable to read data from 
column-oriented storage (e.g. `ColumnarVector`) or  row-oriented storage (e.g. 
`UnsafeRow`). I feel that it is not a batch mode.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...

2018-01-10 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20174
  
I see, thnaks for your answer @liancheng 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...

2018-01-10 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/20174
  
@mgaido91 We can't because we do not know whether there are any input rows 
or not. For example:

```scala
val df1 = spark.range(10).select()
val df2 = spark.range(10).filter($"id" < 0).select()
val df3 = df1.dropDuplicates()
val df4 = df2.dropDuplicates()
```

`df1` has zero columns and ten rows while `df2` has no columns and zero 
rows. Therefore, `df3` should return one row containing zero columns while 
`df4` should return zero rows.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...

2018-01-10 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20174
  
@mgaido91 Your proposal and current approach are both with one line change. 
Since the issue is actually related to the hash aggregate implementation, I 
think it is reasonable to include it in the  `ReplaceDeduplicateWithAggregate` 
transformation. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20174
  
**[Test build #85927 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85927/testReport)**
 for PR 20174 at commit 
[`8418536`](https://github.com/apache/spark/commit/8418536f29ee1375018d731c580caa9e3d418bd0).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...

2018-01-10 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160757582
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1221,7 +1221,12 @@ object ReplaceDeduplicateWithAggregate extends 
Rule[LogicalPlan] {
   Alias(new First(attr).toAggregateExpression(), 
attr.name)(attr.exprId)
 }
   }
-  Aggregate(keys, aggCols, child)
+  // SPARK-22951: the implementation of aggregate operator treats the 
cases with and without
+  // grouping keys differently, when there are not input rows. For the 
aggregation after
+  // `dropDuplicates()` on an empty data frame, a grouping key is 
added here to make sure the
+  // aggregate operator can work correctly (returning an empty 
iterator).
--- End diff --

ok, I like this one.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20096
  
**[Test build #85926 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85926/testReport)**
 for PR 20096 at commit 
[`f94b53e`](https://github.com/apache/spark/commit/f94b53e3ab7e37fdcb9f34cf7d1313a4905fa341).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-10 Thread jose-torres
Github user jose-torres commented on the issue:

https://github.com/apache/spark/pull/20096
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20218
  
**[Test build #85925 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85925/testReport)**
 for PR 20218 at commit 
[`7924e28`](https://github.com/apache/spark/commit/7924e28d26623a0ba0a7a67cb6994e9ee0220677).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...

2018-01-10 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20218
  
Since this is a flakiness issue, I retriggered it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...

2018-01-10 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20218
  
Retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20221: [SPARK-23019][Core] Wait until SparkContext.stop(...

2018-01-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20221


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18991: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...

2018-01-10 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18991
  
@gatorsmile . I don't have any numbers for PPD=false.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20219
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20219
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85921/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20219
  
**[Test build #85921 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85921/testReport)**
 for PR 20219 at commit 
[`3d04d4d`](https://github.com/apache/spark/commit/3d04d4dfd5b3d320ec494ec838dd299f654d6c98).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20221: [SPARK-23019][Core] Wait until SparkContext.stop() finis...

2018-01-10 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/20221
  
Merging to master / 2.3.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20221: [SPARK-23019][Core] Wait until SparkContext.stop() finis...

2018-01-10 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/20221
  
Oh, wait, yes, this should fix SPARK-23019 (since that's caused by the 
context being still alive). I'll keep investigating SPARK-23020 separately.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

2018-01-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20153#discussion_r160747322
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java
 ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.util.List;
+
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+
+/**
+ * A mix-in interface for {@link DataSourceV2Reader}. Data source readers 
can implement this
+ * interface to output {@link ColumnarBatch} and make the scan faster.
+ */
+@InterfaceStability.Evolving
+public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
+  @Override
+  default List createReadTasks() {
+throw new IllegalStateException(
+  "createReadTasks should not be called with 
SupportsScanColumnarBatch.");
+  }
+
+  /**
+   * Similar to {@link DataSourceV2Reader#createReadTasks()}, but returns 
columnar data in batches.
+   */
+  List createBatchReadTasks();
+
+  /**
+   * A safety door for columnar batch reader. It's possible that the 
implementation can only support
+   * some certain columns with certain types. Users can overwrite this 
method and
+   * {@link #createReadTasks()} to fallback to normal read path under some 
conditions.
+   */
+  default boolean enableBatchRead() {
--- End diff --

This name is more general. It looks fine to me. In the future, if we 
support another batch read mode, we can add the extra function to further 
identify the batch mode.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

2018-01-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20153#discussion_r160746791
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java
 ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.util.List;
+
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+
+/**
+ * A mix-in interface for {@link DataSourceV2Reader}. Data source readers 
can implement this
+ * interface to output {@link ColumnarBatch} and make the scan faster.
--- End diff --

We need to explain the precedence of `SupportsScanColumnarBatch ` and 
`SupportsScanUnsafeRow`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20221: [SPARK-23019][Core] Wait until SparkContext.stop() finis...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20221
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85919/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20221: [SPARK-23019][Core] Wait until SparkContext.stop() finis...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20221
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20221: [SPARK-23019][Core] Wait until SparkContext.stop() finis...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20221
  
**[Test build #85919 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85919/testReport)**
 for PR 20221 at commit 
[`16b985c`](https://github.com/apache/spark/commit/16b985c1018c0f5f54ef1062bdd7960cc4a87b39).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20013
  
**[Test build #85924 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85924/testReport)**
 for PR 20013 at commit 
[`34d02b2`](https://github.com/apache/spark/commit/34d02b2c05f42a7992de85b7d0c447bba738d2a8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/20222
  
Oh I see, they're just not listed on 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ . 
I believe I can fix that.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20222: [SPARK-23028] Bump master branch version to 2.4.0...

2018-01-10 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20222#discussion_r160741170
  
--- Diff: project/MimaExcludes.scala ---
@@ -34,6 +34,10 @@ import com.typesafe.tools.mima.core.ProblemFilters._
  */
 object MimaExcludes {
 
+  // Exclude rules for 2.4.x
--- End diff --

We might want to make my proposed MiMa changes in a separate patch so the 
same set of reviewed changes can go to both master and branch-2.3.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20222: [SPARK-23028] Bump master branch version to 2.4.0...

2018-01-10 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20222#discussion_r160740904
  
--- Diff: project/MimaExcludes.scala ---
@@ -34,6 +34,10 @@ import com.typesafe.tools.mima.core.ProblemFilters._
  */
 object MimaExcludes {
 
+  // Exclude rules for 2.4.x
--- End diff --

This reminds me that we should probably bump `previousSparkVersion` in 
`MimaBuild.scala` to be 2.2.0: 
https://github.com/apache/spark/blame/f340b6b3066033d40b7e163fd5fb68e9820adfb1/project/MimaBuild.scala#L91.
 I think this should happen for both master and branch-2.3.

See #15061 for an example of a similar change when 2.1.x was being prepared 
(it looks like we missed this step for 2.2.0). We may also need to un-exclude 
any new subprojects / artifacts that were added in 2.2.0 since they now need to 
be backwards-compatibility-tested for 2.3.x.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread JoshRosen
Github user JoshRosen commented on the issue:

https://github.com/apache/spark/pull/20222
  
We should already be set up for 2.3.x builds in AMPLab Jenkins. For 
example: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.3-test-maven-hadoop-2.6/


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...

2018-01-10 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/20210
  
Thanks @ueshin and @HyukjinKwon !


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...

2018-01-10 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/20213
  
Thanks @ueshin and @HyukjinKwon !


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

2018-01-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20072
  
cc @CodingCat 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18991: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...

2018-01-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18991
  
What is the performance number when turning it on, compared with the off 
mode?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20219
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85920/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20219
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20219
  
**[Test build #85920 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85920/testReport)**
 for PR 20219 at commit 
[`8e821a6`](https://github.com/apache/spark/commit/8e821a62135d63b937851372176ab053125cc9f5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...

2018-01-10 Thread attilapiros
Github user attilapiros commented on the issue:

https://github.com/apache/spark/pull/20203
  
cc @tgravescs  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20151
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20151
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85918/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20151
  
**[Test build #85918 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85918/testReport)**
 for PR 20151 at commit 
[`fc65803`](https://github.com/apache/spark/commit/fc658034639c1aa56ff5b9a44624cad05377fe51).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20151
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85917/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20151
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20151
  
**[Test build #85917 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85917/testReport)**
 for PR 20151 at commit 
[`ea5b987`](https://github.com/apache/spark/commit/ea5b987d59f415045a2a890d9e4cf30198d82717).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20222
  
**[Test build #85923 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85923/testReport)**
 for PR 20222 at commit 
[`9107f9f`](https://github.com/apache/spark/commit/9107f9f2d71610888fc4c0beac70a4a87c8348d9).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20221: [SPARK-23019][Core] Wait until SparkContext.stop(...

2018-01-10 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20221#discussion_r160719518
  
--- Diff: 
core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java ---
@@ -133,6 +134,10 @@ public void testInProcessLauncher() throws Exception {
 p.put(e.getKey(), e.getValue());
   }
   System.setProperties(p);
+  // Here DAGScheduler is stopped, while 
SparkContext.clearActiveContext may not be called yet.
+  // Wait for a reasonable amount of time to avoid creating two active 
SparkContext in JVM.
+  // See SPARK-23019 and SparkContext.stop() for details.
+  TimeUnit.MILLISECONDS.sleep(500);
--- End diff --

Yep, either way should work.
Using `TimeUnit` is just following `BaseSuite.java`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/20222
  
`R/pkg/DESCRIPTION` needs an update too.
I'll update the release process.

@JoshRosen or @shaneknapp looks like we also need to have set up the 
branch-2.3 tests in Jenkins. Right now it's not tested. I might have access to 
do it. Is it safe to copy and paste existing jobs and change the branch 
reference?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

2018-01-10 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20208
  
Also, ping @rxin , too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18991: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...

2018-01-10 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18991
  
Yes, @cloud-fan . I added the same test coverage for ORC in Apache Spark.
Sorry, @gatorsmile . I always turned on PPD, so there is no perf number for 
PPD=false.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20222
  
@srowen Yeah. I am still in travel mode. : ) Please help do it. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20195: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20195
  
Thanks! Merged to 2.2


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20199: [Spark-22967][TESTS]Fix VersionSuite's unit tests...

2018-01-10 Thread Ngone51
Github user Ngone51 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20199#discussion_r160709123
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -842,6 +842,7 @@ class VersionsSuite extends SparkFunSuite with Logging {
 }
 
 test(s"$version: SPARK-17920: Insert into/overwrite avro table") {
+  assume(!(Utils.isWindows && version == "0.12"))
--- End diff --

@HyukjinKwon Ok, will comment.

And I have find serval issues which exactly describe the same problem:
- https://issues.apache.org/jira/browse/SPARK-12216
- https://issues.apache.org/jira/browse/SPARK-8333
- https://issues.apache.org/jira/browse/SPARK-18979

It seems this is a common issue exists on Windows and haven't been resolved 
yet. And I didn't find a accurate cause for the error from those issues. And 
now, I doubt this problem maybe related to URLClassLoader. And, I have 
reproduce the same issue after I did a experiment with URLClassLoader. Though, 
I'm not sure this is the real cause, yet. Working on this.

We can merge this pr to master after I add comment. And open a new pr of 
that issue(if fixed) or communicate under one of those issues. WDYT?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

2018-01-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r160708666
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala
 ---
@@ -15,12 +15,30 @@
  * limitations under the License.
  */
 
-package org.apache.spark.sql.execution.python
+package org.apache.spark.sql.catalyst.expressions
 
-import org.apache.spark.api.python.PythonFunction
-import org.apache.spark.sql.catalyst.expressions.{Expression, 
NonSQLExpression, Unevaluable, UserDefinedExpression}
+import org.apache.spark.api.python.{PythonEvalType, PythonFunction}
+import org.apache.spark.sql.catalyst.util.toPrettySQL
 import org.apache.spark.sql.types.DataType
 
+/**
+ * Helper functions for PythonUDF
+ */
+object PythonUDF {
+  def isScalarPythonUDF(e: Expression): Boolean = {
+e.isInstanceOf[PythonUDF] &&
+  Set(
--- End diff --

Aha, good call.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

2018-01-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r160708246
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -511,7 +517,6 @@ def test_udf_with_order_by_and_limit(self):
 my_copy = udf(lambda x: x, IntegerType())
 df = self.spark.range(10).orderBy("id")
 res = df.select(df.id, my_copy(df.id).alias("copy")).limit(1)
-res.explain(True)
--- End diff --

Yes, I think this is removed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/20222
  
Yeah ideally that's part of the branching process. I think it could be 
documented in the "Preparing Spark for Release" section of `release-process.md` 
in `spark-website`. Open up a separate PR for that if you like, or I can do it 
if you're not set up for it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19802: [SPARK-22594][CORE] Handling spark-submit and master ver...

2018-01-10 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19802
  
I see, yeah that's a fine point. Don't rethrow then.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19802: [SPARK-22594][CORE] Handling spark-submit and master ver...

2018-01-10 Thread Jiri-Kremser
Github user Jiri-Kremser commented on the issue:

https://github.com/apache/spark/pull/19802
  
> I think it's probably OK if you also re-throw that exception for now. 
You're just giving more info then.

Hmm, In the original code the exception wasn't re-thrown. Do you really 
think it is a good idea? The `InvalidClassException` is a checked exception so 
the signature of `processOneWayMessage()` method would need to be change as 
well.

this is the original catch-them-all code :)
```java
} catch (Exception e) {
  logger.error("Error while invoking RpcHandler#receive() for one-way 
message.", e);
}
```

I think, I addressed all your inline other comments, thanks for the review.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20222
  
**[Test build #85922 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85922/testReport)**
 for PR 20222 at commit 
[`3e1d6d6`](https://github.com/apache/spark/commit/3e1d6d6ab58411491f9620ee7e568d002759ea58).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...

2018-01-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20222
  
cc @yhuai @sameeragarwal @JoshRosen @rxin 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20222: [SPARK-23028] Bump master branch version to 2.4.0...

2018-01-10 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/20222

[SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT

## What changes were proposed in this pull request?
This patch bumps the master branch version to `2.4.0-SNAPSHOT`.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark bump24

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20222.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20222


commit 932d93e1bbeecd9f8bec0b4d80e42f7e244787e6
Author: gatorsmile 
Date:   2018-01-10T14:57:34Z

bump

commit 3e1d6d6ab58411491f9620ee7e568d002759ea58
Author: gatorsmile 
Date:   2018-01-10T14:59:37Z

bump




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20219
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85916/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection

2018-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20219
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



<    1   2   3   4   5   >