[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/20223#discussion_r160991236 --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java --- @@ -71,15 +71,16 @@ public void stop() { @Override public synchronized void disconnect() { if (!disposed) { --- End diff -- `if(!isDisposed())` ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/20223#discussion_r160989352 --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java --- @@ -91,10 +92,15 @@ LauncherConnection getConnection() { return connection; } - boolean isDisposed() { + synchronized boolean isDisposed() { --- End diff -- why do we need `synchronized` here?
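The `synchronized` question above comes down to visibility: if one thread writes a shared flag and another reads it without holding the same lock, the reader may see a stale value. Below is a minimal Python sketch of that pattern (class and method names are illustrative, not Spark's actual launcher API), using `threading.Lock` where the Java code uses `synchronized`:

```python
import threading

class AppHandle:
    """Sketch of a handle whose 'disposed' flag is read and written
    under the same lock, mirroring the synchronized-accessor pattern
    discussed in the review (names are illustrative, not Spark's API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._disposed = False

    def is_disposed(self):
        # Reading under the lock guarantees this thread observes the
        # latest write from any other thread, not a stale value.
        with self._lock:
            return self._disposed

    def disconnect(self):
        # Writing under the same lock makes the check-then-set atomic,
        # so two concurrent disconnects cannot both run the cleanup.
        with self._lock:
            if not self._disposed:
                self._disposed = True
                # ... close the underlying connection here ...

handle = AppHandle()
t = threading.Thread(target=handle.disconnect)
t.start()
t.join()
print(handle.is_disposed())  # True once disconnect has run
```

This is also why the companion comment suggests `if (!isDisposed())`: routing every read through the locked accessor keeps all accesses to the flag under one lock.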
[GitHub] spark pull request #20226: [SPARK-23034][SQL][UI] Display tablename for `Hiv...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20226#discussion_r161006284 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala --- @@ -62,6 +62,8 @@ case class HiveTableScanExec( override def conf: SQLConf = sparkSession.sessionState.conf + override def nodeName: String = s"${super.nodeName} (${relation.tableMeta.qualifiedName})" --- End diff -- Our `DataSourceScanExec` uses `unquotedString` in nodeName. We need to make these leaf nodes consistent. Could you check all the other `LeafExecNode`s?
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/20203 Thanks for working on this; I'm going to try this out and do a further review. Did you test for application failures and on the history server? cc @squito since he had some comments on the JIRA.
[GitHub] spark pull request #20227: [SPARK-23035][SQL] Fix warning: TEMPORARY TABLE ....
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20227#discussion_r161003528 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala --- @@ -33,6 +33,9 @@ class TableAlreadyExistsException(db: String, table: String) class TempTableAlreadyExistsException(table: String) extends AnalysisException(s"Temporary table '$table' already exists") +class TempViewAlreadyExistsException(table: String) + extends AnalysisException(s"Temporary view '$table' already exists") --- End diff -- We do not want to introduce a new exception type. In contrast, we planned to remove all these exception types because PySpark might output a confusing error message.
[GitHub] spark issue #20206: [SPARK-19256][SQL] Remove ordering enforcement from `Fil...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/20206 cc @cloud-fan @gengliangwang for review
[GitHub] spark issue #20228: [SPARK-23036][SQL] Add withGlobalTempView for testing an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20228 **[Test build #85976 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85976/testReport)** for PR 20228 at commit [`fffd109`](https://github.com/apache/spark/commit/fffd109e8c084f9a4d63840bf761364f1ede5dc9).
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20237 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85974/ Test PASSed.
[GitHub] spark pull request #20227: [SPARK-23035][SQL] Fix warning: TEMPORARY TABLE ....
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20227#discussion_r161003157 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala --- @@ -33,6 +33,9 @@ class TableAlreadyExistsException(db: String, table: String) class TempTableAlreadyExistsException(table: String) extends AnalysisException(s"Temporary table '$table' already exists") --- End diff -- You just need to change `Temporary table` to `Temporary view`.
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20237 Merged build finished. Test PASSed.
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20237 **[Test build #85974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85974/testReport)** for PR 20237 at commit [`d2cfed3`](https://github.com/apache/spark/commit/d2cfed308d343fb55c5fd7c0d30bcbb987948632). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20228: [SPARK-23036][SQL] Add withGlobalTempView for testing an...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20228 ok to test
[GitHub] spark issue #20228: [SPARK-23036][SQL] Add withGlobalTempView for testing an...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20228 @xubo245 Is this only for tests? If so, please add the [TEST] tag to your title as well.
[GitHub] spark pull request #20194: [SPARK-22999][SQL]'show databases like command' c...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20194#discussion_r161000312 --- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 --- @@ -141,7 +141,7 @@ statement (LIKE? pattern=STRING)? #showTables | SHOW TABLE EXTENDED ((FROM | IN) db=identifier)? LIKE pattern=STRING partitionSpec? #showTable -| SHOW DATABASES (LIKE pattern=STRING)? #showDatabases +| SHOW DATABASES (LIKE? pattern=STRING)? #showDatabases --- End diff -- Impala is the only reference I can find: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/impala_show.html To be consistent with the other SHOW commands, shall we make LIKE optional?
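The grammar change above only affects parsing; the pattern itself is matched against database names afterwards. As a rough sketch of SHOW-style pattern semantics (an assumption based on Spark's documented SHOW TABLES behavior: `*` matches any character sequence, `|` separates alternatives, matching is case-insensitive), the filtering step looks something like:

```python
import re

def filter_by_pattern(names, pattern):
    # Translate each '|'-separated alternative into a regex:
    # escape everything, then turn the escaped '*' back into '.*'.
    regexes = [re.escape(p).replace(r'\*', '.*')
               for p in pattern.split('|')]
    combined = re.compile('(?:' + '|'.join(regexes) + ')$', re.IGNORECASE)
    return [n for n in names if combined.match(n)]

dbs = ['default', 'sales_db', 'sales_archive', 'hr']
print(filter_by_pattern(dbs, 'sales*'))      # ['sales_db', 'sales_archive']
print(filter_by_pattern(dbs, 'default|hr'))  # ['default', 'hr']
```

Whether `LIKE` is written or not, the same pattern string would flow into this filtering step, which is why making the keyword optional is purely a syntax-consistency question.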
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20219 `NullType` is not well supported by most data sources. We did not mention it in our doc: https://spark.apache.org/docs/latest/sql-programming-guide.html cc @cloud-fan @marmbrus @rxin @sameeragarwal Any comments about this support?
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20217 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85971/ Test PASSed.
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20217 Merged build finished. Test PASSed.
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20217 **[Test build #85971 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85971/testReport)** for PR 20217 at commit [`83776d6`](https://github.com/apache/spark/commit/83776d643c92a9354605320c20313a0380048388). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20237 @HyukjinKwon Thanks! I think this is good.
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20189 Merged build finished. Test FAILed.
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20189 @zsxwing the test you suggested fails on Jenkins while it passes locally. I am not sure about the reason, since I cannot reproduce it. I hit the same test error in my previous attempts; back then the problem was that running other test cases before this one triggered the failure. Could you please advise what to do? Should I revert to my previous test?
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20219 **[Test build #85975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85975/testReport)** for PR 20219 at commit [`27f0e14`](https://github.com/apache/spark/commit/27f0e143021a880787134cce7e672709449c828f).
[GitHub] spark issue #20233: [SPARK-23043][BUILD] Upgrade json4s to 3.5.3
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20233 **[Test build #4035 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4035/testReport)** for PR 20233 at commit [`e9cf606`](https://github.com/apache/spark/commit/e9cf6061a883abbee77ce50f5aa15416f7b86028).
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20189 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85967/ Test FAILed.
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20189 **[Test build #85967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85967/testReport)** for PR 20189 at commit [`c67e07b`](https://github.com/apache/spark/commit/c67e07b66726d57c2b7a3462b226439518d9ebed). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20237 **[Test build #85974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85974/testReport)** for PR 20237 at commit [`d2cfed3`](https://github.com/apache/spark/commit/d2cfed308d343fb55c5fd7c0d30bcbb987948632).
[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20237 Hey @gatorsmile, @ueshin, @BryanCutler and @icexelloss. Let's fix this by clarifying it to avoid potential confusion for now and clear up SPARK-22216's subtasks.
[GitHub] spark pull request #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/20237 [SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch within scalar Pandas UDF

## What changes were proposed in this pull request?

This PR proposes to add a note saying that the length of a scalar Pandas UDF's `Series` is not that of the whole input column, but of each batch. This is fine for a group map UDF, because its usage differs from our typical UDFs, but scalar UDFs might cause confusion when compared with normal UDFs. For example, consider:

```python
from pyspark.sql.functions import pandas_udf, col, lit
df = spark.range(1)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
```
```
+----------+
|(text, id)|
+----------+
|         1|
+----------+
```
```python
from pyspark.sql.functions import udf, col, lit
df = spark.range(1)
f = udf(lambda x, y: len(x) + y, "long")
df.select(f(lit('text'), col('id'))).show()
```
```
+----------+
|(text, id)|
+----------+
|         4|
+----------+
```

## How was this patch tested?

Manually built the doc and checked the output.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-22980

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20237.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20237

commit d2cfed308d343fb55c5fd7c0d30bcbb987948632
Author: hyukjinkwon
Date: 2018-01-11T15:31:05Z

    Clarify the length of each series is of each batch within scalar Pandas UDF
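The difference between the two results can be sketched without Spark at all. In a scalar Pandas UDF the function receives a whole batch of values at once, so `len(x)` is the batch length; in an ordinary UDF it receives one value per row. A plain-Python sketch (lists stand in for `pd.Series`; function names are illustrative):

```python
def row_udf(xs, ys):
    # Ordinary UDF semantics: the lambda is applied once per row,
    # so len(x) is the length of each individual string value.
    return [len(x) + y for x, y in zip(xs, ys)]

def pandas_scalar_udf(xs, ys):
    # Scalar Pandas UDF semantics: len() sees the whole batch, and
    # the scalar result broadcasts across the batch.
    n = len(xs)
    return [n + y for y in ys]

xs, ys = ['text'], [0]            # one row: the string 'text', id 0
print(row_udf(xs, ys))            # [4]  -> len('text') + 0
print(pandas_scalar_udf(xs, ys))  # [1]  -> batch length 1 + 0
```

This reproduces the 4-versus-1 discrepancy in the PR description: with a single-row batch, `len(x)` is 1 for the Pandas UDF but 4 (the length of `'text'`) for the row-at-a-time UDF.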
[GitHub] spark pull request #20219: [SPARK-23025][SQL] Support Null type in scala ref...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20219#discussion_r160990065 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala --- @@ -356,4 +356,13 @@ class ScalaReflectionSuite extends SparkFunSuite { assert(deserializerFor[Int].isInstanceOf[AssertNotNull]) assert(!deserializerFor[String].isInstanceOf[AssertNotNull]) } + + test("SPARK-23025: schemaFor shuold support Null type") { --- End diff -- should?
[GitHub] spark issue #20236: [SPARK-23044] Error handling for jira assignment
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20236 **[Test build #85973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85973/testReport)** for PR 20236 at commit [`8c4cc61`](https://github.com/apache/spark/commit/8c4cc61c2a06b310480fafb3b28067a6f961816a).
[GitHub] spark issue #20236: [SPARK-23044] Error handling for jira assignment
Github user squito commented on the issue: https://github.com/apache/spark/pull/20236 @jerryshao this should fix it, but I don't have anything to merge to test this out -- I would appreciate it if someone could try it before we merge.
[GitHub] spark pull request #20236: [SPARK-23044] Error handling for jira assignment
GitHub user squito opened a pull request: https://github.com/apache/spark/pull/20236 [SPARK-23044] Error handling for jira assignment

## What changes were proposed in this pull request?

In case the selected user isn't a contributor yet, or any other unexpected error occurs, just don't assign the JIRA.

## How was this patch tested?

Couldn't really test the error case; just some testing of similar-ish code in the Python shell. Haven't run a merge yet.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark SPARK-23044

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20236.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20236

commit 8c4cc61c2a06b310480fafb3b28067a6f961816a
Author: Imran Rashid
Date: 2018-01-11T15:42:16Z

    [SPARK-23044] Error handling for jira assignment
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20214 **[Test build #85972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85972/testReport)** for PR 20214 at commit [`66b06c3`](https://github.com/apache/spark/commit/66b06c37bd2a9ed8461f9572eb3f2081853a8d73).
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20217 **[Test build #85971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85971/testReport)** for PR 20217 at commit [`83776d6`](https://github.com/apache/spark/commit/83776d643c92a9354605320c20313a0380048388).
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20232 Merged build finished. Test FAILed.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160985677 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- After checking against our Scala API, I found that the name is not changed there. Let me update it.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160985290 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or --- End diff -- SGTM.
[GitHub] spark pull request #20220: Update PageRank.scala
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20220
[GitHub] spark pull request #20203: [SPARK-22577] [core] executor page blacklist stat...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/20203#discussion_r160985168 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -223,6 +228,15 @@ private[spark] class AppStatusListener( updateNodeBlackList(event.hostId, false) } + def updateBlackListStatusForStage(executorId: String, stageId: Int, stageAttemptId: Int): Unit = { +Option(liveStages.get((stageId, stageAttemptId))).foreach { stage => + val now = System.nanoTime() + val esummary = stage.executorSummary(executorId) + esummary.isBlacklisted = true + maybeUpdate(esummary, now) +} + } + --- End diff -- @vanzin you are more familiar with the new history server. I am wondering why updateBlacklistStatus is only done with liveUpdate? Doesn't that mean it won't show up in the history server for a finished app?
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160984479 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or --- End diff -- We can do it as a separate PR.
[GitHub] spark issue #20220: Update PageRank.scala
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20220 Merged to master
[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20080 Ping @wangyum
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160983255 --- Diff: python/pyspark/sql/context.py --- @@ -203,18 +203,46 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType()) >>> sqlContext.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" +return self.sparkSession.catalog.registerFunction(name, f, returnType) + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): --- End diff -- I am not sure about the difference between `spark.udf.registerUDF`, `sqlContext.udf.registerUDF`, and `sqlContext.registerUDF`. There seem to be too many ways to do the same thing... But if we do need to keep multiple methods, I would lean towards having comprehensive docs in one of them and having the docs for the rest be something like """ Same as :meth:... """
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20226 LGTM too
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160981094 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): --- End diff -- I am fine with it as well.
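The `hasattr(f, 'asNondeterministic')` check being reviewed is a duck-typing test: a wrapped `UserDefinedFunction` exposes `asNondeterministic`, while a plain Python function or lambda does not. A standalone sketch of how that routing works (the `WrappedUDF` class and return value are illustrative stand-ins, not PySpark's actual implementation):

```python
class WrappedUDF:
    """Illustrative stand-in for a wrapped UserDefinedFunction."""
    def __init__(self, func):
        self.func = func
        self.deterministic = True

    def asNondeterministic(self):
        # Marks the UDF as nondeterministic; the presence of this
        # method is what the hasattr check keys on.
        self.deterministic = False
        return self

def register_function(name, f):
    # Reject wrapped UDFs: they should go through registerUDF instead.
    if hasattr(f, 'asNondeterministic'):
        raise ValueError(
            "Please use registerUDF for a UserDefinedFunction; "
            "registerFunction expects a plain Python function.")
    return (name, f)  # stand-in for the real registration step

register_function("strlen", len)               # plain function: accepted
try:
    register_function("udf", WrappedUDF(len))  # wrapped UDF: rejected
except ValueError as e:
    print("rejected:", e)
```

The trade-off of duck typing over `isinstance` is that any object with an `asNondeterministic` attribute is treated as a UDF, which keeps the check working across the wrapped and native variants mentioned in the comment.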
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160980765 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- I also think changing the property (name) of the input udf object in `registerUDF` is unintuitive.
[GitHub] spark pull request #20235: [Spark-22887][ML][TESTS][WIP] ML test for Structu...
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20235#discussion_r160980529 --- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala --- @@ -34,86 +35,122 @@ class FPGrowthSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul } test("FPGrowth fit and transform with different data types") { -Array(IntegerType, StringType, ShortType, LongType, ByteType).foreach { dt => - val data = dataset.withColumn("items", col("items").cast(ArrayType(dt))) - val model = new FPGrowth().setMinSupport(0.5).fit(data) - val generatedRules = model.setMinConfidence(0.5).associationRules - val expectedRules = spark.createDataFrame(Seq( -(Array("2"), Array("1"), 1.0), -(Array("1"), Array("2"), 0.75) - )).toDF("antecedent", "consequent", "confidence") -.withColumn("antecedent", col("antecedent").cast(ArrayType(dt))) -.withColumn("consequent", col("consequent").cast(ArrayType(dt))) - assert(expectedRules.sort("antecedent").rdd.collect().sameElements( -generatedRules.sort("antecedent").rdd.collect())) - - val transformed = model.transform(data) - val expectedTransformed = spark.createDataFrame(Seq( -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "3"), Array(2)) - )).toDF("id", "items", "prediction") -.withColumn("items", col("items").cast(ArrayType(dt))) -.withColumn("prediction", col("prediction").cast(ArrayType(dt))) - assert(expectedTransformed.collect().toSet.equals( -transformed.collect().toSet)) + class DataTypeWithEncoder[A](val a: DataType) + (implicit val encoder: Encoder[(Int, Array[A], Array[A])]) --- End diff -- Done, thx.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160980191 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or --- End diff -- Yeah. I am fine with the doc here. But similar to https://github.com/apache/spark/pull/20217#discussion_r160979747 we should probably standardize the description. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20235: [Spark-22887][ML][TESTS][WIP] ML test for Structu...
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20235#discussion_r160979218 --- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala --- @@ -34,86 +35,122 @@ class FPGrowthSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul } test("FPGrowth fit and transform with different data types") { -Array(IntegerType, StringType, ShortType, LongType, ByteType).foreach { dt => - val data = dataset.withColumn("items", col("items").cast(ArrayType(dt))) - val model = new FPGrowth().setMinSupport(0.5).fit(data) - val generatedRules = model.setMinConfidence(0.5).associationRules - val expectedRules = spark.createDataFrame(Seq( -(Array("2"), Array("1"), 1.0), -(Array("1"), Array("2"), 0.75) - )).toDF("antecedent", "consequent", "confidence") -.withColumn("antecedent", col("antecedent").cast(ArrayType(dt))) -.withColumn("consequent", col("consequent").cast(ArrayType(dt))) - assert(expectedRules.sort("antecedent").rdd.collect().sameElements( -generatedRules.sort("antecedent").rdd.collect())) - - val transformed = model.transform(data) - val expectedTransformed = spark.createDataFrame(Seq( -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "3"), Array(2)) - )).toDF("id", "items", "prediction") -.withColumn("items", col("items").cast(ArrayType(dt))) -.withColumn("prediction", col("prediction").cast(ArrayType(dt))) - assert(expectedTransformed.collect().toSet.equals( -transformed.collect().toSet)) + class DataTypeWithEncoder[A](val a: DataType) + (implicit val encoder: Encoder[(Int, Array[A], Array[A])]) --- End diff -- In DataTypeWithEncoder I would suggest to rename the val "a" to "dataType". --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160979747 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or + scalar vectorized. Grouped vectorized UDFs are not supported. --- End diff -- Ok. Maybe as a follow up we can standardize the language to describe the `udf` and `pandas_udf` objects. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20072 **[Test build #85970 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85970/testReport)** for PR 20072 at commit [`6fe8589`](https://github.com/apache/spark/commit/6fe85892ca6c53b3aa965951dc7ef51df9d068e2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...
Github user ho3rexqj commented on the issue: https://github.com/apache/spark/pull/20183 Updated the above, thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20158: [PySpark] Fix typo in comments in PySpark's udf() defini...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20158 @rednaxelafx, can you fix the one in `pandas_udf` too? I'll just merge this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20204 **[Test build #85969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85969/testReport)** for PR 20204 at commit [`e8e7112`](https://github.com/apache/spark/commit/e8e71128951f64d6f70c0bac1729e33629bf7dd1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20235: [Spark-22887][ML][TESTS][WIP] ML test for Structu...
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20235#discussion_r160969767 --- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala --- @@ -34,86 +35,122 @@ class FPGrowthSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul } test("FPGrowth fit and transform with different data types") { -Array(IntegerType, StringType, ShortType, LongType, ByteType).foreach { dt => - val data = dataset.withColumn("items", col("items").cast(ArrayType(dt))) - val model = new FPGrowth().setMinSupport(0.5).fit(data) - val generatedRules = model.setMinConfidence(0.5).associationRules - val expectedRules = spark.createDataFrame(Seq( -(Array("2"), Array("1"), 1.0), -(Array("1"), Array("2"), 0.75) - )).toDF("antecedent", "consequent", "confidence") -.withColumn("antecedent", col("antecedent").cast(ArrayType(dt))) -.withColumn("consequent", col("consequent").cast(ArrayType(dt))) - assert(expectedRules.sort("antecedent").rdd.collect().sameElements( -generatedRules.sort("antecedent").rdd.collect())) - - val transformed = model.transform(data) - val expectedTransformed = spark.createDataFrame(Seq( -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "2"), Array.emptyIntArray), -(0, Array("1", "3"), Array(2)) - )).toDF("id", "items", "prediction") -.withColumn("items", col("items").cast(ArrayType(dt))) -.withColumn("prediction", col("prediction").cast(ArrayType(dt))) - assert(expectedTransformed.collect().toSet.equals( -transformed.collect().toSet)) + class DataTypeWithEncoder[A](val a: DataType) + (implicit val encoder: Encoder[(Int, Array[A], Array[A])]) --- End diff -- This class is needed for two purposes: 1. to connect data types with their corresponding DataType. Note: this information is already available in AtomicType as InternalType, but it's not accessible. Using it from this test doesn't justify making it public. 2. 
to get the proper encoder to the testTransformer method. As the datatypes are put into an array, dt is inferred to be their parent type, and implicit search is able to find the encoders only for concrete types. For a similar reason, we need to use the type of the final encoder. If we have only the encoder for A, implicit search will not be able to construct Array[A], as we have implicit encoders for Array[Int], Array[Short]... but not for a generic A that has an encoder.
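The pattern described above — bundling a data type together with the encoder that generic resolution cannot find on its own — has a rough dynamic-language analog: carry the codec alongside the type tag explicitly. A hedged Python sketch (names are invented for illustration, not the Scala test API):

```python
class DataTypeWithEncoder:
    """Pairs a concrete type tag with the encoder for values of that type."""
    def __init__(self, data_type, encoder):
        self.data_type = data_type
        self.encoder = encoder

# One entry per concrete element type, mirroring the per-type implicits
# (Array[Int], Array[Short], ...) that exist only for concrete types.
cases = [
    DataTypeWithEncoder(int, lambda xs: [int(x) for x in xs]),
    DataTypeWithEncoder(str, lambda xs: [str(x) for x in xs]),
]

for case in cases:
    encoded = case.encoder(["1", "2", "3"])
    # Each case carries exactly the encoder matching its own type tag.
    assert all(isinstance(x, case.data_type) for x in encoded)
print("all cases encoded with their paired encoder")
```

Because each loop iteration gets its encoder from the same object as its type, there is no generic-over-all-types lookup to fail — which is the essence of the Scala workaround.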
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20153 **[Test build #85968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85968/testReport)** for PR 20153 at commit [`b8a700d`](https://github.com/apache/spark/commit/b8a700d87d3708bae34054a00ad5d489280e5852).
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20153 I think `ColumnarBatchScan` is fine, `SupportsScanColumnarBatch` also has an `enableBatchRead` to fall back.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160960075 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- Because I thought it registers it to SQL statement with the given name and the defined UDF instance (`f`) shouldn't change its name itself at the register time. Up to my knowledge, we happened to have `registerUDF` to avoid such problem in `returnType`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
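The naming concern raised here — that registering under `name` should not rename the UDF object itself — can be illustrated with a tiny registry sketch in plain Python (hypothetical names, not the PySpark implementation):

```python
def make_udf(f, name):
    """Build a minimal UDF record carrying its own name (illustrative)."""
    return {"func": f, "name": name}

registry = {}

def register_udf(name, udf):
    # Register under the SQL-visible name, but leave the UDF's own
    # `name` field untouched: registration is a catalog operation,
    # not a mutation of the function object.
    registry[name] = udf
    return udf

u = make_udf(lambda s: len(s), name="strlen")
register_udf("stringLength", u)

assert "stringLength" in registry
assert u["name"] == "strlen"   # the UDF object keeps its original name
print("registered as 'stringLength'; udf still named", u["name"])
```

Keeping registration side-effect-free on the object is what makes it safe to register the same UDF under several SQL names.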
[GitHub] spark issue #20235: [Spark-22887][ML][TESTS][WIP] ML test for StructuredStre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20235 Can one of the admins verify this patch?
[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20151 Will merge this one if there isn't any objection. I believe this doesn't affect the existing code path anyway.
[GitHub] spark pull request #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test sui...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20218
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160958943 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- Why? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20234: [SPARK-19732] [Follow-up] Document behavior chang...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20234
[GitHub] spark issue #20231: [SPARK-23000][TEST-HADOOP2.6] Fix Flaky test suite DataS...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20231 I saw the test history. `DataSourceWithHiveMetastoreCatalogSuite` can still pass.
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20234 Merged to master and branch-2.3.
[GitHub] spark pull request #20235: [Spark-22887][ML][TESTS][WIP] ML test for Structu...
GitHub user smurakozi opened a pull request: https://github.com/apache/spark/pull/20235 [Spark-22887][ML][TESTS][WIP] ML test for StructuredStreaming: spark.ml.fpm ## What changes were proposed in this pull request? Converting FPGrowth tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882. Note: this is a WIP, test with Array[Byte] is not yet working due to some datatype issues (Array[Byte] vs Binary). ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/smurakozi/spark SPARK-22887 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20235.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20235 commit 331129556003bcf6e4bab6559e80e46ac0858706 Author: Sandor Murakozi Date: 2018-01-05T12:41:53Z test 'FPGrowthModel setMinConfidence should affect rules generation and transform' is converted to use testTransformer commit 93aff2c999eee4a88f7f4a3c32d6c7b601a918ac Author: Sandor Murakozi Date: 2018-01-08T13:14:38Z Test 'FPGrowth fit and transform with different data types' works with streaming, except for Byte commit 8b0b00070a21bd47537a7c3ad580e2af38a481bd Author: Sandor Murakozi Date: 2018-01-11T11:28:46Z All tests use testTransformer. Test with Array[Byte] is missing. commit af61845ab6acfa82c4411bce3ab4a20afebd0aa3 Author: Sandor Murakozi Date: 2018-01-11T11:49:27Z Unintentional changes in 93aff2c999 are reverted
[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20153#discussion_r160958116 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import java.util.List; + +import org.apache.spark.annotation.InterfaceStability; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.vectorized.ColumnarBatch; + +/** + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this + * interface to output {@link ColumnarBatch} and make the scan faster. + */ +@InterfaceStability.Evolving +public interface SupportsScanColumnarBatch extends DataSourceV2Reader { + @Override + default ListcreateReadTasks() { +throw new IllegalStateException( + "createReadTasks should not be called with SupportsScanColumnarBatch."); + } + + /** + * Similar to {@link DataSourceV2Reader#createReadTasks()}, but returns columnar data in batches. + */ + List createBatchReadTasks(); + + /** + * A safety door for columnar batch reader. 
It's possible that the implementation can only support + * some certain columns with certain types. Users can overwrite this method and + * {@link #createReadTasks()} to fallback to normal read path under some conditions. */ + default boolean enableBatchRead() { --- End diff -- Yea you can interpret it in this way (read data from columnar storage or row storage), but we can also interpret it as reading a batch of records at a time or one record at a time.
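The mix-in pattern under discussion — a default method that lets an implementation opt out of the batch path — maps naturally onto default/overridable methods. A rough Python analog (an illustrative toy interface, not Spark's actual `SupportsScanColumnarBatch`):

```python
class DataSourceReader:
    """Base reader: produces row-oriented read tasks."""
    def create_read_tasks(self):
        return ["row-task"]

class SupportsScanColumnarBatch(DataSourceReader):
    # Mix-in: batch scan by default, with an escape hatch back to rows.
    def enable_batch_read(self):
        return True

    def create_batch_read_tasks(self):
        return ["batch-task"]

    def scan(self):
        if self.enable_batch_read():
            return self.create_batch_read_tasks()
        return self.create_read_tasks()  # fall back to the normal read path

class PartialBatchReader(SupportsScanColumnarBatch):
    # A source that only supports batches for some schemas opts out here.
    def enable_batch_read(self):
        return False

print(SupportsScanColumnarBatch().scan())  # ['batch-task']
print(PartialBatchReader().scan())         # ['row-task']
```

The design point in the thread is exactly this escape hatch: the mix-in advertises the batch capability, while `enable_batch_read` lets each implementation decide per instance whether the batch path is actually usable.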
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20217 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85964/ Test PASSed.
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20217 Merged build finished. Test PASSed.
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20217 **[Test build #85964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85964/testReport)** for PR 20217 at commit [`065a21b`](https://github.com/apache/spark/commit/065a21b879bde3e3de2c506d12e8dec3ed48caab). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20223#discussion_r160957301 --- Diff: core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java --- @@ -137,7 +139,9 @@ public void testInProcessLauncher() throws Exception { // Here DAGScheduler is stopped, while SparkContext.clearActiveContext may not be called yet. // Wait for a reasonable amount of time to avoid creating two active SparkContext in JVM. --- End diff -- why is `500 ms` not a reasonable waiting time anymore?
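The flakiness behind this timeout question is the usual problem with fixed sleeps in tests: any constant is sometimes too short and always wasteful when the condition is already true. The robust idiom is "poll until the condition holds, up to a deadline" — sketched here in Python (not the actual test code):

```python
import time

def wait_for(condition, timeout=10.0, interval=0.05):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one last check at the deadline

# Hypothetical stand-in for "SparkContext.clearActiveContext was called".
state = {"context_cleared": False}
state["context_cleared"] = True

assert wait_for(lambda: state["context_cleared"], timeout=1.0)
print("condition observed without relying on a fixed 500 ms sleep")
```

A generous `timeout` then bounds only the worst case; the common case returns as soon as the condition is observed.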
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20189 **[Test build #85967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85967/testReport)** for PR 20189 at commit [`c67e07b`](https://github.com/apache/spark/commit/c67e07b66726d57c2b7a3462b226439518d9ebed).
[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20223#discussion_r160956968 --- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java --- @@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws IOException { } @Override - public void close() throws IOException { + public synchronized void close() throws IOException { if (!closed) { - synchronized (this) { --- End diff -- what's wrong with this? It's a classic "double-checked locking" in Java.
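For readers unfamiliar with the idiom being debated: double-checked locking reads the flag once without the lock (a cheap fast path) and re-checks it under the lock before acting. A minimal Python rendering of the `close()` shape in question (illustrative only; Python's GIL changes the memory-visibility story relative to Java, where the unlocked read needs `volatile` to be safe):

```python
import threading

class Connection:
    def __init__(self):
        self._closed = False
        self._lock = threading.Lock()
        self.close_count = 0  # instrumentation: how many times close ran

    def close(self):
        if not self._closed:            # first check, no lock held
            with self._lock:
                if not self._closed:    # second check, under the lock
                    self._closed = True
                    self.close_count += 1

conn = Connection()
threads = [threading.Thread(target=conn.close) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert conn.close_count == 1
print("close ran exactly once across", len(threads), "threads")
```

The second check is what makes the pattern correct: two threads may both pass the first check, but only one can win the lock while the flag is still unset.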
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20234 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85965/ Test PASSed.
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20234 Merged build finished. Test PASSed.
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20234 **[Test build #85965 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85965/testReport)** for PR 20234 at commit [`a2475ea`](https://github.com/apache/spark/commit/a2475ea5b86acee2380884db0756a833016b69a0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20199: [Spark-22967][TESTS]Fix VersionSuite's unit tests...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20199
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20222 **[Test build #85966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85966/testReport)** for PR 20222 at commit [`5eded03`](https://github.com/apache/spark/commit/5eded033a0b352e7a799c7890131d8075475c8ff).
[GitHub] spark issue #20199: [Spark-22967][TESTS]Fix VersionSuite's unit tests by cha...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20199 Merged to master and branch-2.3.
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20218 Thanks! Merged to master/2.3. Hopefully, this can fix the test failure.
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20222 retest this please
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20234 **[Test build #85965 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85965/testReport)** for PR 20234 at commit [`a2475ea`](https://github.com/apache/spark/commit/a2475ea5b86acee2380884db0756a833016b69a0).
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160953629 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- Hm .. actually I think we should not change the name of the returned UDF? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20223#discussion_r160952457 --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java --- @@ -91,10 +92,15 @@ LauncherConnection getConnection() { return connection; } - boolean isDisposed() { + synchronized boolean isDisposed() { return disposed; --- End diff -- can we simply mark `disposed` as `transient`?
[GitHub] spark pull request #20234: [SPARK-19732] [Follow-up] Document behavior chang...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20234#discussion_r160952123 --- Diff: docs/sql-programming-guide.md --- @@ -1788,12 +1788,10 @@ options. Note that, for DecimalType(38,0)*, the table above intentionally does not cover all other combinations of scales and precisions because currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 1.1 is inferred as double type. - In PySpark, now we need Pandas 0.19.2 or upper if you want to use Pandas related functionalities, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc. - In PySpark, the behavior of timestamp values for Pandas related functionalities was changed to respect session timezone. If you want to use the old behavior, you need to set a configuration `spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See [SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details. - - - Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489). - - - Since Spark 2.3, when all inputs are binary, `functions.concat()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.concatBinaryAsString` to `true`. - - - Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.eltOutputAsString` to `true`. + - In PySpark, `na.fill()` or `fillna` also accepts boolean and replaces NAs with booleans. 
In prior Spark versions, PySpark just ignores it and returns the original Dataset/DataFrame. --- End diff -- Sounds good to me.
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20217 **[Test build #85964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85964/testReport)** for PR 20217 at commit [`065a21b`](https://github.com/apache/spark/commit/065a21b879bde3e3de2c506d12e8dec3ed48caab).
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160950622 --- Diff: python/pyspark/sql/context.py --- @@ -578,6 +606,9 @@ def __init__(self, sqlContext): def register(self, name, f, returnType=StringType()): return self.sqlContext.registerFunction(name, f, returnType) +def registerUDF(self, name, f): --- End diff -- The examples in these docs are different. Thus, I prefer to keep it untouched.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160950521 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or + scalar vectorized. Grouped vectorized UDFs are not supported. 
+:return: a wrapped :class:`UserDefinedFunction` + +>>> from pyspark.sql.types import IntegerType +>>> from pyspark.sql.functions import udf +>>> slen = udf(lambda s: len(s), IntegerType()) +>>> _ = spark.udf.registerUDF("slen", slen) +>>> spark.sql("SELECT slen('test')").collect() +[Row(slen(test)=4)] >>> import random >>> from pyspark.sql.functions import udf ->>> from pyspark.sql.types import IntegerType, StringType +>>> from pyspark.sql.types import IntegerType >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic() ->>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType()) +>>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf) >>> spark.sql("SELECT random_udf()").collect() # doctest: +SKIP -[Row(random_udf()=u'82')] +[Row(random_udf()=82)] >>> spark.range(1).select(newRandom_udf()).collect() # doctest: +SKIP -[Row(random_udf()=u'62')] +[Row(random_udf()=62)] + +>>> from pyspark.sql.functions import pandas_udf, PandasUDFType +>>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP +... def add_one(x): +... return x + 1 +... +>>> _ = spark.udf.registerUDF("add_one", add_one) # doctest: +SKIP +>>> spark.sql("SELECT add_one(id) FROM range(10)").collect() # doctest: +SKIP --- End diff -- Done.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160950460 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or + scalar vectorized. Grouped vectorized UDFs are not supported. --- End diff -- I think the following examples are self-descriptive. I can emphasize it in the description too.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160950226 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF +:param f: a wrapped/native UserDefinedFunction. The UDF can be either row-at-a-time or --- End diff -- > :param udf: A function object returned by :meth:`pyspark.sql.functions.pandas_udf` This is actually not accurate. We have another way to define the scalar vectorized UDF. ``` >>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP ... def add_one(x): ... return x + 1 ... ```
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160949917 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): --- End diff -- Nope, to be clear, I am fine with it as is.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160949520 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): +raise ValueError("Please use registerUDF for registering UDF. The expected function of " + "registerFunction is a Python function (including lambda function)") +udf = UserDefinedFunction(f, returnType=returnType, name=name, + evalType=PythonEvalType.SQL_BATCHED_UDF) +self._jsparkSession.udf().registerPython(name, udf._judf) +return udf._wrapped() + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): +"""Registers a :class:`UserDefinedFunction`. The registered UDF can be used in SQL +statement. + +:param name: name of the UDF --- End diff -- Actually, `name of the UDF in SQL statement` is wrong. The name is also used as the name of the returned UDF.
[GitHub] spark pull request #20234: [SPARK-19732] [Follow-up] Document behavior chang...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20234#discussion_r160949121 --- Diff: docs/sql-programming-guide.md --- @@ -1788,12 +1788,10 @@ options. Note that, for DecimalType(38,0)*, the table above intentionally does not cover all other combinations of scales and precisions because currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 1.1 is inferred as double type. - In PySpark, now we need Pandas 0.19.2 or upper if you want to use Pandas related functionalities, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc. - In PySpark, the behavior of timestamp values for Pandas related functionalities was changed to respect session timezone. If you want to use the old behavior, you need to set a configuration `spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See [SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details. - - - Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489). - - - Since Spark 2.3, when all inputs are binary, `functions.concat()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.concatBinaryAsString` to `true`. - - - Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.eltOutputAsString` to `true`. + - In PySpark, `na.fill()` or `fillna` also accepts boolean and replaces NAs with booleans. 
In prior Spark versions, PySpark just ignores it and returns the original Dataset/DataFrame. --- End diff -- Shall we say `null` instead of `NA`? I actually think `null` is more correct.
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160949342 --- Diff: python/pyspark/sql/catalog.py --- @@ -255,26 +255,67 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = spark.udf.register("stringLengthInt", len, IntegerType()) >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" + +# This is to check whether the input function is a wrapped/native UserDefinedFunction +if hasattr(f, 'asNondeterministic'): --- End diff -- The current way is simple. Any hole?
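The `hasattr` check being questioned above can be sketched in plain Python. `looks_like_udf` and `FakeUdf` are hypothetical names used only for illustration; the real check lives inline in `registerFunction`:

```python
def looks_like_udf(f):
    # mirrors the diff: anything exposing asNondeterministic is treated
    # as an already-constructed (wrapped/native) UserDefinedFunction
    return hasattr(f, 'asNondeterministic')

# a plain function or lambda does not trip the check, so it passes
# through to registerFunction's normal path
print(looks_like_udf(len))          # False

# the potential "hole": any unrelated object that happens to define an
# attribute with that name is also classified as a UDF and rejected
class FakeUdf:
    def asNondeterministic(self):
        return self

print(looks_like_udf(FakeUdf()))    # True
```

In practice that false positive is unlikely for user code, which is presumably why the thread settles on keeping the simple duck-typed check.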
[GitHub] spark pull request #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20217#discussion_r160948952 --- Diff: python/pyspark/sql/context.py --- @@ -203,18 +203,46 @@ def registerFunction(self, name, f, returnType=StringType()): >>> _ = sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType()) >>> sqlContext.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +""" +return self.sparkSession.catalog.registerFunction(name, f, returnType) + +@ignore_unicode_prefix +@since(2.3) +def registerUDF(self, name, f): --- End diff -- The test cases are actually different. In this file, all the examples are using sqlContext.
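The delegation pattern in this diff, where `SQLContext` methods forward to the session catalog, can be sketched in plain Python. `Catalog`, `SQLContext`, and the tuple return value here are simplified stand-ins, not the real PySpark classes:

```python
class Catalog:
    """Stand-in for SparkSession.catalog, which owns the real logic."""
    def registerFunction(self, name, f, returnType=None):
        # the real method builds a UserDefinedFunction and registers it
        # with the JVM; here we just record the call
        return ("registered", name)

class SQLContext:
    """SQLContext keeps a thin compatibility API: its register methods
    do no work of their own and simply forward to the catalog."""
    def __init__(self):
        self.catalog = Catalog()

    def registerFunction(self, name, f, returnType=None):
        return self.catalog.registerFunction(name, f, returnType)

ctx = SQLContext()
print(ctx.registerFunction("slen", len))   # -> ('registered', 'slen')
```

Keeping `SQLContext` as a pass-through is what lets the doctests in `context.py` stay phrased in terms of `sqlContext` while the behavior is defined once on the catalog.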
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20234 Merged build finished. Test PASSed.
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20234 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85963/ Test PASSed.
[GitHub] spark issue #20234: [SPARK-19732] [Follow-up] Document behavior changes made...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20234 **[Test build #85963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85963/testReport)** for PR 20234 at commit [`ff30553`](https://github.com/apache/spark/commit/ff30553092a7bfe8d9aac3fc1f89b99ff679a2aa). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20218 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85959/ Test PASSed.
[GitHub] spark issue #20218: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20218 Merged build finished. Test PASSed.