date:20180429

[GitHub] spark pull request #21173: [SPARK-23856][SQL] Add an option `queryTimeout` i...

2018-04-29 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21173#discussion_r184909615
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -56,6 +56,7 @@ object JDBCRDD extends Logging {
 val conn: Connection = JdbcUtils.createConnectionFactory(options)()
 try {
   val statement = conn.prepareStatement(dialect.getSchemaQuery(table))
+  statement.setQueryTimeout(options.queryTimeout)
--- End diff --

Since `setQueryTimeout` can raise `SQLException`, we had better keep this 
inside `try` (line 60). Otherwise, `statement.close()` will not be called.
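
The concern can be illustrated outside of JDBC. The following is only a minimal sketch: `FakeStatement`, `set_query_timeout`, and the two `configure_*` helpers are hypothetical stand-ins, not Spark or JDBC APIs. It shows that a call which raises before the `try` block is entered never reaches the cleanup in `finally`.

```python
class FakeStatement:
    """Hypothetical stand-in for a JDBC PreparedStatement (not a real API)."""

    def __init__(self):
        self.closed = False

    def set_query_timeout(self, seconds):
        # Mimics setQueryTimeout raising SQLException for an invalid value.
        if seconds < 0:
            raise ValueError("timeout must be >= 0")

    def close(self):
        self.closed = True


def configure_outside_try(stmt, timeout):
    stmt.set_query_timeout(timeout)  # raises before the try block is entered
    try:
        pass  # the schema query would run here
    finally:
        stmt.close()


def configure_inside_try(stmt, timeout):
    try:
        stmt.set_query_timeout(timeout)  # a failure here is covered by finally
    finally:
        stmt.close()


def closed_after_failure(configure):
    """Run `configure` with an invalid timeout and report whether close() ran."""
    stmt = FakeStatement()
    try:
        configure(stmt, -1)
    except ValueError:
        pass
    return stmt.closed
```

With the call outside `try`, the statement stays open after the failure; with the call inside, it is closed.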


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21064: [SPARK-23976][Core] Detect length overflow in UTF8String...

2018-04-29 Thread kiszk

Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21064
  
ping @hvanhovell


---


[GitHub] spark issue #21190: [SPARK-22938][SQL][followup] Assert that SQLConf.get is ...

2018-04-29 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/21190
  
@cloud-fan. Thank you for investigating this. Could you fix [jsonExpressions.scala Line 520](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L520), too?

```
Caused by: java.lang.IllegalStateException: SQLConf should only be created and accessed on the driver.
  at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:113)
  at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:520)
```


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21021
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21021
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89975/
Test PASSed.


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21021
  
**[Test build #89975 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89975/testReport)**
 for PR 21021 at commit 
[`d1b0483`](https://github.com/apache/spark/commit/d1b048321b68216c4e1656e5b081907cdfcb8f49).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

2018-04-29 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20959


---


[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20959
  
LGTM

Merged to master.


---


[GitHub] spark pull request #21191: [MINOR][DOCS] Fix a broken link for Arrow's suppo...

2018-04-29 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21191


---


[GitHub] spark issue #21191: [MINOR][DOCS] Fix a broken link for Arrow's supported ty...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21191
  
Merged to master and branch-2.3.

Thanks @felixcheung.


---


[GitHub] spark issue #21141: [SPARK-23853][PYSPARK][TEST] Run Hive-related PySpark te...

2018-04-29 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/21141
  
Thank you for the review and approval, @HyukjinKwon.


---


[GitHub] spark issue #21177: [SPARK-24111][SQL] Add the TPCDS v2.7 (latest) queries i...

2018-04-29 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21177
  
ping @gatorsmile 


---


[GitHub] spark issue #21173: [SPARK-23856][SQL] Add an option `queryTimeout` in JDBCO...

2018-04-29 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21173
  
ok


---


[GitHub] spark issue #21141: [SPARK-23853][PYSPARK][TEST] Run Hive-related PySpark te...

2018-04-29 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/21141
  
The PR is updated now. Could you review this again, @holdenk, @HyukjinKwon, @felixcheung, @bersprockets?


---


[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21192
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21192
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89972/
Test PASSed.


---


[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21192
  
**[Test build #89972 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89972/testReport)**
 for PR 21192 at commit 
[`60d5828`](https://github.com/apache/spark/commit/60d5828df1b81b17eedf0bf5d307e4cef2f4453b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21021
  
**[Test build #89975 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89975/testReport)**
 for PR 21021 at commit 
[`d1b0483`](https://github.com/apache/spark/commit/d1b048321b68216c4e1656e5b081907cdfcb8f49).


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21021
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21021
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2749/
Test PASSed.


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread kiszk

Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21021
  
retest this please


---


[GitHub] spark issue #21141: [SPARK-23853][PYSPARK][TEST] Run Hive-related PySpark te...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21141
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89974/
Test PASSed.


---


[GitHub] spark issue #21141: [SPARK-23853][PYSPARK][TEST] Run Hive-related PySpark te...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21141
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21141: [SPARK-23853][PYSPARK][TEST] Run Hive-related PySpark te...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21141
  
**[Test build #89974 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89974/testReport)**
 for PR 21141 at commit 
[`271e152`](https://github.com/apache/spark/commit/271e15201cf24b6203e7109e56178fd22f45caf0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21192
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21192
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89973/
Test FAILed.


---


[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21192
  
**[Test build #89973 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89973/testReport)**
 for PR 21192 at commit 
[`4b3592f`](https://github.com/apache/spark/commit/4b3592f1c51185080474ba5512ea2c2c1472f902).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21141: [SPARK-23853][PYSPARK][TEST] Run Hive-related PySpark te...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21141
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2748/
Test PASSed.


---


[GitHub] spark issue #21141: [SPARK-23853][PYSPARK][TEST] Run Hive-related PySpark te...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21141
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21141: [SPARK-23853][PYSPARK][TEST] Run Hive-related PySpark te...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21141
  
**[Test build #89974 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89974/testReport)**
 for PR 21141 at commit 
[`271e152`](https://github.com/apache/spark/commit/271e15201cf24b6203e7109e56178fd22f45caf0).


---


[GitHub] spark issue #21190: [SPARK-22938][SQL][followup] Assert that SQLConf.get is ...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21190
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #21190: [SPARK-22938][SQL][followup] Assert that SQLConf.get is ...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21190
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89971/
Test FAILed.


---


[GitHub] spark issue #21190: [SPARK-22938][SQL][followup] Assert that SQLConf.get is ...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21190
  
**[Test build #89971 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89971/testReport)**
 for PR 21190 at commit 
[`fc67909`](https://github.com/apache/spark/commit/fc679098d917d226a834a8ab6d08c23dbe5bf7db).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class WidenSetOperationTypes(conf: SQLConf) extends Rule[LogicalPlan] `
  * `  case class FunctionArgumentConversion(conf: SQLConf) extends TypeCoercionRule `
  * `  case class CaseWhenCoercion(conf: SQLConf) extends TypeCoercionRule `
  * `  case class IfCoercion(conf: SQLConf) extends TypeCoercionRule `
  * `  case class ImplicitTypeCasts(conf: SQLConf) extends TypeCoercionRule `


---


[GitHub] spark pull request #21141: [SPARK-23853][PYSPARK][TEST] Run Hive-related PyS...

2018-04-29 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21141#discussion_r184894916
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3021,6 +3021,17 @@ def test_sort_with_nulls_order(self):
 
 class HiveSparkSubmitTests(SparkSubmitTests):
 
+    @classmethod
+    def setUpClass(cls):
+        import glob
+        from pyspark.find_spark_home import _find_spark_home
+
+        SPARK_HOME = _find_spark_home()
+        filename_pattern = ("sql/hive/target/spark-hive_*-sources.jar")
+        if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)):
--- End diff --

This PR will follow #20909 like the other occurrence in 
`HiveContextSQLTests` of this file.


---


[GitHub] spark pull request #21141: [SPARK-23853][PYSPARK][TEST] Run Hive-related PyS...

2018-04-29 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21141#discussion_r184894865
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3021,6 +3021,17 @@ def test_sort_with_nulls_order(self):
 
 class HiveSparkSubmitTests(SparkSubmitTests):
 
+    @classmethod
+    def setUpClass(cls):
+        import glob
+        from pyspark.find_spark_home import _find_spark_home
+
+        SPARK_HOME = _find_spark_home()
+        filename_pattern = ("sql/hive/target/spark-hive_*-sources.jar")
+        if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)):
+            raise unittest.SkipTest(
--- End diff --

Yep. According to the latest convention of #21107, it will be displayed 
like this.
```
Skipped tests in pyspark.sql.tests with python2.7:
...
test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) ... skipped 'Hive is not available.'
```
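
The skip behaviour can be reproduced with the standard library alone. This is a minimal sketch under stated assumptions: `NeedsHiveJarTests`, its `spark_home` path, and `run_suite` are hypothetical names for illustration, whereas the PR resolves the directory via `_find_spark_home()`. Raising `unittest.SkipTest` from `setUpClass` marks the whole class as skipped with the given reason.

```python
import glob
import os
import unittest


class NeedsHiveJarTests(unittest.TestCase):
    # Hypothetical location; the PR resolves SPARK_HOME via _find_spark_home().
    spark_home = "/nonexistent/spark/home"

    @classmethod
    def setUpClass(cls):
        pattern = os.path.join(
            cls.spark_home, "sql/hive/target/spark-hive_*-sources.jar")
        # No matching jar means Hive was not built, so skip the whole class.
        if not glob.glob(pattern):
            raise unittest.SkipTest("Hive is not available.")

    def test_hivecontext(self):
        self.fail("not reached when the class is skipped")


def run_suite():
    """Run the test class and return the collected unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NeedsHiveJarTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The returned result records one skip carrying the reason string, and no failures or errors.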


---


[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21192
  
**[Test build #89973 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89973/testReport)**
 for PR 21192 at commit 
[`4b3592f`](https://github.com/apache/spark/commit/4b3592f1c51185080474ba5512ea2c2c1472f902).


---


[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21192
  
Can one of the admins verify this patch?


---


[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21192
  
**[Test build #89972 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89972/testReport)**
 for PR 21192 at commit 
[`60d5828`](https://github.com/apache/spark/commit/60d5828df1b81b17eedf0bf5d307e4cef2f4453b).


---


[GitHub] spark pull request #21192: [SPARK-24118][SQL] Flexible format for the lineSe...

2018-04-29 Thread MaxGekk

GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/21192

[SPARK-24118][SQL] Flexible format for the lineSep option of Text and JSON 
datasources

## What changes were proposed in this pull request?

I propose a flexible format for the **lineSep** option used in text 
datasources such as JSON. The new format of the option has the following syntax:

```
lineSep ::= (selector separator-spec) | text-separator
selector ::= 'x' | '\' | reserved-selector
reserved-selector ::= '\' | 'r'
separator-spec ::= < valid string literal in Python, R, Java and Scala>
text-separator ::= first-char separator-spec
first-char ::= ! selector
```

Examples of lineSep in the new format:

```
x0a.00.00.00 0d.00.00.00
x5445
|^|
\r\n
-
sep
```
The `'\'` and `'r'` are reserved for future usage. For instance, `'r'` 
could be used for regular expressions like `r[0-9]+` or `r(x1E|x0Ax1E|x0A)` for 
parsing [JSON Streaming](https://en.wikipedia.org/wiki/JSON_streaming).

The new format addresses the following use cases:

1. The hexadecimal format allows `lineSep` to be specified independently of 
the encoding. This makes it possible to read JSON files with a BOM in per-line 
mode. See https://github.com/apache/spark/pull/20849#issuecomment-377501993

2. JSON produced by embedded systems often uses non-standard separators 
(invisible in some cases). It is very convenient to open a file in a hex editor 
and copy the bytes between }{ into the lineSep option. This is the use case for 
the format with the `'x'` selector, for example `x0d 54 45`.

3. In JSON Streaming, records can be separated in quite different ways. 
I believe we should leave room for improvement. See the `'r'` (for regexp) and 
`'/'` reserved selectors.

4. Some UTF-8 chars can cause errors from style (format) checkers. It is 
easier to represent such chars in hexadecimal format than to disable the 
checkers.

5. In the near future, the JSON datasource will support input JSON in different 
charsets. If the source code is in UTF-8 but the input JSON is in a different 
charset, it is slightly hard to put such chars into the lineSep option. The 
`x` format is more convenient here as well.
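
To make the `'x'` selector concrete, here is a minimal Python sketch rather than the Scala implementation from this PR. The function name `parse_hex_line_sep` is hypothetical, and it assumes that dots and spaces are ignorable separators between hex digit pairs, which the examples above suggest but the description does not state. The text-separator form and the reserved selectors are not modelled.

```python
def parse_hex_line_sep(spec: str) -> bytes:
    """Decode a lineSep value that uses the 'x' (hexadecimal) selector.

    Hypothetical helper: assumes '.' and ' ' merely group the hex digits.
    """
    if not spec.startswith("x"):
        raise ValueError("expected the 'x' selector")
    # Drop the selector, then strip the assumed grouping characters.
    digits = "".join(ch for ch in spec[1:] if ch not in ". ")
    # bytes.fromhex raises ValueError on odd-length or non-hex input.
    return bytes.fromhex(digits)
```

Under these assumptions, `x0a.00.00.00` decodes to the same four bytes as a newline encoded in little-endian UTF-32, which is the encoding-independent behaviour described in use case 1.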


## How was this patch tested?

The changes are checked by 2 new tests in which JSON files in `UTF-16` and 
`UTF-32` with BOM are read. Also 2 new cases for an existing test are added.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 json-flexible-line-sep2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21192.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21192


commit 60d5828df1b81b17eedf0bf5d307e4cef2f4453b
Author: Maxim Gekk 
Date:   2018-04-29T19:33:45Z

Flexible format of the lineSep option




---


[GitHub] spark issue #21141: [SPARK-23853][PYSPARK][TEST] Run Hive-related PySpark te...

2018-04-29 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/21141
  
I see. Thanks, @bersprockets. I'll proceed with this PR according to your and 
other people's comments.


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21021
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21021
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89969/
Test FAILed.


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21021
  
**[Test build #89969 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89969/testReport)**
 for PR 21021 at commit 
[`d1b0483`](https://github.com/apache/spark/commit/d1b048321b68216c4e1656e5b081907cdfcb8f49).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21190: [SPARK-22938][SQL][followup] Assert that SQLConf.get is ...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21190
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2747/
Test PASSed.


---


[GitHub] spark issue #21190: [SPARK-22938][SQL][followup] Assert that SQLConf.get is ...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21190
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21190: [SPARK-22938][SQL][followup] Assert that SQLConf.get is ...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21190
  
**[Test build #89971 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89971/testReport)**
 for PR 21190 at commit 
[`fc67909`](https://github.com/apache/spark/commit/fc679098d917d226a834a8ab6d08c23dbe5bf7db).


---


[GitHub] spark issue #21190: [SPARK-22938][SQL][followup] Assert that SQLConf.get is ...

2018-04-29 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/21190
  
Retest this please.


---


[GitHub] spark issue #21186: [SPARK-22279][SPARK-24112] Enable `convertMetastoreOrc` ...

2018-04-29 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/21186
  
@gatorsmile and @cloud-fan .
Could you review this PR? This is the first attempt since the revert (#20536).


---


[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21119
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89970/
Test PASSed.


---


[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21119
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21119
  
**[Test build #89970 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89970/testReport)**
 for PR 21119 at commit 
[`ae9f953`](https://github.com/apache/spark/commit/ae9f953d4a06228b6bf7b6867f031a1bfc84d1e2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...

2018-04-29 Thread superbobry

Github user superbobry commented on a diff in the pull request:

https://github.com/apache/spark/pull/21180#discussion_r184889165
  
--- Diff: python/pyspark/serializers.py ---
@@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs):
         for k, v in _old_namedtuple_kwdefaults.items():
             kwargs[k] = kwargs.get(k, v)
         cls = _old_namedtuple(*args, **kwargs)
-        return _hack_namedtuple(cls)
+
+        import sys
+        f = sys._getframe(1)
--- End diff --

Another way to fix the module name is to get rid of the extra stack frame. 
This can be done by e.g. modifying the bytecode of `collections.namedtuple` so 
that it redirects to `_hack_namedtuple` when needed. However, I think that the 
current implementation (despite being magical) is better than bytecode hacking, 
since it is only as magical as `collections.namedtuple` itself.

I can change the PR to do the same as `collections.namedtuple` for 
cross-interpreter compatibility, wdyt?
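
For readers unfamiliar with the frame trick, here is a minimal sketch of the idea. It is not the code from this PR: `wrapped_namedtuple` is a hypothetical wrapper, and the `try`/`except` assumes, as CPython's `collections.namedtuple` does, that `sys._getframe` may be unavailable on other interpreters. A wrapper adds a stack frame, so the class is attributed to the wrapper's module unless the wrapper looks one frame further up itself.

```python
import collections
import sys


def wrapped_namedtuple(*args, **kwargs):
    """Hypothetical wrapper, similar in spirit to PySpark's patched namedtuple."""
    cls = collections.namedtuple(*args, **kwargs)
    # collections.namedtuple looked one frame up and found this wrapper's
    # module, so look one frame up from here to find the real caller instead.
    try:
        cls.__module__ = sys._getframe(1).f_globals.get("__name__", "__main__")
    except (AttributeError, ValueError):
        pass  # interpreters without sys._getframe keep the default module
    return cls
```

When called from code whose globals carry a different `__name__`, the resulting class reports that caller's module name rather than the module that defines the wrapper.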


---


[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21119
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2746/
Test PASSed.


---


[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21119
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21119
  
**[Test build #89970 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89970/testReport)**
 for PR 21119 at commit 
[`ae9f953`](https://github.com/apache/spark/commit/ae9f953d4a06228b6bf7b6867f031a1bfc84d1e2).


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21021
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2745/
Test PASSed.


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21021
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21021
  
**[Test build #89969 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89969/testReport)**
 for PR 21021 at commit 
[`d1b0483`](https://github.com/apache/spark/commit/d1b048321b68216c4e1656e5b081907cdfcb8f49).


---


[GitHub] spark pull request #21175: [SPARK-24107][CORE] ChunkedByteBuffer.writeFully ...

2018-04-29 Thread xuanyuanking

Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21175#discussion_r184882338
  
--- Diff: 
core/src/test/scala/org/apache/spark/io/ChunkedByteBufferSuite.scala ---
@@ -20,12 +20,12 @@ package org.apache.spark.io
 import java.nio.ByteBuffer
 
 import com.google.common.io.ByteStreams
-
-import org.apache.spark.SparkFunSuite
+import org.apache.spark.{SparkFunSuite, SharedSparkContext}
--- End diff --

Move `SharedSparkContext` before `SparkFunSuite`.


---


[GitHub] spark pull request #21175: [SPARK-24107][CORE] ChunkedByteBuffer.writeFully ...

2018-04-29 Thread xuanyuanking

Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21175#discussion_r184882396
  
--- Diff: 
core/src/test/scala/org/apache/spark/io/ChunkedByteBufferSuite.scala ---
@@ -20,12 +20,12 @@ package org.apache.spark.io
 import java.nio.ByteBuffer
 
 import com.google.common.io.ByteStreams
--- End diff --

Add an empty line after line 22 to separate the Spark and third-party import groups.


---


[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20959
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89967/
Test PASSed.


---


[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20959
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20959
  
**[Test build #89967 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89967/testReport)**
 for PR 20959 at commit 
[`d4d9d65`](https://github.com/apache/spark/commit/d4d9d65ce28c4176c085449564c8e5f8ec0b3ff7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21133: [SPARK-24013][SQL] Remove unneeded compress in Approxima...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21133
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89966/
Test PASSed.


---


[GitHub] spark issue #21133: [SPARK-24013][SQL] Remove unneeded compress in Approxima...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21133
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21133: [SPARK-24013][SQL] Remove unneeded compress in Approxima...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21133
  
**[Test build #89966 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89966/testReport)**
 for PR 21133 at commit 
[`d47d9bd`](https://github.com/apache/spark/commit/d47d9bdf564909a6fc8d3bd67cf696b7c1cf0d4b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-29 Thread iddoav

Github user iddoav commented on the issue:

https://github.com/apache/spark/pull/21070
  
Our R&D at SimilarWeb is having a hard time with PARQUET-686, and merging
this PR would help us a lot. Note that, unlike Spark 2.1+ readers, which have
read-time mitigations (SPARK-17213 et al.), other systems such as CDH5.X's
Spark and AWS Athena (and probably Presto) do predicate pushdown on Spark 2.3
Parquet outputs and return wrong answers when string columns are involved.
@gatorsmile @rdblue 
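
To illustrate how mis-ordered min/max statistics can make predicate pushdown drop matching rows, here is a minimal sketch. It assumes the failure mode is the one tracked in PARQUET-686, namely binary statistics computed with signed byte comparison while the reader compares bytes as unsigned. The helper names (`signed_key`, `unsigned_key`, `min_max`, `may_contain`) are hypothetical and are not Parquet or Spark APIs.

```python
def unsigned_key(raw):
    # Order bytes as values 0..255.
    return list(raw)


def signed_key(raw):
    # Order bytes as signed values -128..127 (the assumed writer behavior).
    return [b - 256 if b >= 128 else b for b in raw]


def min_max(values, key):
    """Return (min, max) of the UTF-8 encoded values under the given ordering."""
    encoded = [v.encode("utf-8") for v in values]
    return min(encoded, key=key), max(encoded, key=key)


def may_contain(stats, target, key):
    """Pushdown check: could a row group with these stats hold `target`?"""
    low, high = stats
    probe = key(target.encode("utf-8"))
    return key(low) <= probe <= key(high)


# "\u00e9" encodes to UTF-8 bytes that are all >= 0x80.
row_group = ["a", "z", "\u00e9"]
signed_stats = min_max(row_group, signed_key)
unsigned_stats = min_max(row_group, unsigned_key)

# A reader that assumes unsigned ordering but is handed signed statistics
# concludes the row group cannot contain a value that is actually present.
skipped_wrongly = not may_contain(signed_stats, "\u00e9", unsigned_key)
kept_correctly = may_contain(unsigned_stats, "\u00e9", unsigned_key)
```

With plain ASCII strings both orderings agree, which would be consistent with the problem only surfacing on string columns that contain non-ASCII data.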


---


[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r184875393
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -268,3 +269,38 @@ object PhysicalAggregation {
 case _ => None
   }
 }
+
+/**
+ * An extractor used when planning physical execution of a window. This extractor outputs
+ * the window function type of the logical window.
+ *
+ * The input logical window must contain same type of window functions, which is ensured by
+ * the rule ExtractWindowExpressions in the analyzer.
+ */
+object PhysicalWindow {
+  // windowFunctionType, windowExpression, partitionSpec, orderSpec, child
+  type ReturnType =
+(WindowFunctionType, Seq[NamedExpression], Seq[Expression], Seq[SortOrder], LogicalPlan)
+
+  def unapply(a: Any): Option[ReturnType] = a match {
+case expr @ logical.Window(windowExpressions, partitionSpec, orderSpec, child) =>
+
+  if (windowExpressions.isEmpty) {
+throw new AnalysisException(s"Window expression is empty in $expr")
+  }
+
+  val windowFunctionType = windowExpressions.map(WindowFunctionType.functionType)
+.reduceLeft ( (t1: WindowFunctionType, t2: WindowFunctionType) =>
--- End diff --

(BTW: 
```
.reduceLeft {
  ...
}
```
)


---


[GitHub] spark issue #21191: [MINOR][DOCS] Fix a broken link for Arrow's supported ty...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21191
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89968/
Test PASSed.


---


[GitHub] spark issue #21191: [MINOR][DOCS] Fix a broken link for Arrow's supported ty...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21191
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21191: [MINOR][DOCS] Fix a broken link for Arrow's supported ty...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21191
  
**[Test build #89968 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89968/testReport)**
 for PR 21191 at commit 
[`b825d8a`](https://github.com/apache/spark/commit/b825d8a0439d6dce6898dbe4e50377e277f35a3f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21082
  
Sorry @icexelloss, I didn't manage to take a close look this weekend, but it
seems fine in general. I will try again in the coming days.

BTW, I think we should cc @hvanhovell too.


---


[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r184875163
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -5181,6 +5190,236 @@ def test_invalid_args(self):
 'mixture.*aggregate function.*group aggregate pandas UDF'):
 df.groupby(df.id).agg(mean_udf(df.v), mean(df.v)).collect()
 
+
+@unittest.skipIf(
+not _have_pandas or not _have_pyarrow,
+_pandas_requirement_message or _pyarrow_requirement_message)
+class WindowPandasUDFTests(ReusedSQLTestCase):
+@property
+def data(self):
+from pyspark.sql.functions import array, explode, col, lit
+return self.spark.range(10).toDF('id') \
+.withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
+.withColumn("v", explode(col('vs'))) \
+.drop('vs') \
+.withColumn('w', lit(1.0))
+
+@property
+def python_plus_one(self):
+from pyspark.sql.functions import udf
+return udf(lambda v: v + 1, 'double')
+
+@property
+def pandas_scalar_time_two(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+return pandas_udf(lambda v: v * 2, 'double')
+
+@property
+def pandas_agg_mean_udf(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+@pandas_udf('double', PandasUDFType.GROUPED_AGG)
+def avg(v):
+return v.mean()
+return avg
+
+@property
+def pandas_agg_max_udf(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+@pandas_udf('double', PandasUDFType.GROUPED_AGG)
+def max(v):
+return v.max()
+return max
+
+@property
+def pandas_agg_min_udf(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+@pandas_udf('double', PandasUDFType.GROUPED_AGG)
+def min(v):
+return v.min()
+return min
+
+@property
+def unbounded_window(self):
+return Window.partitionBy('id') \
+.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
+
+@property
+def ordered_window(self):
+return Window.partitionBy('id').orderBy('v')
+
+@property
+def unpartitioned_window(self):
+return Window.partitionBy()
+
+def test_simple(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType, percent_rank, mean, max
+
+df = self.data
+w = self.unbounded_window
+
+mean_udf = self.pandas_agg_mean_udf
+
+result1 = df.withColumn('mean_v', mean_udf(df['v']).over(w))
+expected1 = df.withColumn('mean_v', mean(df['v']).over(w))
+
+result2 = df.select(mean_udf(df['v']).over(w))
+expected2 = df.select(mean(df['v']).over(w))
+
+self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
+
+def test_multiple_udfs(self):
+from pyspark.sql.functions import max, min, mean
+
+df = self.data
+w = self.unbounded_window
+
+result1 = df.withColumn('mean_v', self.pandas_agg_mean_udf(df['v']).over(w)) \
+.withColumn('max_v', self.pandas_agg_max_udf(df['v']).over(w)) \
+.withColumn('min_w', self.pandas_agg_min_udf(df['w']).over(w)) \
+
+expected1 = df.withColumn('mean_v', mean(df['v']).over(w)) \
+  .withColumn('max_v', max(df['v']).over(w)) \
+  .withColumn('min_w', min(df['w']).over(w))
+
+self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+
+def test_replace_existing(self):
+from pyspark.sql.functions import mean
+
+df = self.data
+w = self.unbounded_window
+
+result1 = df.withColumn('v', self.pandas_agg_mean_udf(df['v']).over(w))
+expected1 = df.withColumn('v', mean(df['v']).over(w))
+
+self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+
+def test_mixed_sql(self):
+from pyspark.sql.functions import mean
+
+df = self.data
+w = self.unbounded_window
+mean_udf = self.pandas_agg_mean_udf
+
+result1 = df.withColumn('v', mean_udf(df['v'] * 2).over(w) + 1)
+expected1 = df.withColumn('v', mean(df['v'] * 2).over(w) + 1)
+
+self.assertPandasEqual(expected1.toPandas(), result1.toPandas())

[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r184875140
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
 ---
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import java.io.File
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.{SparkEnv, TaskContext}
+import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.sql.types.{DataType, StructField, StructType}
+import org.apache.spark.util.Utils
+
+case class WindowInPandasExec(
+windowExpression: Seq[NamedExpression],
+partitionSpec: Seq[Expression],
+orderSpec: Seq[SortOrder],
+child: SparkPlan) extends UnaryExecNode {
+
+  override def output: Seq[Attribute] =
+child.output ++ windowExpression.map(_.toAttribute)
+
+  override def requiredChildDistribution: Seq[Distribution] = {
+if (partitionSpec.isEmpty) {
+  // Only show warning when the number of bytes is larger than 100 MB?
+  logWarning("No Partition Defined for Window operation! Moving all data to a single "
++ "partition, this can cause serious performance degradation.")
+  AllTuples :: Nil
+} else ClusteredDistribution(partitionSpec) :: Nil
+  }
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
+Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+
+  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
+udf.children match {
+  case Seq(u: PythonUDF) =>
+val (chained, children) = collectFunctions(u)
+(ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
+  case children =>
+// There should not be any other UDFs, or the children can't be evaluated directly.
+assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
+(ChainedPythonFunctions(Seq(udf.func)), udf.children)
+}
+  }
+
+  /**
+   * Create the resulting projection.
+   *
+   * This method uses Code Generation. It can only be used on the executor side.
+   *
+   * @param expressions unbound ordered function expressions.
+   * @return the final resulting projection.
+   */
+  private[this] def createResultProjection(expressions: Seq[Expression]): UnsafeProjection = {
+val references = expressions.zipWithIndex.map{ case (e, i) =>
+  // Results of window expressions will be on the right side of child's output
+  BoundReference(child.output.size + i, e.dataType, e.nullable)
+}
+val unboundToRefMap = expressions.zip(references).toMap
+val patchedWindowExpression = windowExpression.map(_.transform(unboundToRefMap))
+UnsafeProjection.create(
+  child.output ++ patchedWindowExpression,
+  child.output)
+  }
+
+  protected override def doExecute(): RDD[InternalRow] = {
+val inputRDD = child.execute()
+
+val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
+val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
+val sessionLocalTimeZone = conf.sessionLocalTimeZone
+val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
+
+// Extract window expressions and window

[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r184875120
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
 ---

[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r184875102
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
 ---

[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r184875086
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
 ---
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import java.io.File
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.{SparkEnv, TaskContext}
+import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.sql.types.{DataType, StructField, StructType}
+import org.apache.spark.util.Utils
+
+case class WindowInPandasExec(
+windowExpression: Seq[NamedExpression],
+partitionSpec: Seq[Expression],
+orderSpec: Seq[SortOrder],
+child: SparkPlan) extends UnaryExecNode {
+
+  override def output: Seq[Attribute] =
+child.output ++ windowExpression.map(_.toAttribute)
+
+  override def requiredChildDistribution: Seq[Distribution] = {
+if (partitionSpec.isEmpty) {
+  // Only show warning when the number of bytes is larger than 100 MB?
+  logWarning("No Partition Defined for Window operation! Moving all data to a single "
++ "partition, this can cause serious performance degradation.")
+  AllTuples :: Nil
+} else ClusteredDistribution(partitionSpec) :: Nil
--- End diff --

nit: I would do 

```
else {

}
```


---


[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r184875053
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -424,6 +424,21 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
 }
   }
 
+  object Window extends Strategy {
+def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
+  case PhysicalWindow(
+  WindowFunctionType.SQL, windowExprs, partitionSpec, orderSpec, child) =>
--- End diff --

nit: indent


---


[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r184875039
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -624,7 +624,9 @@ object CollapseRepartition extends Rule[LogicalPlan] {
 object CollapseWindow extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
 case w1 @ Window(we1, ps1, os1, w2 @ Window(we2, ps2, os2, grandChild))
-if ps1 == ps2 && os1 == os2 && w1.references.intersect(w2.windowOutputSet).isEmpty =>
+if ps1 == ps2 && os1 == os2 && w1.references.intersect(w2.windowOutputSet).isEmpty
+ && WindowFunctionType.functionType(we1.head) ==
--- End diff --

nit: indent


---


[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r184875000
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -112,12 +113,19 @@ trait CheckAnalysis extends PredicateHelper {
 failAnalysis("An offset window function can only be evaluated in an ordered " +
   s"row-based window frame with a single offset: $w")
 
+  case w @ WindowExpression(_: PythonUDF,
+  WindowSpecDefinition(_, _, frame: SpecifiedWindowFrame))
--- End diff --

indentation :-)


---


[GitHub] spark issue #21191: [MINOR][DOCS] Fix a broken link for Arrow's supported ty...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21191
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21191: [MINOR][DOCS] Fix a broken link for Arrow's supported ty...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21191
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2744/
Test PASSed.


---


[GitHub] spark issue #21191: [MINOR][DOCS] Fix a broken link for Arrow's supported ty...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21191
  
**[Test build #89968 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89968/testReport)**
 for PR 21191 at commit 
[`b825d8a`](https://github.com/apache/spark/commit/b825d8a0439d6dce6898dbe4e50377e277f35a3f).


---


[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r184874826
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2301,10 +2301,12 @@ def pandas_udf(f=None, returnType=None, functionType=None):
The returned scalar can be either a python primitive type, e.g., `int` or `float`
or a numpy data type, e.g., `numpy.int64` or `numpy.float64`.
 
-   :class:`ArrayType`, :class:`MapType` and :class:`StructType` are currently not supported as
-   output types.
+   :class:`MapType` and :class:`StructType` are currently not supported as output types.
--- End diff --

@icexelloss, should we actually keep this note? I think it matches what we 
already documented in SQLConf and at 
https://spark.apache.org/docs/latest/sql-programming-guide.html#supported-sql-types

Just leaving a link would probably be fine. Removing it is okay with me too.


---


[GitHub] spark issue #21191: [MINOR][DOCS] Fix a broken link for Arrow's supported ty...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21191
  
I manually went through all the other links on this page and confirmed that 
they work (except the API links, since I built the docs with `SKIP_API=1 jekyll watch`).
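
The manual check described above could in principle be partly automated. The following is only a minimal sketch, not something this PR uses: it scans one HTML page for in-page `#fragment` links whose target `id` is missing. The names `AnchorCollector`, `broken_fragments`, and `SAMPLE_PAGE`, as well as the stale fragment `supported-types`, are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects link targets and in-page `#fragment` links from HTML."""

    def __init__(self):
        super().__init__()
        self.targets = set()    # values of id="..." and <a name="...">
        self.fragments = set()  # fragments used by href="#..."

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.targets.add(attrs["id"])
        if tag == "a" and attrs.get("name"):
            self.targets.add(attrs["name"])
        href = attrs.get("href") or ""
        if tag == "a" and href.startswith("#") and len(href) > 1:
            self.fragments.add(href[1:])


def broken_fragments(html_text):
    """Return the sorted in-page fragments that have no matching target."""
    collector = AnchorCollector()
    collector.feed(html_text)
    return sorted(collector.fragments - collector.targets)


# Hypothetical page: one link points at a section id that exists,
# the other at an id that does not (the stale fragment is made up).
SAMPLE_PAGE = """
<h2 id="supported-sql-types">Supported SQL Types</h2>
<a href="#supported-sql-types">good link</a>
<a href="#supported-types">stale link</a>
"""

print(broken_fragments(SAMPLE_PAGE))
```

A check like this only covers anchors within a single page; links to other pages or to the API docs (skipped here by `SKIP_API=1`) would still need a separate check.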


---


[GitHub] spark pull request #21191: [MINOR][DOCS] Fix a broken link for Arrow's suppo...

2018-04-29 Thread HyukjinKwon

GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/21191

[MINOR][DOCS] Fix a broken link for Arrow's supported types in the 
programming guide

## What changes were proposed in this pull request?

This PR fixes a broken link for Arrow's supported types in the programming 
guide.

## How was this patch tested?

Manually tested via `SKIP_API=1 jekyll watch`.

The "Supported SQL Types" link in 
https://spark.apache.org/docs/latest/sql-programming-guide.html#enabling-for-conversion-tofrom-pandas
 is broken. It should point to 
https://spark.apache.org/docs/latest/sql-programming-guide.html#supported-sql-types

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark minor-arrow-link

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21191.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21191


commit b825d8a0439d6dce6898dbe4e50377e277f35a3f
Author: hyukjinkwon 
Date:   2018-04-29T08:07:24Z

Fix a broken link for Arrow's supported types in the programming guide




---


[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-29 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184874390
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,205 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+@inherit_doc
+class PowerIterationClustering(HasMaxIter, HasPredictionCol, JavaTransformer, JavaParams,
+   JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
+<a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
+PIC finds a very low-dimensional embedding of a dataset using truncated power
+iteration on a normalized pair-wise similarity matrix of the data.
+
+PIC takes an affinity matrix between items (or vertices) as input.  An affinity matrix
+is a symmetric matrix whose entries are non-negative similarities between items.
+PIC takes this matrix (or graph) as an adjacency matrix.  Specifically, each input row
+includes:
+
+ - :py:attr:`idCol`: vertex ID
+ - :py:attr:`neighborsCol`: neighbors of vertex in :py:attr:`idCol`
+ - :py:attr:`similaritiesCol`: non-negative weights (similarities) of edges between the
+vertex in :py:attr:`idCol` and each neighbor in :py:attr:`neighborsCol`
+
+PIC returns a cluster assignment for each input vertex.  It appends a new column
+:py:attr:`predictionCol` containing the cluster assignment in :py:attr:`[0,k)` for
+each row (vertex).
+
+.. note::
+
+ - [[PowerIterationClustering]] is a transformer with an expensive [[transform]] operation.
+Transform runs the iterative PIC algorithm to cluster the whole input dataset.
+ - Input validation: This validates that similarities are non-negative but does NOT validate
+that the input matrix is symmetric.
+
+.. seealso:: <a href=http://en.wikipedia.org/wiki/Spectral_clustering>
+Spectral clustering (Wikipedia)</a>
--- End diff --

You can check other places using `seealso`:

```python
.. seealso:: `Spectral clustering \
<http://en.wikipedia.org/wiki/Spectral_clustering>`_
```
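
To illustrate how the `seealso` form suggested above behaves inside a Python docstring, here is a small sketch. The class name `PicDocExample` is a hypothetical stand-in, and the regular expression is only a rough way to pull out the link text and target for inspection; it is not how Sphinx parses reStructuredText.

```python
import re


class PicDocExample(object):
    """
    Hypothetical stand-in class; only the docstring markup matters here.

    .. seealso:: `Spectral clustering \
    <http://en.wikipedia.org/wiki/Spectral_clustering>`_
    """


# Inside a regular (non-raw) string literal, a backslash followed by a newline
# is a line continuation, so the link text and its target end up on one
# logical line of the docstring.
doc = PicDocExample.__doc__
match = re.search(r"`([^`<]+?)\s*<([^>]+)>`_", doc)
link_text, link_target = match.group(1), match.group(2)
print(link_text)
print(link_target)
```

The backslash keeps each source line short for the style checker while leaving the reStructuredText link as a single inline hyperlink reference.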


---


[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20959
  
**[Test build #89967 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89967/testReport)**
 for PR 20959 at commit 
[`d4d9d65`](https://github.com/apache/spark/commit/d4d9d65ce28c4176c085449564c8e5f8ec0b3ff7).


---


[GitHub] spark issue #21133: [SPARK-24013][SQL] Remove unneeded compress in Approxima...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21133
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2743/
Test PASSed.


---


[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-29 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20959
  
retest this please


---


[GitHub] spark issue #21133: [SPARK-24013][SQL] Remove unneeded compress in Approxima...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21133
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21133: [SPARK-24013][SQL] Remove unneeded compress in Approxima...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21133
  
**[Test build #89966 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89966/testReport)**
 for PR 21133 at commit 
[`d47d9bd`](https://github.com/apache/spark/commit/d47d9bdf564909a6fc8d3bd67cf696b7c1cf0d4b).


---


[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20959
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20959
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89964/
Test FAILed.


---


[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20959
  
**[Test build #89964 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89964/testReport)**
 for PR 20959 at commit 
[`d4d9d65`](https://github.com/apache/spark/commit/d4d9d65ce28c4176c085449564c8e5f8ec0b3ff7).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21119
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21119
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89965/
Test FAILed.


---


[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21119
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

2018-04-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21119
  
**[Test build #89965 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89965/testReport)**
 for PR 21119 at commit 
[`c25d3dc`](https://github.com/apache/spark/commit/c25d3dcb11eff13bfe1092e1dc64c035335b852b).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

2018-04-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21119
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2742/
Test PASSed.


---

