[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19439
  
**[Test build #82481 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82481/testReport)**
 for PR 19439 at commit 
[`0e47b6c`](https://github.com/apache/spark/commit/0e47b6c906afa1589bcb3ee9af87b4833f90be64).


---




[GitHub] spark issue #19392: [SPARK-22169][SQL] support byte length literal as identi...

2017-10-05 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19392
  
OK


---




[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...

2017-10-05 Thread shubhamchopra
Github user shubhamchopra commented on the issue:

https://github.com/apache/spark/pull/17673
  
Thanks for your comments/suggestions @MLnick and @sethah . Working on 
incorporating these.


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19439
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19439
  
**[Test build #82480 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82480/testReport)**
 for PR 19439 at commit 
[`22baf02`](https://github.com/apache/spark/commit/22baf022b2f109bb1f5eba0b13ea34de894cd14c).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class SamplePathFilter extends Configured with PathFilter `


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19439
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82480/
Test FAILed.


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19439
  
**[Test build #82480 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82480/testReport)**
 for PR 19439 at commit 
[`22baf02`](https://github.com/apache/spark/commit/22baf022b2f109bb1f5eba0b13ea34de894cd14c).


---




[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...

2017-10-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19438#discussion_r143001025
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -2538,7 +2538,7 @@ test_that("describe() and summary() on a DataFrame", {
 
   stats2 <- summary(df)
   expect_equal(collect(stats2)[5, "summary"], "25%")
-  expect_equal(collect(stats2)[5, "age"], "30")
+  expect_equal(collect(stats2)[5, "age"], "19")
--- End diff --

Also looks more logical given the input contains values 19 and 30 only.


---




[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...

2017-10-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19438#discussion_r143000567
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -43,7 +43,7 @@ class ImputerSuite extends SparkFunSuite with 
MLlibTestSparkContext with Default
   (0, 1.0, 1.0, 1.0),
   (1, 3.0, 3.0, 3.0),
   (2, Double.NaN, Double.NaN, Double.NaN),
-  (3, -1.0, 2.0, 3.0)
+  (3, -1.0, 2.0, 1.0)
--- End diff --

Did this have to change as a result? Just checking that it's intentional.


---




[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...

2017-10-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19438#discussion_r142999631
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala
 ---
@@ -129,7 +144,7 @@ class ApproximatePercentileQuerySuite extends QueryTest 
with SharedSQLContext {
 withTempView(table) {
   (1 to 1000).toDF("col").createOrReplaceTempView(table)
   checkAnswer(
-spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 
+ 800D) FROM $table"),
+spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 
+ 8000D) FROM $table"),
--- End diff --

I recall that without the change the answer was "499", which is also really 
close, so I think this is fine.


---




[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...

2017-10-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19438#discussion_r143000448
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1038,8 +1038,8 @@ def summary(self, *statistics):
 |   mean|   3.5| null|
 | stddev|2.1213203435596424| null|
 |min| 2|Alice|
-|25%| 5| null|
-|50%| 5| null|
+|25%| 2| null|
--- End diff --

Although this looks like a big change, the test data set has only two data 
elements, with values 2 and 5, so these answers are pretty much equally valid. It's 
probably more logical that the 25th percentile is 2 if the 75th is 5.
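
A minimal PySpark sketch of the case being discussed; the two-row data set is the one from the doctest, and the expected values assume this patch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("summary-demo").getOrCreate()

# Only two ages exist, so the middle quantiles are genuinely ambiguous
# between 2 and 5 for an approximate percentile.
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])
df.summary("25%", "75%").show()
# With this patch the 25% row shows age 2, while the 75% row stays at 5.
```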


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-05 Thread imatiach-msft
GitHub user imatiach-msft opened a pull request:

https://github.com/apache/spark/pull/19439

[SPARK-21866][ML][PySpark] Adding spark image reader

## What changes were proposed in this pull request?
Adds a Spark image reader: an implementation of a schema for representing 
images in Spark DataFrames.

The code is taken from the spark package located here:
(https://github.com/Microsoft/spark-images)

Please see the JIRA for more information 
(https://issues.apache.org/jira/browse/SPARK-21866)

Please see mailing list for SPIP vote and approval information:

(http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-SPARK-21866-Image-support-in-Apache-Spark-td22510.html)

## Background and motivation
As Apache Spark is being used more and more in the industry, some new use 
cases are emerging for different data formats beyond the traditional SQL types 
or the numerical types (vectors and matrices). Deep Learning applications 
commonly deal with image processing. A number of projects add some Deep 
Learning capabilities to Spark (see list below), but they struggle to 
communicate with each other or with MLlib pipelines because there is no 
standard way to represent an image in Spark DataFrames. We propose to federate 
efforts for representing images in Spark by defining a representation that 
caters to the most common needs of users and library developers.
This SPIP proposes a specification to represent images in Spark DataFrames 
and Datasets (based on existing industrial standards), and an interface for 
loading sources of images. It is not meant to be a full-fledged image 
processing library, but rather the core description that other libraries and 
users can rely on. Several packages already offer various processing facilities 
for transforming images or doing more complex operations, and each has various 
design tradeoffs that make them better as standalone solutions.
This project is a joint collaboration between Microsoft and Databricks, 
which have been testing this design in two open source packages: MMLSpark and 
Deep Learning Pipelines.
The proposed image format is an in-memory, decompressed representation that 
targets low-level applications. It is significantly more liberal in memory 
usage than compressed image representations such as JPEG, PNG, etc., but it 
allows easy communication with popular image processing libraries and has no 
decoding overhead.
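
As a rough illustration of the proposed representation, a hedged PySpark sketch follows; the `readImages` entry point and the image struct fields shown are assumptions taken from the spark-images package this PR ports, and the final API may differ:

```python
# Hedged sketch only: the entry point and field names below are assumptions
# based on the spark-images package, not necessarily the final Spark API.
from pyspark.ml.image import ImageSchema

images = ImageSchema.readImages("path/to/images")  # hypothetical sample path
images.printSchema()
# The image column is a struct carrying the decompressed, in-memory pixels:
#   origin    - string, source URI of the image
#   height    - int, number of rows
#   width     - int, number of columns
#   nChannels - int, number of color channels
#   mode      - int, OpenCV-compatible pixel type
#   data      - binary, uncompressed pixel bytes
images.select("image.origin", "image.height", "image.width").show(truncate=False)
```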

## How was this patch tested?

Unit tests in Scala (`ImageSchemaSuite`) and unit tests in Python.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/imatiach-msft/spark ilmat/spark-images

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19439.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19439


commit 22baf022b2f109bb1f5eba0b13ea34de894cd14c
Author: Ilya Matiach 
Date:   2017-10-04T21:10:26Z

[SPARK-21866][ML][PySpark] Adding spark image reader




---




[GitHub] spark pull request #19420: [SPARK-22191] [SQL] Add hive serde example with s...

2017-10-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19420#discussion_r142999706
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java
 ---
@@ -124,6 +124,13 @@ public static void main(String[] args) {
 // ...
 // $example off:spark_hive$
 
+// Hive serde's are also supported with serde properties.
+   String sqlQuery = "CREATE TABLE src_serde(key decimal(38,18), value 
int) USING hive"
--- End diff --

Hi, @crlalam.
We use 2-space indentation in general.
FYI, you may want to look at the [Scala Coding 
Style](https://github.com/databricks/scala-style-guide).


---




[GitHub] spark issue #19061: [SPARK-21568][CORE] ConsoleProgressBar should only be en...

2017-10-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19061
  
Could you review this `ConsoleProgressBar` PR again, @vanzin ?


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18732
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18732
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82477/
Test PASSed.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18732
  
**[Test build #82477 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82477/testReport)**
 for PR 18732 at commit 
[`f572385`](https://github.com/apache/spark/commit/f572385e28a1ccd2f8663adf64910d5f0a0ce67c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Wor...

2017-10-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17673#discussion_r142991123
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -171,20 +210,46 @@ final class Word2Vec @Since("1.4.0") (
   @Since("2.0.0")
   def setMaxSentenceLength(value: Int): this.type = set(maxSentenceLength, 
value)
 
+  /** @group setParam */
+  @Since("2.2.0")
+  val solvers = Set("sg-hs", "cbow-ns")
--- End diff --

Yeah, for reference you can just look at how linear regression does 
`supportedSolvers`. Also, the `require` isn't necessary; you can just use 
`ParamValidators.inArray[String](supportedSolvers)`.


---




[GitHub] spark pull request #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Wor...

2017-10-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17673#discussion_r142990145
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -171,20 +210,46 @@ final class Word2Vec @Since("1.4.0") (
   @Since("2.0.0")
   def setMaxSentenceLength(value: Int): this.type = set(maxSentenceLength, 
value)
 
+  /** @group setParam */
+  @Since("2.2.0")
+  val solvers = Set("sg-hs", "cbow-ns")
--- End diff --

"skipgram-hierarchical softmax"


---




[GitHub] spark issue #17357: [SPARK-20025][CORE] Ignore SPARK_LOCAL* env, while deplo...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17357
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #17357: [SPARK-20025][CORE] Ignore SPARK_LOCAL* env, while deplo...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17357
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82476/
Test PASSed.


---




[GitHub] spark issue #17357: [SPARK-20025][CORE] Ignore SPARK_LOCAL* env, while deplo...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17357
  
**[Test build #82476 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82476/testReport)**
 for PR 17357 at commit 
[`b188cc9`](https://github.com/apache/spark/commit/b188cc9a9e290683210d3c4a6841d37ca00b112f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #17774: [SPARK-18371][Streaming] Spark Streaming backpressure ge...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17774
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@HyukjinKwon thanks!


---




[GitHub] spark pull request #19406: [SPARK-22179] percentile_approx should choose the...

2017-10-05 Thread wzhfy
Github user wzhfy closed the pull request at:

https://github.com/apache/spark/pull/19406


---




[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19438#discussion_r142981865
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala
 ---
@@ -129,7 +144,7 @@ class ApproximatePercentileQuerySuite extends QueryTest 
with SharedSQLContext {
 withTempView(table) {
   (1 to 1000).toDF("col").createOrReplaceTempView(table)
   checkAnswer(
-spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 
+ 800D) FROM $table"),
+spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 
+ 8000D) FROM $table"),
--- End diff --

Here, the test case is fixed by increasing the accuracy.


---




[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19438
  
cc @srowen @jiangxb1987 @HyukjinKwon 


---




[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19406
  
Ah, that's fine :). It was just an option. I will follow the discussion and 
help sort it out in any event.


---




[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@HyukjinKwon These two JIRAs change percentile_approx in different ways, so 
maybe it's better to use different JIRAs?


---




[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@HyukjinKwon uh... just saw this; I already created a new [JIRA](url) and 
[PR](https://github.com/apache/spark/pull/19438). Is that also OK?


---




[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19438
  
**[Test build #82479 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82479/testReport)**
 for PR 19438 at commit 
[`f2b1538`](https://github.com/apache/spark/commit/f2b153800ebdf10999d4a8bb3578101a12f6d631).


---




[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...

2017-10-05 Thread wzhfy
GitHub user wzhfy opened a pull request:

https://github.com/apache/spark/pull/19438

[SPARK-22208] [SQL] Improve percentile_approx by not rounding up 
targetError and starting from index 0

## What changes were proposed in this pull request?

Currently, percentile_approx never returns the first element when the percentile 
is in (relativeError, 1/N], where relativeError defaults to 1/10000 and N is the 
total number of elements. Ideally, percentiles in [0, 1/N] should all 
return the first element as the answer.

For example, given input data 1 to 10, if a user queries the 10th percentile (or 
lower), it should return 1, because the first value 1 already covers 10% of the 
data. Currently it returns 2.

Based on the paper, targetError should not be rounded up, and the search index 
should start from 0 instead of 1. By following the paper, we should be able to 
fix the cases mentioned above.
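
A minimal PySpark sketch of the example above (not part of the patch; the noted result assumes this change):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("percentile-demo").getOrCreate()

# Values 1..10: the first value already covers 10% of the data.
spark.range(1, 11).toDF("col").createOrReplaceTempView("tbl")
spark.sql("SELECT percentile_approx(col, 0.1) FROM tbl").show()
# With this change the query returns 1; before it, it returned 2.
```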

## How was this patch tested?

Added a new test case and fixed existing test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wzhfy/spark improve_percentile_approx

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19438.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19438


commit 24f8295498a7ad6d2d99ea27a196ccf154165907
Author: Zhenhua Wang 
Date:   2017-09-30T16:04:32Z

return the first element for small percentage

commit 8c8c22dbebe99def6127b49988dfc4f886797bd6
Author: Zhenhua Wang 
Date:   2017-10-02T10:24:28Z

fix test

commit dbc3d47b0a56113032d2a4565180932e4ef26219
Author: Zhenhua Wang 
Date:   2017-10-02T14:53:04Z

fix test

commit 9815ce8e17e34422f8c915d115061a9635abd119
Author: Zhenhua Wang 
Date:   2017-10-03T14:51:55Z

fix pyspark test

commit f2b153800ebdf10999d4a8bb3578101a12f6d631
Author: Zhenhua Wang 
Date:   2017-10-05T15:47:27Z

follow the paper and fix sparkR test




---




[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19406
  
Oh, optionally, we can just edit the JIRA I guess.


---




[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@srowen @jiangxb1987 OK, I'll close this JIRA and create a new one as an 
improvement instead of a bugfix.


---




[GitHub] spark issue #19090: [SPARK-21877][DEPLOY, WINDOWS] Handle quotes in Windows ...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19090
  
@felixcheung, this one LGTM, as I checked everything I could and am quite 
confident; however, I will leave it open for a few more days given its 
importance. Let me cc you here so you can double-check when you have some time, 
or leave comments if you have any concerns.


---




[GitHub] spark issue #19041: [SPARK-21097][CORE] Add option to recover cached data

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19041
  
**[Test build #82478 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82478/testReport)**
 for PR 19041 at commit 
[`985874d`](https://github.com/apache/spark/commit/985874da9f72a942d1a28f413167ab3b7fcc64e6).


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142961120
  
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
 
 
 def wrap_pandas_udf(f, return_type):
-arrow_return_type = toArrowType(return_type)
-
-def verify_result_length(*a):
-result = f(*a)
-if not hasattr(result, "__len__"):
-raise TypeError("Return type of pandas_udf should be a 
Pandas.Series")
-if len(result) != len(a[0]):
-raise RuntimeError("Result vector from pandas_udf was not the 
required length: "
-   "expected %d, got %d" % (len(a[0]), 
len(result)))
-return result
-return lambda *a: (verify_result_length(*a), arrow_return_type)
+if isinstance(return_type, StructType):
--- End diff --

Yes will do.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-05 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/18732
  
@HyukjinKwon Thanks for the summary!

* https://github.com/apache/spark/pull/18732#discussion_r142735696 
(`ArrowPandasSerializer`): I will spend some time addressing this today.
* https://github.com/apache/spark/pull/18732#issuecomment-333065737 
(breaking into two pandas udf APIs): I think this is addressed here: 
https://github.com/apache/spark/pull/18732#discussion_r141830344. But I am 
happy to discuss more.
* https://github.com/apache/spark/pull/18732#issuecomment-26073 (API 
naming): I will wait on feedback here.



---




[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19369
  
**[Test build #3942 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3942/testReport)**
 for PR 19369 at commit 
[`d996c28`](https://github.com/apache/spark/commit/d996c283602269afd05dffad1e681f47f7baf47f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19399: [SPARK-22175][WEB-UI] Add status column to histor...

2017-10-05 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/19399#discussion_r142959826
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -850,6 +869,18 @@ private[history] class AppListingListener(log: 
FileStatus, clock: Clock) extends
 fileSize)
 }
 
+def applicationStatus : Option[String] = {
+  if (startTime.getTime == -1) {
+Some("")
+  } else if (endTime.getTime == -1) {
+Some("")
+  } else if (jobToStatus.isEmpty || jobToStatus.exists(_._2 != 
"Succeeded")) {
--- End diff --

Also, I don't know if this criterion is even accurate. You could have a 
successful app that doesn't run any jobs -- e.g., it's kicked off by cron 
regularly, checks some metadata to see if any work needs to be done, and if 
not, it just quits. It doesn't seem right to call that "failed".

In progress is also tricky, as the app may have been killed without endTime 
getting written.

Anyway, I guess this is OK; I'm just pointing out some reasons why this can be 
misleading. In particular, I think it would be nicer if Spark actually logged 
whether or not the app was successful.


---




[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19436
  
Thanks @HyukjinKwon @felixcheung 


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142957552
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2058,7 +2058,7 @@ def __init__(self, func, returnType, name=None, 
vectorized=False):
 self._name = name or (
 func.__name__ if hasattr(func, '__name__')
 else func.__class__.__name__)
-self._vectorized = vectorized
+self.vectorized = vectorized
--- End diff --

Are we OK with `vectorized` being a public field? I am fine with either public 
or private, but I do think the fields of the function returned by 
`UserDefinedFunction._wrapped()` should have the same field names as 
`UserDefinedFunction`, to avoid confusion.


---




[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...

2017-10-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19436


---




[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19436
  
Merged to master, branch-2.2 and branch-2.1.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142956597
  
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,65 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col, values)
 return GroupedData(jgd, self.sql_ctx)
 
+def apply(self, udf):
+"""
+Maps each group of the current :class:`DataFrame` using a pandas 
udf and returns the result
+as a :class:`DataFrame`.
+
+The user-function should take a `pandas.DataFrame` and return 
another `pandas.DataFrame`.
+Each group is passed as a `pandas.DataFrame` to the user-function 
and the returned
+`pandas.DataFrame` are combined as a :class:`DataFrame`. The 
returned `pandas.DataFrame`
+can be arbitrary length and its schema should match the returnType 
of the pandas udf.
+
+:param udf: A wrapped function returned by `pandas_udf`
+
+>>> df = spark.createDataFrame(
+... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+... ("id", "v"))
+>>> @pandas_udf(returnType=df.schema)
+... def normalize(pdf):
+... v = pdf.v
+... return pdf.assign(v=(v - v.mean()) / v.std())
+>>> df.groupby('id').apply(normalize).show() # doctest: + SKIP
++---+---+
+| id|  v|
++---+---+
+|  1|-0.7071067811865475|
+|  1| 0.7071067811865475|
+|  2|-0.8320502943378437|
+|  2|-0.2773500981126146|
+|  2| 1.1094003924504583|
++---+---+
+
+.. seealso:: :meth:`pyspark.sql.functions.pandas_udf`
+
+"""
+from pyspark.sql.functions import pandas_udf
+
+# Columns are special because hasattr always return True
+if isinstance(udf, Column) or not hasattr(udf, 'func') or not 
udf.vectorized:
+raise ValueError("The argument to apply must be a pandas_udf")
+if not isinstance(udf.returnType, StructType):
+raise ValueError("The returnType of the pandas_udf must be a 
StructType")
+
+df = DataFrame(self._jgd.df(), self.sql_ctx)
+func = udf.func
+returnType = udf.returnType
+
+# The python executors expects the function to take a list of 
pd.Series as input
+# So we to create a wrapper function that turns that to a 
pd.DataFrame before passing
+# down to the user function
+columns = df.columns
+
+def wrapped(*cols):
+import pandas as pd
+return func(pd.concat(cols, axis=1, keys=columns))
--- End diff --

@BryanCutler yeah, I was trying to do that earlier, but unfortunately the 
column names are lost on the worker, so we cannot construct the 
`pandas.DataFrame` there.

I think the best place to define the wrap function is probably on the PySpark 
driver side, because we have the most information there. However, that 
requires some refactoring. I will give it a try and see how it goes.
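
A small standalone sketch of the `pd.concat(..., keys=columns)` pattern the wrapper above relies on; the column names and the `normalize` function are taken from the doctest and are only illustrative:

```python
import pandas as pd

# Column names are captured where they are known (the driver) and re-attached
# to the unnamed per-group Series that arrive on the worker.
columns = ["id", "v"]

def make_wrapper(func, columns):
    def wrapped(*cols):
        pdf = pd.concat(cols, axis=1, keys=columns)  # rebuild a named DataFrame
        return func(pdf)
    return wrapped

def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

wrapped = make_wrapper(normalize, columns)
print(wrapped(pd.Series([1, 1, 2]), pd.Series([1.0, 2.0, 3.0])))
```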


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142952213
  
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col)
 else:
 jgd = self._jgd.pivot(pivot_col, values)
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self._df)
+
+def apply(self, udf):
+"""
+Maps each group of the current :class:`DataFrame` using a pandas 
udf and returns the result
+as a :class:`DataFrame`.
+
+The user-defined function should take a `pandas.DataFrame` and 
return another
+`pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` 
to the user-function and
+the returned`pandas.DataFrame` are combined as a 
:class:`DataFrame`. The returned
+`pandas.DataFrame` can be arbitrary length and its schema should 
match the returnType of
+the pandas udf.
+
+:param udf: A wrapped function returned by `pandas_udf`
+
+>>> df = spark.createDataFrame(
+... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+... ("id", "v"))
+>>> @pandas_udf(returnType=df.schema)
+... def normalize(pdf):
+... v = pdf.v
+... return pdf.assign(v=(v - v.mean()) / v.std())
+>>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

Not sure... I think what you already know is what I usually do.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18732
  
Ongoing discussions that (I think) might block this PR:

- https://github.com/apache/spark/pull/18732#discussion_r142735696 by 
@BryanCutler: `ArrowPandasSerializer` able to serialize pandas.DataFrames

- https://github.com/apache/spark/pull/18732#issuecomment-333065737 by 
@viirya: breaking this definition into two (grouping and normal udfs).

- https://github.com/apache/spark/pull/18732#issuecomment-26073 by 
@rxin and answer 
https://github.com/apache/spark/pull/18732#issuecomment-333432266 by 
@icexelloss: naming suggestion



---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18732
  
**[Test build #82477 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82477/testReport)**
 for PR 18732 at commit 
[`f572385`](https://github.com/apache/spark/commit/f572385e28a1ccd2f8663adf64910d5f0a0ce67c).


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142949557
  
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
 
 
 def wrap_pandas_udf(f, return_type):
-arrow_return_type = toArrowType(return_type)
-
-def verify_result_length(*a):
-result = f(*a)
-if not hasattr(result, "__len__"):
-raise TypeError("Return type of pandas_udf should be a 
Pandas.Series")
-if len(result) != len(a[0]):
-raise RuntimeError("Result vector from pandas_udf was not the 
required length: "
-   "expected %d, got %d" % (len(a[0]), 
len(result)))
-return result
-return lambda *a: (verify_result_length(*a), arrow_return_type)
+if isinstance(return_type, StructType):
+arrow_return_types = [to_arrow_type(field.dataType) for field in 
return_type]
+
+def fn(*a):
--- End diff --

Yes, I will change the name to something more descriptive.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142949179
  
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
 
 
 def wrap_pandas_udf(f, return_type):
-arrow_return_type = toArrowType(return_type)
-
-def verify_result_length(*a):
-result = f(*a)
-if not hasattr(result, "__len__"):
-raise TypeError("Return type of pandas_udf should be a 
Pandas.Series")
-if len(result) != len(a[0]):
-raise RuntimeError("Result vector from pandas_udf was not the 
required length: "
-   "expected %d, got %d" % (len(a[0]), 
len(result)))
-return result
-return lambda *a: (verify_result_length(*a), arrow_return_type)
+if isinstance(return_type, StructType):
--- End diff --

Yeah, let's add some comments and throw a better exception. For example, I 
think the exception message should clarify that `StructType` return types are 
meant for grouping UDFs only.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142948551
  
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col)
 else:
 jgd = self._jgd.pivot(pivot_col, values)
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self._df)
+
+def apply(self, udf):
+"""
+Maps each group of the current :class:`DataFrame` using a pandas 
udf and returns the result
+as a :class:`DataFrame`.
+
+The user-defined function should take a `pandas.DataFrame` and 
return another
+`pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` 
to the user-function and
+the returned`pandas.DataFrame` are combined as a 
:class:`DataFrame`. The returned
+`pandas.DataFrame` can be arbitrary length and its schema should 
match the returnType of
+the pandas udf.
+
+:param udf: A wrapped function returned by `pandas_udf`
+
+>>> df = spark.createDataFrame(
+... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+... ("id", "v"))
+>>> @pandas_udf(returnType=df.schema)
+... def normalize(pdf):
+... v = pdf.v
+... return pdf.assign(v=(v - v.mean()) / v.std())
+>>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

Ahh.. thanks! Will give it a try.

Still, is there an easier way to run the PySpark tests locally (the way 
Jenkins runs them)?


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142948307
  
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
 
 
 def wrap_pandas_udf(f, return_type):
-arrow_return_type = toArrowType(return_type)
-
-def verify_result_length(*a):
-result = f(*a)
-if not hasattr(result, "__len__"):
-raise TypeError("Return type of pandas_udf should be a 
Pandas.Series")
-if len(result) != len(a[0]):
-raise RuntimeError("Result vector from pandas_udf was not the 
required length: "
-   "expected %d, got %d" % (len(a[0]), 
len(result)))
-return result
-return lambda *a: (verify_result_length(*a), arrow_return_type)
+if isinstance(return_type, StructType):
+arrow_return_types = [to_arrow_type(field.dataType) for field in 
return_type]
+
+def fn(*a):
--- End diff --

Yeah, but `fn` looks like a no-no... do you maybe have an idea for a better 
name?


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142947514
  
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col)
 else:
 jgd = self._jgd.pivot(pivot_col, values)
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self._df)
+
+def apply(self, udf):
+"""
+Maps each group of the current :class:`DataFrame` using a pandas 
udf and returns the result
+as a :class:`DataFrame`.
+
+The user-defined function should take a `pandas.DataFrame` and 
return another
+`pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` 
to the user-function and
+the returned`pandas.DataFrame` are combined as a 
:class:`DataFrame`. The returned
+`pandas.DataFrame` can be arbitrary length and its schema should 
match the returnType of
+the pandas udf.
+
+:param udf: A wrapped function returned by `pandas_udf`
+
+>>> df = spark.createDataFrame(
+... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+... ("id", "v"))
+>>> @pandas_udf(returnType=df.schema)
+... def normalize(pdf):
+... v = pdf.v
+... return pdf.assign(v=(v - v.mean()) / v.std())
+>>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

Also, it looks like this file does not define `spark` as a global used in the 
doctests. I think we should add something like ...

```diff
  sc = spark.sparkContext
  globs['sc'] = sc
+ globs['spark'] = spark 
  globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
```


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142946504
  
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col)
 else:
 jgd = self._jgd.pivot(pivot_col, values)
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self._df)
+
+def apply(self, udf):
+"""
+Maps each group of the current :class:`DataFrame` using a pandas 
udf and returns the result
+as a :class:`DataFrame`.
+
+The user-defined function should take a `pandas.DataFrame` and 
return another
+`pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` 
to the user-function and
+the returned`pandas.DataFrame` are combined as a 
:class:`DataFrame`. The returned
+`pandas.DataFrame` can be arbitrary length and its schema should 
match the returnType of
+the pandas udf.
+
+:param udf: A wrapped function returned by `pandas_udf`
+
+>>> df = spark.createDataFrame(
+... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+... ("id", "v"))
+>>> @pandas_udf(returnType=df.schema)
+... def normalize(pdf):
+... v = pdf.v
+... return pdf.assign(v=(v - v.mean()) / v.std())
+>>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

Probably, importing `pandas_udf` should solve the problem I guess.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142946430
  
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col)
 else:
 jgd = self._jgd.pivot(pivot_col, values)
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self._df)
+
+def apply(self, udf):
+"""
+Maps each group of the current :class:`DataFrame` using a pandas 
udf and returns the result
+as a :class:`DataFrame`.
+
+The user-defined function should take a `pandas.DataFrame` and 
return another
+`pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` 
to the user-function and
+the returned`pandas.DataFrame` are combined as a 
:class:`DataFrame`. The returned
+`pandas.DataFrame` can be arbitrary length and its schema should 
match the returnType of
+the pandas udf.
+
+:param udf: A wrapped function returned by `pandas_udf`
+
+>>> df = spark.createDataFrame(
+... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+... ("id", "v"))
+>>> @pandas_udf(returnType=df.schema)
+... def normalize(pdf):
+... v = pdf.v
+... return pdf.assign(v=(v - v.mean()) / v.std())
+>>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

I think the problem is that `pandas_udf` is not importable in this doctest. To 
my knowledge, `# doctest: +SKIP` applies per line.
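
A tiny standalone illustration of that point: the directive only suppresses the single example it is attached to, so the earlier `@pandas_udf` example in the same docstring still executes (and fails when `pandas_udf` is not importable):

```python
import doctest

def demo():
    """
    >>> 1 + 1          # this example still runs and is checked
    2
    >>> 1 / 0  # doctest: +SKIP
    """

if __name__ == "__main__":
    # Only the first example executes; the +SKIP directive is per example.
    doctest.testmod(verbose=True)
```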


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142945465
  
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col)
 else:
 jgd = self._jgd.pivot(pivot_col, values)
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self._df)
+
+def apply(self, udf):
+"""
+Maps each group of the current :class:`DataFrame` using a pandas 
udf and returns the result
+as a :class:`DataFrame`.
+
+The user-defined function should take a `pandas.DataFrame` and 
return another
+`pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` 
to the user-function and
+the returned`pandas.DataFrame` are combined as a 
:class:`DataFrame`. The returned
+`pandas.DataFrame` can be arbitrary length and its schema should 
match the returnType of
+the pandas udf.
+
+:param udf: A wrapped function returned by `pandas_udf`
+
+>>> df = spark.createDataFrame(
+... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+... ("id", "v"))
+>>> @pandas_udf(returnType=df.schema)
+... def normalize(pdf):
+... v = pdf.v
+... return pdf.assign(v=(v - v.mean()) / v.std())
+>>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

I have been using 
```
bin/pyspark pyspark.sql.tests GroupbyApplyTests
```

But this doesn't seem to run the doctests.



---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-05 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142944123
  
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col)
 else:
 jgd = self._jgd.pivot(pivot_col, values)
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self._df)
+
+def apply(self, udf):
+"""
+Maps each group of the current :class:`DataFrame` using a pandas 
udf and returns the result
+as a :class:`DataFrame`.
+
+The user-defined function should take a `pandas.DataFrame` and 
return another
+`pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` 
to the user-function and
+the returned`pandas.DataFrame` are combined as a 
:class:`DataFrame`. The returned
+`pandas.DataFrame` can be arbitrary length and its schema should 
match the returnType of
+the pandas udf.
+
+:param udf: A wrapped function returned by `pandas_udf`
+
+>>> df = spark.createDataFrame(
+... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+... ("id", "v"))
+>>> @pandas_udf(returnType=df.schema)
+... def normalize(pdf):
+... v = pdf.v
+... return pdf.assign(v=(v - v.mean()) / v.std())
+>>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

It seems this is still not skipped by the doctest.

What's the best way to run the PySpark tests locally?

I tried

```
./run-tests --modules=pyspark-sql --parallelism=4
```

But it's giving me a different failure.


---




[GitHub] spark issue #19389: [SPARK-22165][SQL] Resolve type conflicts between decima...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19389
  
ping @gatorsmile


---




[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82474 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82474/testReport)**
 for PR 19436 at commit 
[`71bf813`](https://github.com/apache/spark/commit/71bf813a4375a5736f903bffb3b17a29d2928d56).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82474/
Test PASSed.


---




[GitHub] spark issue #17357: [SPARK-20025][CORE] Ignore SPARK_LOCAL* env, while deplo...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17357
  
**[Test build #82476 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82476/testReport)**
 for PR 17357 at commit 
[`b188cc9`](https://github.com/apache/spark/commit/b188cc9a9e290683210d3c4a6841d37ca00b112f).


---




[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets

2017-10-05 Thread susanxhuynh
Github user susanxhuynh commented on the issue:

https://github.com/apache/spark/pull/19437
  
@ArtRand @skonto Please review. Tests passed.


---




[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19437
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82475/
Test PASSed.


---




[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19437
  
**[Test build #82475 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82475/testReport)**
 for PR 19437 at commit 
[`6f062c0`](https://github.com/apache/spark/commit/6f062c00f6382d266619b4a56a753ec27d1db10b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19437
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19437
  
**[Test build #82475 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82475/testReport)**
 for PR 19437 at commit 
[`6f062c0`](https://github.com/apache/spark/commit/6f062c00f6382d266619b4a56a753ec27d1db10b).


---




[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets

2017-10-05 Thread susanxhuynh
GitHub user susanxhuynh opened a pull request:

https://github.com/apache/spark/pull/19437

[SPARK-22131][MESOS] Mesos driver secrets

## Background

In #18837, @ArtRand added Mesos secrets support to the dispatcher. **This 
PR adds the same secrets support to the drivers.** This means that if the 
secret configs are set, the driver will launch executors that have access to 
either env-based or file-based secrets.

One use case for this is to support TLS in the driver <=> executor 
communication.

## What changes were proposed in this pull request?

Most of the changes are a refactor of the dispatcher secrets support 
(#18837) - moving it to a common place that can be used by both the dispatcher 
and drivers. The same goes for the unit tests.

## How was this patch tested?

There are four config combinations: [env or file-based] x [value or 
reference secret]. For each combination:
- Added a unit test.
- Tested in DC/OS.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mesosphere/spark sh-mesos-driver-secret

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19437.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19437


commit b289bcc95f0b67cda94ddf416fc9a15e5d1855b4
Author: Susan X. Huynh 
Date:   2017-10-04T11:30:31Z

[SPARK-22131] Mesos driver secrets. The driver launches executors that have 
access to env or file-based secrets.

commit 6f062c00f6382d266619b4a56a753ec27d1db10b
Author: Susan X. Huynh 
Date:   2017-10-05T12:07:20Z

[SPARK-22131] Updated docs




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-05 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/18924
  
Thank you, @hhbyyh.

I have augmented the example a bit: explicitly set the random seed and chose the 
online optimizer:

`val lda = new 
LDA().setK(10).setMaxIter(10).setOptimizer("online").setSeed(13)`

But for some reason, if I run it twice the results are not the same. Is 
that expected? This was on branch-2.2.
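
A minimal determinism check along these lines (a sketch only; it assumes the usual 
`spark` SparkSession and the sample LDA data file that ships with the Spark repo):

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.DataFrame

// Sample corpus from the Spark examples; substitute your own DataFrame of features.
val dataset: DataFrame =
  spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

def topTopics(seed: Long): Seq[String] = {
  val lda = new LDA().setK(10).setMaxIter(10).setOptimizer("online").setSeed(seed)
  // describeTopics returns (topic, termIndices, termWeights); stringify for a cheap comparison.
  lda.fit(dataset).describeTopics(5).collect().map(_.toString).toSeq
}

// With a fixed seed, two runs should produce identical topics once the optimizer
// is deterministic; a mismatch reproduces the behaviour described above.
println(s"runs identical: ${topTopics(13L) == topTopics(13L)}")
```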


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19369
  
**[Test build #3942 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3942/testReport)**
 for PR 19369 at commit 
[`d996c28`](https://github.com/apache/spark/commit/d996c283602269afd05dffad1e681f47f7baf47f).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...

2017-10-05 Thread superbobry
Github user superbobry commented on the issue:

https://github.com/apache/spark/pull/19369
  
I've fixed the failing `DiskStoreSuite` and verified that the other two suites 
also pass.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82474 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82474/testReport)**
 for PR 19436 at commit 
[`71bf813`](https://github.com/apache/spark/commit/71bf813a4375a5736f903bffb3b17a29d2928d56).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...

2017-10-05 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19436#discussion_r142903183
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -3075,6 +3075,11 @@ test_that("gapply() and gapplyCollect() on a 
DataFrame", {
   df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
   expect_identical(df1Collect, expected)
 
+  # gapply on empty grouping columns.
+  dfTwoPartition <- repartition(df, 2L)
+  df1TwoPartition <- gapply(dfTwoPartition, c(), function(key, x) { x }, 
schema(dfTwoPartition))
+  expect_identical(sort(collect(df1TwoPartition)), sort(expected))
--- End diff --

Ok. Let me use your test code. I don't want to block this PR. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...

2017-10-05 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19436#discussion_r142902434
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -3075,6 +3075,11 @@ test_that("gapply() and gapplyCollect() on a 
DataFrame", {
   df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
   expect_identical(df1Collect, expected)
 
+  # gapply on empty grouping columns.
+  dfTwoPartition <- repartition(df, 2L)
+  df1TwoPartition <- gapply(dfTwoPartition, c(), function(key, x) { x }, 
schema(dfTwoPartition))
+  expect_identical(sort(collect(df1TwoPartition)), sort(expected))
--- End diff --

hmm, I think it should work. `repartition` is not necessary. I'm just 
wondering how to test this in R...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19436#discussion_r142901810
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -3075,6 +3075,11 @@ test_that("gapply() and gapplyCollect() on a 
DataFrame", {
   df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
   expect_identical(df1Collect, expected)
 
+  # gapply on empty grouping columns.
+  dfTwoPartition <- repartition(df, 2L)
+  df1TwoPartition <- gapply(dfTwoPartition, c(), function(key, x) { x }, 
schema(dfTwoPartition))
+  expect_identical(sort(collect(df1TwoPartition)), sort(expected))
--- End diff --

Actually, I tested these: 

```R
   df1 <- gapply(df, c(), function(key, x) { x }, schema(df))
   actual <- collect(df1)
   expect_identical(actual, expected)
```



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19436
  
Let me install an R environment to test it locally...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19374: [SPARK-22145][MESOS] fix supervise with checkpointing on...

2017-10-05 Thread skonto
Github user skonto commented on the issue:

https://github.com/apache/spark/pull/19374
  
@ArtRand @susanxhuynh gentle ping.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19369: [SPARK-22147][CORE] Removed redundant allocations...

2017-10-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19369#discussion_r142896027
  
--- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala ---
@@ -67,7 +67,7 @@ private[spark] class DiskStore(
 var threwException: Boolean = true
 try {
   writeFunc(out)
-  blockSizes.put(blockId.name, out.getCount)
+  blockSizes.put(blockId, out.getCount)
--- End diff --

@superbobry I think the last test failure is legit, as you need to update 
the call to `remove(blockId.name)` at about line 116.

I was surprised it even compiles, but for legacy reasons the JDK 
collections classes don't have generic types on methods like `remove`, so it 
accepts any object. That, however, should be the last change needed here.
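
A small standalone illustration of that pitfall (the names are simplified 
stand-ins for the `DiskStore` ones, not the actual Spark classes):

```scala
// java.util.concurrent.ConcurrentHashMap#remove takes Object, so a key of the
// wrong type compiles fine but silently removes nothing.
import java.util.concurrent.ConcurrentHashMap

case class BlockId(name: String)   // simplified stand-in for Spark's BlockId

val blockSizes = new ConcurrentHashMap[BlockId, Long]()
val id = BlockId("rdd_0_0")
blockSizes.put(id, 1024L)

blockSizes.remove(id.name)         // wrong key type: compiles, removes nothing
println(blockSizes.size())         // still 1

blockSizes.remove(id)              // correct key type
println(blockSizes.size())         // 0
```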


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82473/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82473 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82473/testReport)**
 for PR 19436 at commit 
[`0e111a8`](https://github.com/apache/spark/commit/0e111a8d095c9ecdb9fb8249332b9e12c15e8fce).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19369
  
**[Test build #3941 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3941/testReport)**
 for PR 19369 at commit 
[`8590efe`](https://github.com/apache/spark/commit/8590efec78638735f170e9f6d2fd04c65724e20e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19429: [SPARK-20055] [Docs] Added documentation for loading csv...

2017-10-05 Thread jomach
Github user jomach commented on the issue:

https://github.com/apache/spark/pull/19429
  
@felixcheung Sorry about that. It should be there now. Can you test? Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19369
  
**[Test build #3941 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3941/testReport)**
 for PR 19369 at commit 
[`8590efe`](https://github.com/apache/spark/commit/8590efec78638735f170e9f6d2fd04c65724e20e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82473 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82473/testReport)**
 for PR 19436 at commit 
[`0e111a8`](https://github.com/apache/spark/commit/0e111a8d095c9ecdb9fb8249332b9e12c15e8fce).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19436
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82470/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82470 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82470/testReport)**
 for PR 19436 at commit 
[`6710141`](https://github.com/apache/spark/commit/6710141767a2df92898af319bc4ef87f9110f911).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82472/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82472 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82472/testReport)**
 for PR 19436 at commit 
[`0e111a8`](https://github.com/apache/spark/commit/0e111a8d095c9ecdb9fb8249332b9e12c15e8fce).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18732
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18732
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82469/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18732
  
**[Test build #82469 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82469/testReport)**
 for PR 18732 at commit 
[`e4efb32`](https://github.com/apache/spark/commit/e4efb3281008a2b450f9013aeb8f1ac9cf4ffa9e).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82472 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82472/testReport)**
 for PR 19436 at commit 
[`0e111a8`](https://github.com/apache/spark/commit/0e111a8d095c9ecdb9fb8249332b9e12c15e8fce).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...

2017-10-05 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/19287
  
lgtm, thanks @xuanyuanking 

@jerryshao can you merge this? I will have only intermittent access for a 
few weeks, so I'd prefer not to merge in case there is any issue that needs an 
urgent follow-up.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-05 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r142854372
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -322,6 +326,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
 this
   }
 
+  @Since("2.3.0")
+  def setEpsilon(epsilon: Double): this.type = {
+require(epsilon> 0, s"LDA epsilon must be positive, but was set to 
$epsilon")
--- End diff --

Nit: add a space after `epsilon`, i.e. `epsilon > 0`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-05 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r142853109
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params 
with HasFeaturesCol with HasM
   /**
* For Online optimizer only: [[optimizer]] = "online".
*
+   * @group expertParam
--- End diff --

Please add doc comments describing the parameter here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-05 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r142853643
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params 
with HasFeaturesCol with HasM
   /**
* For Online optimizer only: [[optimizer]] = "online".
*
+   * @group expertParam
+   */
+  @Since("2.3.0")
+  final val epsilon = new DoubleParam(this, "epsilon", "(For online 
optimizer)" +
+" A (positive) learning parameter that controls the convergence of 
variational inference.",
--- End diff --

The parameter description here cannot really help a user without knowledge 
of the LDA implementation. Please add more detail on the effect of tuning the 
parameter, e.g. "A smaller value will lead to higher accuracy at the cost of 
more iterations."
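
One possible shape for the expanded description (illustrative wording only, not 
the text that was ultimately merged; the trait name is hypothetical):

```scala
import org.apache.spark.ml.param.{DoubleParam, ParamValidators, Params}

trait HasLDAEpsilon extends Params {
  /**
   * For Online optimizer only.
   *
   * A positive learning parameter that controls the convergence of variational
   * inference; smaller values tighten the convergence criterion, trading more
   * iterations for higher accuracy.
   *
   * @group expertParam
   */
  final val epsilon: DoubleParam = new DoubleParam(this, "epsilon",
    "(For online optimizer) A positive learning parameter that controls the convergence " +
      "of variational inference; smaller values lead to higher accuracy at the cost of " +
      "more iterations.",
    ParamValidators.gt(0))

  /** @group expertGetParam */
  def getEpsilon: Double = $(epsilon)
}
```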


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


