[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82067/testReport)** for PR 18945 at commit [`6e16cd8`](https://github.com/apache/spark/commit/6e16cd82434c82cd7213ae8ef2b52e1c42e607cf).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82067/
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Merged build finished. Test PASSed.
[GitHub] spark pull request #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calc...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19281
[GitHub] spark issue #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calculate i...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19281

Thanks! Merged to master.
[GitHub] spark issue #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calculate i...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19281

LGTM
[GitHub] spark issue #19261: [SPARK-22040] Add current_date function with timezone id
Github user jaceklaskowski commented on the issue: https://github.com/apache/spark/pull/19261

@rxin @gatorsmile Let me ask you a very similar question then: why does the `CurrentDate` operator have the optional timezone parameter? What's its purpose? Wouldn't that answer your questions? I don't mind not having the change, but I am curious what the reason for the "mismatch" is.
[GitHub] spark issue #19319: [SPARK-21766][PySpark][SQL] DataFrame toPandas() raises ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19319

I'd go with this PR / approach. This approach and PR look pretty good. Let me help double check this tonight.
[GitHub] spark issue #19320: [SPARK-22099] The 'job ids' list style needs to be chang...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19320

Can one of the admins verify this patch?
[GitHub] spark issue #19319: [SPARK-21766][PySpark][SQL] DataFrame toPandas() raises ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19319

**[Test build #82069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82069/testReport)** for PR 19319 at commit [`779eb40`](https://github.com/apache/spark/commit/779eb400790cb04ff4d62d7701a2af1d3d58175f).
[GitHub] spark pull request #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide ju...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r140421501

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---
```
@@ -61,7 +59,37 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
       details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
     }}

-    UIUtils.headerSparkPage("SQL", content, parent, Some(5000))
+
+    val summary: NodeSeq =
+      {
+        if (listener.getRunningExecutions.nonEmpty) {
+          Running Queries:
+          {listener.getRunningExecutions.size}
+        }
+      }
+      {
+        if (listener.getCompletedExecutions.nonEmpty) {
+          Completed Queries:
+          {listener.getCompletedExecutions.size}
+        }
+      }
+      {
+        if (listener.getFailedExecutions.nonEmpty) {
+          Failed Queries:
+          {listener.getFailedExecutions.size}
+        }
+      }
```
--- End diff --

Please follow the style in the other files of the package `org.apache.spark.sql.execution.ui`.
[GitHub] spark pull request #19320: [SPARK-22099] The 'job ids' list style needs to b...
GitHub user guoxiaolongzte opened a pull request: https://github.com/apache/spark/pull/19320

[SPARK-22099] The 'job ids' list style needs to be changed in the SQL page.

## What changes were proposed in this pull request?

The 'job ids' list style needs to be changed in the SQL page, for two reasons:

1. If each job id is on its own line and there are a lot of job ids, the table row becomes very tall. As shown below:
2. It should be consistent with the 'JDBC / ODBC Server' page style; I modified the style to match it. As shown below:

My changes are as follows:

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/guoxiaolongzte/spark SPARK-22099

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19320.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19320

commit f6331c4b9922c3fce4bb2a8b0fedb66c16017b75
Author: guoxiaolong
Date: 2017-09-22T06:33:52Z

    [SPARK-22099] The 'job ids' list style needs to be changed in the SQL page
[GitHub] spark issue #19319: [SPARK-21766][PySpark][SQL] DataFrame toPandas() raises ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19319

**[Test build #82068 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82068/testReport)** for PR 19319 at commit [`e12f576`](https://github.com/apache/spark/commit/e12f5768436543bfbb78fd0bb39b48c96e04286c).
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140420898

--- Diff: python/pyspark/sql/dataframe.py ---
```
@@ -1760,13 +1760,39 @@ def toPandas(self):
                     "if using spark.sql.execution.arrow.enable=true"
                 raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 import numpy as np
                 dtype = {}
                 nullable_int_columns = set()

                 def null_handler(rows, nullable_int_columns):
                     requires_double_precision = set()
                     for row in rows:
                         row = row.asDict()
                         for column in nullable_int_columns:
                             val = row[column]
                             dt = dtype[column]
                             if val is None and dt not in (np.float32, np.float64):
                                 dt = np.float64 if column in requires_double_precision else np.float32
                                 dtype[column] = dt
                             elif val is not None:
                                 if abs(val) > 16777216:  # Max value before np.float32 loses precision.
```
--- End diff --

I think they are represented as np.float64. I added a test in #19319.
[GitHub] spark pull request #19319: [SPARK-21766][PySpark][SQL] DataFrame toPandas() ...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/19319

[SPARK-21766][PySpark][SQL] DataFrame toPandas() raises ValueError with nullable int columns

## What changes were proposed in this pull request?

When calling `DataFrame.toPandas()` (without Arrow enabled), if there is an `IntegralType` column (`IntegerType`, `ShortType`, `ByteType`) that has null values, the following exception is thrown:

    ValueError: Cannot convert non-finite values (NA or inf) to integer

This is because the null values are first converted to float NaN during the construction of the Pandas DataFrame in `from_records`, and the subsequent attempt to convert them back to integer fails. The fix checks whether the conversion would cause such a failure in the Pandas DataFrame; if so, we skip the conversion and use the type inferred by Pandas.

## How was this patch tested?

Added a pyspark test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-21766

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19319.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19319
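The failure mode described in this PR is easy to see with pandas alone. The sketch below is my own illustration (not code from the PR): a nullable "int" column is promoted to float64 with NaN when the DataFrame is built, and casting it back to the schema's integer type then raises the quoted ValueError.

```python
import numpy as np
import pandas as pd

# A nullable integer column: pandas promotes it to float64 and turns None into NaN.
pdf = pd.DataFrame({"x": [1, None]})
print(pdf["x"].dtype)  # float64

# Casting back to the schema's integer type then fails, as described above.
try:
    pdf["x"] = pdf["x"].astype(np.int32)
except ValueError as e:
    print(e)
```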
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82067/testReport)** for PR 18945 at commit [`6e16cd8`](https://github.com/apache/spark/commit/6e16cd82434c82cd7213ae8ef2b52e1c42e607cf).
[GitHub] spark issue #19303: [SPARK-22085][CORE]When the application has no core left...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19303

IIUC, if there's no core left, requesting new executors should be a no-op, am I right? So there should be no problem even without your fix? From your patch, it looks like you're putting standalone-specific logic into this general `ExecutorAllocationManager`; personally I would suggest not doing that.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140419964

--- Diff: python/pyspark/sql/dataframe.py --- (same hunk as quoted above)
```
if abs(val) > 16777216:  # Max value before np.float32 loses precision.
```
--- End diff --

Values above this cannot be represented losslessly as a `np.float32`.
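The 16777216 cutoff discussed here is 2**24: `np.float32` has a 24-bit significand, so every integer up to that magnitude is exact, but beyond it gaps appear. A quick illustration (my own, not part of the patch):

```python
import numpy as np

# float32 holds 2**24 = 16777216 exactly, but 2**24 + 1 is not
# representable and rounds back down to 16777216.
print(int(np.float32(16777216)))  # 16777216
print(int(np.float32(16777217)))  # 16777216  <- precision lost

# float64, with a 53-bit significand, still holds the value exactly.
print(int(np.float64(16777217)))  # 16777217
```

This is why the patch escalates such columns from `np.float32` to `np.float64`.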
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18945

Hey @logannc, let's not make it complicated for now and go with their suggestions first - https://github.com/apache/spark/pull/18945#discussion_r134033952 and https://github.com/apache/spark/pull/18945#discussion_r134925269. Maybe we can make a followup later with some small benchmark results for the performance one and the precision concern (I guess this one is not a regression BTW?). I think we should first match it with the behaviour when `spark.sql.execution.arrow.enable` is enabled.
[GitHub] spark pull request #19279: [SPARK-22061] [ML]add pipeline model of SVM
Github user daweicheng closed the pull request at: https://github.com/apache/spark/pull/19279
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82066/
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82066 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82066/testReport)** for PR 18945 at commit [`d93a203`](https://github.com/apache/spark/commit/d93a2030d366bf1eb5ae2d6cc335894eddbc48dd).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140419255

--- Diff: python/pyspark/sql/dataframe.py --- (same hunk as quoted above)
```
if abs(val) > 16777216:  # Max value before np.float32 loses precision.
```
--- End diff --

Why do we need this?
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82063/
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82063 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82063/testReport)** for PR 18945 at commit [`bd25923`](https://github.com/apache/spark/commit/bd259239c550b0b19311968aff9a69da29a6a05e).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82065/
[GitHub] spark issue #19315: Updated english.txt word ordering
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19315

@animenon can you please fix the PR title like other PRs did. Also, is this only for better readability, or do you fix any other issue? IMO the previous txt was more readable than your change, since the words were grouped by kind.
[GitHub] spark issue #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide jump link...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18015

**[Test build #82064 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82064/testReport)** for PR 18015 at commit [`21e2c31`](https://github.com/apache/spark/commit/21e2c31369b2223d0bee16b9bc98373ab0ec59a9).
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user logannc commented on the issue: https://github.com/apache/spark/pull/18945

I've continued to use @HyukjinKwon 's suggestion because it should be more performant and is capable of handling it without loss of precision. I believe I've addressed your concerns by only changing the type when we encounter a null (duh).
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82063 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82063/testReport)** for PR 18945 at commit [`bd25923`](https://github.com/apache/spark/commit/bd259239c550b0b19311968aff9a69da29a6a05e).
[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19301#discussion_r140416279

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala ---
```
@@ -72,11 +74,19 @@ object AggregateExpression {
       aggregateFunction: AggregateFunction,
       mode: AggregateMode,
       isDistinct: Boolean): AggregateExpression = {
+    val state = if (aggregateFunction.resolved) {
+      Seq(aggregateFunction.toString, aggregateFunction.dataType,
+        aggregateFunction.nullable, mode, isDistinct)
+    } else {
+      Seq(aggregateFunction.toString, mode, isDistinct)
+    }
+    val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
+
     AggregateExpression(
       aggregateFunction,
       mode,
       isDistinct,
-      NamedExpression.newExprId)
+      ExprId(hashCode))
```
--- End diff --

I don't think this is the right fix. Semantically, the `b0` and `b1` in `SELECT SUM(b) AS b0, SUM(b) AS b1` are different aggregate functions, so they should have different `resultId`s. It's kind of an optimization in the aggregate planner: we should detect these semantically different but duplicated aggregate functions and only plan one aggregate function.
[GitHub] spark issue #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide jump link...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/18015

Yes, I'm fine with it. @ajbozarth would you please take another look at this PR? Thanks.
[GitHub] spark issue #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide jump link...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/18015

Jenkins, retest this please.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578

@mallman how about adding a comment explaining why such a workaround was done, plus the bug number in parquet-mr? That way, once the bug is fixed, the code can be cleaned up. Also, maybe it's time to remove "DO NOT MERGE" from the title? As I understand, most of the comments were addressed :) Thank you very much for the work on this feature. I must admit that we are looking forward to having this merged. For us this will be the most important improvement in Spark 2.3.0 (I hope it will be part of 2.3.0 :) )
[GitHub] spark pull request #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide ju...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r140416046

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala --- (same hunk as quoted above, the `val summary: NodeSeq` block)
--- End diff --

Is the indentation here correct? This seems a little weird to me.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82062/
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82062/testReport)** for PR 18945 at commit [`6e248dd`](https://github.com/apache/spark/commit/6e248ddf96122910468a3f20125ff4fc9f32f299).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140415073

--- Diff: python/pyspark/sql/dataframe.py ---
```
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
                 columns_with_null_int = set()

                 def null_handler(rows, columns_with_null_int):
                     for row in rows:
                         row = row.asDict()
                         for column in columns_with_null_int:
                             val = row[column]
                             dt = dtype[column]
                             if val is not None:
                                 if abs(val) > 16777216:  # Max value before np.float32 loses precision.
                                     val = np.float64(val)
                                     dt = np.float64
                                     dtype[column] = np.float64
                                 else:
                                     val = np.float32(val)
                                     if dt not in (np.float32, np.float64):
                                         dt = np.float32
                                         dtype[column] = np.float32
                             row[column] = val
                         row = Row(**row)
                         yield row

                 row_handler = lambda x, y: x
                 for field in self.schema:
                     pandas_type = _to_corrected_pandas_type(field.dataType)
                     if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
                         columns_with_null_int.add(field.name)
                         row_handler = null_handler
                         pandas_type = np.float32
```
--- End diff --

I will take my suggestion back. I think their suggestions are better than mine.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140414783

--- Diff: python/pyspark/sql/dataframe.py --- (same hunk as quoted above, ending at `pandas_type = np.float32`)
--- End diff --

Ah, I see where I got confused. I had started with @ueshin 's suggestion but abandoned it because I didn't want to create the DataFrame before the type correction, because I was also looking at @HyukjinKwon 's suggestion. I somehow ended up combining them incorrectly.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82062/testReport)** for PR 18945 at commit [`6e248dd`](https://github.com/apache/spark/commit/6e248ddf96122910468a3f20125ff4fc9f32f299).
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140414202

--- Diff: python/pyspark/sql/dataframe.py --- (same hunk as quoted above)
```
pandas_type = np.float32
```
--- End diff --

A simple problem with this line: even when this condition is met, it doesn't necessarily mean there are null values in the column, but you forcibly set the type to np.float32.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140414042

--- Diff: python/pyspark/sql/dataframe.py --- (same hunk as quoted above)
```
pandas_type = np.float32
```
--- End diff --

Have you carefully read the comments in https://github.com/apache/spark/pull/18945#discussion_r134033952 and https://github.com/apache/spark/pull/18945#discussion_r134925269? They are good suggestions for this issue. I don't know why you don't want to follow them and check null values with Pandas...
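The alternative the linked comments point toward (check for nulls on the pandas side and only then downcast) could look roughly like the sketch below. This is a hedged illustration, not the PR's code: the `dtype` map stands in for what `_to_corrected_pandas_type` would derive from the Spark schema, and the column names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical dtype map derived from the schema; "a" and "b" are nullable ints.
dtype = {"a": np.int32, "b": np.int32}

pdf = pd.DataFrame({"a": [1, None], "b": [1, 2]})

for column, dt in dtype.items():
    # Only cast back to the integer type when the column holds no nulls;
    # otherwise keep the float dtype pandas inferred (NaN needs a float).
    if not pdf[column].isnull().any():
        pdf[column] = pdf[column].astype(dt, copy=False)

print(pdf.dtypes)  # a -> float64, b -> int32
```

This avoids iterating over every row in Python, since the null check is done column-wise by pandas.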
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140413579

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

Can you elaborate? I believe it is correct, per my reply to your comment in the `null_handler`.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19204 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19204: [SPARK-21981][PYTHON][ML] Added Python interface for Clu...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/19204 Merged into master, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82061 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82061/testReport)** for PR 18945 at commit [`14f36c3`](https://github.com/apache/spark/commit/14f36c354f65a34e3e06cd4d35029e5f8f2b79f0). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82061/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82061 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82061/testReport)** for PR 18945 at commit [`14f36c3`](https://github.com/apache/spark/commit/14f36c354f65a34e3e06cd4d35029e5f8f2b79f0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140412857

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = {}
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
--- End diff --

If `pandas_type in (np.int8, np.int16, np.int32) and field.nullable` and there are ANY non-null values, the dtype of the column is changed to `np.float32` or `np.float64`, both of which properly handle `None` values. That said, if the entire column were `None`, it would fail. Therefore I have preemptively changed the type on line 1787 to `np.float32`. Per `null_handler`, it may still change to `np.float64` if needed.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
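The float32/float64 split in this patch hinges on 16777216 = 2**24, the largest magnitude up to which np.float32 can represent every integer exactly. A standalone NumPy illustration (not Spark code):

```python
import numpy as np

# 2**24 == 16777216 is the last point where float32 still represents every
# integer exactly; 16777217 has no float32 representation and rounds away.
assert np.float32(16777216) == np.float32(16777217)  # precision lost
assert np.float64(16777216) != np.float64(16777217)  # float64 still exact
```

This is why values beyond that magnitude have to be widened to np.float64 rather than kept as np.float32.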
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140412632

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

I don't think this is a correct fix.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82060 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82060/testReport)** for PR 18945 at commit [`b313a3b`](https://github.com/apache/spark/commit/b313a3b8fc88898423940f195ab16bd3a57c0061). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82060/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82060 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82060/testReport)** for PR 18945 at commit [`b313a3b`](https://github.com/apache/spark/commit/b313a3b8fc88898423940f195ab16bd3a57c0061). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19229#discussion_r140412254

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2102,6 +2102,55 @@ class Dataset[T] private[sql](
   }

   /**
+   * Returns a new Dataset by adding columns or replacing the existing columns that has
+   * the same names.
+   */
+  private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
--- End diff --

@cloud-fan looked at this `withColumns` before in #17819. cc @cloud-fan in case you have more comments.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19314: [SPARK-22094][SS]processAllAvailable should check...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19314 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19229 ping @zhengruifeng @WeichenXu123 Any more comments on this? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19314 Thanks! Merging to master and branch-2.2 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290

I initially did this, for example,

```
\href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{Spark Data Types} for available data types.
```

This passes the lint check and the doc is fine, but the CRAN check fails. I tried to find a way around it but ended up with `nolint`. ... I will read the doc once more and try out a few more cases locally.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 Doh, you mean the current status. Yes, I checked. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19312#discussion_r140410448

--- Diff: dev/create-release/release-build.sh ---
@@ -95,6 +95,28 @@ if [ -z "$SPARK_VERSION" ]; then
   | grep -v INFO | grep -v WARNING | grep -v Download)
 fi

+# Verify we have the right java version set
+java_version=$("${JAVA_HOME}"/bin/javac -version 2>&1 | cut -d " " -f 2)
--- End diff --

@holdenk, should we maybe catch the case when `JAVA_HOME` is missing too? If so, I think we could do something like ...

```bash
if [ -z "$JAVA_HOME" ]; then
  echo "Please set JAVA_HOME."
  exit 1
fi
...
```

Or maybe ...

```bash
if [[ -x "$JAVA_HOME/bin/javac" ]]; then
  javac_cmd="$JAVA_HOME/bin/javac"
else
  javac_cmd=javac
fi
java_version=$("$javac_cmd" -version 2>&1 | cut -d " " -f 2)
...
```

I tested both locally.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19318 **[Test build #82059 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82059/testReport)** for PR 19318 at commit [`efb0fe9`](https://github.com/apache/spark/commit/efb0fe9c0544d8666c423ba9bde533735961ea75). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19318 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82059/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19318 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19290

btw, could you check, if you haven't already, whether roxygen is going to handle the `nolint` around the `http` link correctly?

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19318 **[Test build #82059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82059/testReport)** for PR 19318 at commit [`efb0fe9`](https://github.com/apache/spark/commit/efb0fe9c0544d8666c423ba9bde533735961ea75). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19318: [SPARK-22096][ML] use aggregateByKeyLocally in fe...
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/19318

[SPARK-22096][ML] use aggregateByKeyLocally in feature frequency calc…

## What changes were proposed in this pull request?

NaiveBayes currently uses aggregateByKey followed by a collect to calculate the frequency of each feature/label. We can implement a new function, 'aggregateByKeyLocally', in RDD that merges locally on each mapper before sending results to a reducer, saving one stage. We tested NaiveBayes and saw a ~20% performance gain with these changes.

Signed-off-by: Vincent Xie

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VinceShieh/spark SPARK-22096

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19318.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19318

commit efb0fe9c0544d8666c423ba9bde533735961ea75
Author: Vincent Xie
Date: 2017-09-22T03:57:08Z

[SPARK-22096][ML] use aggregateByKeyLocally in feature frequency calculation

NaiveBayes currently uses aggregateByKey followed by a collect to calculate the frequency of each feature/label. We can implement a new function 'aggregateByKeyLocally' in RDD that merges locally on each mapper before sending results to a reducer to save one stage. We tested NaiveBayes and saw a ~20% performance gain with these changes.

Signed-off-by: Vincent Xie

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19122 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19122 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82058/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19122 **[Test build #82058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82058/testReport)** for PR 19122 at commit [`3464dfe`](https://github.com/apache/spark/commit/3464dfea1f008e945a5e608b593877d1cbdf0e35). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19317 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19317 cc @VinceShieh --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...
GitHub user ConeyLiu opened a pull request: https://github.com/apache/spark/pull/19317

[SPARK-22098][CORE] Add new method aggregateByKeyLocally in RDD

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-22096

NaiveBayes currently uses aggregateByKey followed by a collect to calculate the frequency of each feature/label. We can implement a new function, 'aggregateByKeyLocally', in RDD that merges locally on each mapper before sending results to a reducer, saving one stage. We tested NaiveBayes and saw a ~20% performance gain with these changes. This is a subtask of our improvement.

## How was this patch tested?

New UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ConeyLiu/spark aggregatebykeylocally

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19317.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19317

commit 73a85dc5963ac46f181a9499deabb18da4ccc308
Author: Xianyang Liu
Date: 2017-08-31T05:16:09Z

add new method 'aggregateByKeyLocally'

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
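A rough Python sketch of what such an `aggregateByKeyLocally` could do (hypothetical, for illustration only — the actual PR implements this in Scala): merge values per key inside each partition, then combine the small per-partition maps on the driver, so no shuffle stage is needed.

```python
def aggregate_by_key_locally(rdd, zero_value, seq_op, comb_op):
    """rdd: (key, value) pairs exposing mapPartitions()/collect()."""
    def merge_partition(iterator):
        acc = {}
        for key, value in iterator:
            # Fold each value into the per-partition accumulator for its key.
            acc[key] = seq_op(acc.get(key, zero_value), value)
        yield acc  # one partial map per partition

    result = {}
    # Collect one small dict per partition and merge them on the driver.
    for partial in rdd.mapPartitions(merge_partition).collect():
        for key, value in partial.items():
            result[key] = comb_op(result[key], value) if key in result else value
    return result
```

For NaiveBayes-style frequency counting, `seq_op` and `comb_op` would both be addition; the driver receives one map per partition instead of a shuffled RDD.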
[GitHub] spark issue #19316: [SPARK-22097][CORE]Call serializationStream.close after ...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19316 @cloud-fan Pls take a look. Thanks a lot. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19312 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82056/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19312 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19312 **[Test build #82056 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82056/testReport)** for PR 19312 at commit [`aa4cbf6`](https://github.com/apache/spark/commit/aa4cbf69b080435bc836dc9820307fba6588). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...
Github user ConeyLiu commented on a diff in the pull request: https://github.com/apache/spark/pull/19316#discussion_r140408246

--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -387,11 +387,18 @@ private[spark] class MemoryStore(
       // the block's actual memory usage has exceeded the unroll memory by a small amount, so we
       // perform one final call to attempt to allocate additional memory if necessary.
       if (keepUnrolling) {
-        serializationStream.close()
-        reserveAdditionalMemoryIfNecessary()
+        serializationStream.flush()
+        if (bbos.size > unrollMemoryUsedByThisBlock) {
+          val amountToRequest = bbos.size - unrollMemoryUsedByThisBlock
+          keepUnrolling = reserveUnrollMemoryForThisTask(blockId, amountToRequest, memoryMode)
+          if (keepUnrolling) {
+            unrollMemoryUsedByThisBlock += amountToRequest
+          }
+        }
       }

       if (keepUnrolling) {
+        serializationStream.close()
--- End diff --

Here we should close the `serializationStream` only after we have checked the size again. The previous code closed it first and then requested the excess memory, so there was a potential problem: we might fail to request enough memory while the `serializationStream` was already closed.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...
Github user ConeyLiu commented on a diff in the pull request: https://github.com/apache/spark/pull/19316#discussion_r140408116

--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -387,11 +387,18 @@ private[spark] class MemoryStore(
       // the block's actual memory usage has exceeded the unroll memory by a small amount, so we
       // perform one final call to attempt to allocate additional memory if necessary.
       if (keepUnrolling) {
-        serializationStream.close()
-        reserveAdditionalMemoryIfNecessary()
+        serializationStream.flush()
+        if (bbos.size > unrollMemoryUsedByThisBlock) {
+          val amountToRequest = bbos.size - unrollMemoryUsedByThisBlock
--- End diff --

Here we only need to request `bbos.size - unrollMemoryUsedByThisBlock`. I'm sorry, this mistake may have been introduced by my previous patch [SPARK-21923](https://github.com/apache/spark/pull/19135).

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19316: [SPARK-22097][CORE]Call serializationStream.close after ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19316 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...
GitHub user ConeyLiu opened a pull request: https://github.com/apache/spark/pull/19316

[SPARK-22097][CORE] Call serializationStream.close after we requested enough memory

## What changes were proposed in this pull request?

In the current code, we close the `serializationStream` after we have unrolled the block. However, there is a potential problem: the size of the underlying vector or stream may be larger than the memory we requested, so we need to check it again carefully.

## How was this patch tested?

Existing UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ConeyLiu/spark putIteratorAsBytes

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19316.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19316

commit bfe162e3aad300414dcc3fe25a3d70025e1795dd
Author: Xianyang Liu
Date: 2017-09-22T03:29:39Z

close the serializationStream after check the memory request

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19168: [SPARK-21956][CORE] Fetch up to max bytes when buf reall...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19168

Sorry for replying so late. I added some benchmark tests for this PR, @kiszk. And @jerryshao, could you help review this PR? Thanks.

```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 10 iterations, 2423 ms
  Running case: Testing fetch after releasing!
  Stopped after 18 iterations, 2036 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
Testing fetch before releasing!                           46 /  242      345.0          2.9      1.0X
Testing fetch after releasing!                            73 /  113      215.7          4.6      0.6X
```

```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 10 iterations, 3888 ms
  Running case: Testing fetch after releasing!
  Stopped after 10 iterations, 3970 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
Testing fetch before releasing!                          100 /  389      157.8          6.3      1.0X
Testing fetch after releasing!                           151 /  397      104.3          9.6      0.7X
```

```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 15 iterations, 2016 ms
  Running case: Testing fetch after releasing!
  Stopped after 14 iterations, 2110 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
Testing fetch before releasing!                           43 /  134      363.8          2.7      1.0X
Testing fetch after releasing!                            99 /  151      158.1          6.3      0.4X
```

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19278 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19278 **[Test build #82057 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82057/testReport)** for PR 19278 at commit [`8f78f59`](https://github.com/apache/spark/commit/8f78f596473877f3e8a0169f998f16a6bf1a8f5a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19278 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82057/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19278

@jkbradley Sure, I tested the backwards compatibility. Part of the reason I changed to `DefaultParamReader.getAndSetParams` was backwards compatibility.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19122 **[Test build #82058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82058/testReport)** for PR 19122 at commit [`3464dfe`](https://github.com/apache/spark/commit/3464dfea1f008e945a5e608b593877d1cbdf0e35). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r140402700

--- Diff: python/pyspark/ml/tests.py ---
@@ -836,6 +836,27 @@ def test_save_load_simple_estimator(self):
         loadedModel = CrossValidatorModel.load(cvModelPath)
         self.assertEqual(loadedModel.bestModel.uid, cvModel.bestModel.uid)

+    def test_parallel_evaluation(self):
+        dataset = self.spark.createDataFrame(
+            [(Vectors.dense([0.0]), 0.0),
+             (Vectors.dense([0.4]), 1.0),
+             (Vectors.dense([0.5]), 0.0),
+             (Vectors.dense([0.6]), 1.0),
+             (Vectors.dense([1.0]), 1.0)] * 10,
+            ["features", "label"])
+
+        lr = LogisticRegression()
+        grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
+        evaluator = BinaryClassificationEvaluator()
+
+        # test save/load of CrossValidator
+        cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
+        cv.setParallelism(1)
+        cvSerialModel = cv.fit(dataset)
+        cv.setParallelism(2)
+        cvParallelModel = cv.fit(dataset)
+        self.assertEqual(sorted(cvSerialModel.avgMetrics), sorted(cvParallelModel.avgMetrics))
--- End diff --

hmm... I tried. But how to get model parents?
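The assertion at the end of that diff compares serial and parallel runs by their sorted metric lists. The underlying idea can be sketched with the standard library alone; the `evaluate` function below is a hypothetical, deterministic stand-in for fitting and scoring one parameter combination, not Spark's actual CrossValidator:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(params):
    # Hypothetical stand-in for fitting one param combination and
    # scoring it; deterministic so serial and parallel runs agree.
    return params["maxIter"] * 0.1 + 0.5

grid = [{"maxIter": m} for m in (0, 1, 5, 10)]

# Serial pass: one evaluation at a time, in grid order.
serial_metrics = [evaluate(p) for p in grid]

# Parallel pass: the same evaluations fanned out over a thread pool.
# executor.map preserves input order, so the lists line up directly;
# sorting (as the test above does) also guards against reordering.
with ThreadPoolExecutor(max_workers=2) as executor:
    parallel_metrics = list(executor.map(evaluate, grid))

assert sorted(serial_metrics) == sorted(parallel_metrics)
```

Because the per-combination evaluation is deterministic, parallelism changes only the completion order, never the set of metrics, which is what the test's sorted comparison checks.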
[GitHub] spark issue #19315: Updated english.txt word ordering
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19315

Can one of the admins verify this patch?
[GitHub] spark pull request #19315: Updated english.txt word ordering
GitHub user animenon opened a pull request: https://github.com/apache/spark/pull/19315

Updated english.txt word ordering

Ordered alphabetically, for better readability.

## What changes were proposed in this pull request?

Alphabetical ordering of the stop words.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/animenon/spark patch-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19315.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19315

commit 57c282721c63487a82bdd6959c6ff5f6ce9f66ad
Author: Anirudh
Date: 2017-09-22T02:40:30Z

    Updated english.txt word ordering

    Ordered alphabetically, for better readability.
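The change itself is mechanical; a minimal sketch of alphabetically sorting a stop-word list, one word per line, is shown below (sample words only, not the real contents of english.txt):

```python
# Alphabetically sort a stop-word list, one word per line,
# as proposed for english.txt (sample words only).
words = ["the", "a", "and", "about", "zero", "because"]
sorted_words = sorted(words)

# For a real file, the same idea reads, sorts, and rewrites the lines:
#   lines = open("english.txt").read().splitlines()
#   open("english.txt", "w").write("\n".join(sorted(lines)) + "\n")
```

Python's `sorted` compares strings lexicographically by code point, so for a plain ASCII word list this matches the alphabetical order the PR describes.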
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19314

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82055/
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19314

Merged build finished. Test PASSed.
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19314

**[Test build #82055 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82055/testReport)** for PR 19314 at commit [`a4a02a6`](https://github.com/apache/spark/commit/a4a02a69bf41906c03e46c50d0eca75d6844465a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13794: [SPARK-15574][ML][PySpark] Python meta-algorithms in Sca...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13794

cc @srowen Can you help close this? We won't need this feature for now.