[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18945
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18945
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82065/
Test FAILed.


---




[GitHub] spark issue #19315: Updated english.txt word ordering

2017-09-21 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19315
  
@animenon can you please fix the PR title to match what other PRs did? Also, is this only for better readability, or does it fix another issue? IMO the previous txt was more readable than your change, since the words were grouped by kind.


---




[GitHub] spark issue #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide jump link...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18015
  
**[Test build #82064 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82064/testReport)**
 for PR 18015 at commit 
[`21e2c31`](https://github.com/apache/spark/commit/21e2c31369b2223d0bee16b9bc98373ab0ec59a9).


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread logannc
Github user logannc commented on the issue:

https://github.com/apache/spark/pull/18945
  
I've continued to use @HyukjinKwon's suggestion because it should be more performant and is capable of handling this without loss of precision. I believe I've addressed your concerns by only changing the type when we encounter a null (duh).


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18945
  
**[Test build #82063 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82063/testReport)**
 for PR 18945 at commit 
[`bd25923`](https://github.com/apache/spark/commit/bd259239c550b0b19311968aff9a69da29a6a05e).


---




[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...

2017-09-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19301#discussion_r140416279
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala ---
@@ -72,11 +74,19 @@ object AggregateExpression {
       aggregateFunction: AggregateFunction,
       mode: AggregateMode,
       isDistinct: Boolean): AggregateExpression = {
+    val state = if (aggregateFunction.resolved) {
+      Seq(aggregateFunction.toString, aggregateFunction.dataType,
+        aggregateFunction.nullable, mode, isDistinct)
+    } else {
+      Seq(aggregateFunction.toString, mode, isDistinct)
+    }
+    val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
+
     AggregateExpression(
       aggregateFunction,
       mode,
       isDistinct,
-      NamedExpression.newExprId)
+      ExprId(hashCode))
--- End diff --

I don't think this is the right fix. Semantically, the `b0` and `b1` in `SELECT SUM(b) AS b0, SUM(b) AS b1` are different aggregate functions, so they should have different `resultId`s.

It's kind of an optimization in the aggregate planner: we should detect these semantically different but duplicated aggregate functions and only plan one aggregate function.


---




[GitHub] spark issue #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide jump link...

2017-09-21 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/18015
  
Yes, I'm fine with it. @ajbozarth, would you please take another look at this PR? Thanks.


---




[GitHub] spark issue #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide jump link...

2017-09-21 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/18015
  
Jenkins, retest this please.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-09-21 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
@mallman how about adding a comment explaining why this workaround was done, plus the bug number in parquet-mr? Then in the future, once that bug is fixed, the code can be cleaned up.

Also, maybe it's time to remove "DO NOT MERGE" from the title? As I understand it, most of the comments were addressed :)

Thank you very much for your work on this feature. I must admit that we are looking forward to having this merged. For us this will be the most important improvement in Spark 2.3.0 (I hope it will be part of 2.3.0 :) )


---




[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide ju...

2017-09-21 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18015#discussion_r140416046
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---
@@ -61,7 +59,37 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with Logging {
 
       details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
     }}
 
-    UIUtils.headerSparkPage("SQL", content, parent, Some(5000))
+
+    val summary: NodeSeq =
+      <div>
+        <ul class="unstyled">
+          {
+          if (listener.getRunningExecutions.nonEmpty) {
+            <li>
+              <strong>Running Queries:</strong>
+              {listener.getRunningExecutions.size}
+            </li>
+          }
+          }
+          {
+          if (listener.getCompletedExecutions.nonEmpty) {
+            <li>
+              <strong>Completed Queries:</strong>
+              {listener.getCompletedExecutions.size}
+            </li>
+          }
+          }
+          {
+          if (listener.getFailedExecutions.nonEmpty) {
+            <li>
+              <strong>Failed Queries:</strong>
+              {listener.getFailedExecutions.size}
+            </li>
+          }
+          }
--- End diff --

Is the indentation here correct? It seems a little weird to me.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18945
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18945
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82062/
Test FAILed.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18945
  
**[Test build #82062 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82062/testReport)**
 for PR 18945 at commit 
[`6e248dd`](https://github.com/apache/spark/commit/6e248ddf96122910468a3f20125ff4fc9f32f299).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140415073
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

I will take my suggestion back. I think their suggestions are better than mine.


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140414783
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

Ah, I see where I got confused. I had started with @ueshin's suggestion but abandoned it because I didn't want to create the DataFrame before the type correction, since I was also looking at @HyukjinKwon's suggestion. I somehow ended up combining them incorrectly.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18945
  
**[Test build #82062 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82062/testReport)**
 for PR 18945 at commit 
[`6e248dd`](https://github.com/apache/spark/commit/6e248ddf96122910468a3f20125ff4fc9f32f299).


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140414202
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

A simple problem with this line: even if this condition is met, it doesn't necessarily mean there are null values in the column. But you forcibly set the type to np.float32 anyway.


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140414042
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

Have you carefully read the comments in https://github.com/apache/spark/pull/18945#discussion_r134033952 and https://github.com/apache/spark/pull/18945#discussion_r134925269? They are good suggestions for this issue. I don't know why you don't want to follow them and check for null values with Pandas...
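
For illustration, a minimal pandas sketch of the approach those comments suggest (hypothetical helper, not code from this PR): collect into pandas first, then upcast only the int columns that actually contain nulls:

```python
import numpy as np

def upcast_null_int_columns(pdf, nullable_int_columns):
    # Only columns that really contain nulls need a float dtype.
    for name in nullable_int_columns:
        if pdf[name].isnull().any():
            # Integers above 2**24 lose precision in np.float32.
            max_abs = pdf[name].dropna().abs().max()
            pdf[name] = pdf[name].astype(
                np.float64 if max_abs > 16777216 else np.float32)
    return pdf
```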


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140413579
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

Can you elaborate? I believe it is correct, per my reply to your comment in the `null_handler`.


---




[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...

2017-09-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19204


---




[GitHub] spark issue #19204: [SPARK-21981][PYTHON][ML] Added Python interface for Clu...

2017-09-21 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/19204
  
Merged into master, thanks.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18945
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18945
  
**[Test build #82061 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82061/testReport)**
 for PR 18945 at commit 
[`14f36c3`](https://github.com/apache/spark/commit/14f36c354f65a34e3e06cd4d35029e5f8f2b79f0).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18945
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82061/
Test FAILed.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18945
  
**[Test build #82061 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82061/testReport)**
 for PR 18945 at commit 
[`14f36c3`](https://github.com/apache/spark/commit/14f36c354f65a34e3e06cd4d35029e5f8f2b79f0).


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140412857
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = {}
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
--- End diff --

If `pandas_type in (np.int8, np.int16, np.int32) and field.nullable` and 
there are ANY non-null values, the dtype of the column is changed to 
`np.float32` or `np.float64`, both of which properly handle `None` values.

That said, if the entire column was `None`, it would fail. Therefore I have 
preemptively changed the type on line 1787 to `np.float32`. Per `null_handler`, 
it may still change to `np.float64` if needed.
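
For context on the `16777216` constant in the diff: it is 2**24, the largest integer magnitude that np.float32 represents exactly. A quick check:

```python
import numpy as np

# float32 has a 24-bit significand, so integers are exact only up to 2**24.
print(np.float32(16777216) == 16777216)  # True
print(np.float32(16777217) == 16777217)  # False: rounds to 16777216.0
print(np.float64(16777217) == 16777217)  # True: float64 is exact up to 2**53
```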


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140412745
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = {}
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
--- End diff --

If `pandas_type in (np.int8, np.int16, np.int32) and field.nullable` and 
there are ANY non-null values, the dtype of the column is changed to 
`np.float32` or `np.float64`, both of which properly handle `None` values.

That said, if the entire column was `None`, it would fail. Therefore I have 
preemptively changed the type on line 1787 to `np.float32`. Per `null_handler`, 
it may still change to `np.float64` if needed.


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140412632
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

I don't think this is a correct fix.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18945
  
**[Test build #82060 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82060/testReport)**
 for PR 18945 at commit 
[`b313a3b`](https://github.com/apache/spark/commit/b313a3b8fc88898423940f195ab16bd3a57c0061).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18945
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18945
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82060/
Test FAILed.


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18945
  
**[Test build #82060 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82060/testReport)**
 for PR 18945 at commit 
[`b313a3b`](https://github.com/apache/spark/commit/b313a3b8fc88898423940f195ab16bd3a57c0061).


---




[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...

2017-09-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19229#discussion_r140412254
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2102,6 +2102,55 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * Returns a new Dataset by adding columns or replacing the existing columns that have
+   * the same names.
+   */
+  private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
--- End diff --

@cloud-fan looked at this `withColumns` before in #17819. cc @cloud-fan to see if you have more comments.


---




[GitHub] spark pull request #19314: [SPARK-22094][SS]processAllAvailable should check...

2017-09-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19314


---




[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-21 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19229
  
ping @zhengruifeng @WeichenXu123 Any more comments on this? Thanks.


---




[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...

2017-09-21 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19314
  
Thanks! Merging to master and branch-2.2


---




[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...

2017-09-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19290
  
I initially did this, for example,

```
\href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{Spark Data Types} for available data types.
```

This passes the lint and the doc is fine, but the CRAN check fails. I actually tried to find a way around it but ended up with `nolint`... will try to read the doc once more and try out a few more cases locally.


---




[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...

2017-09-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19290
  
Doh, you mean the current status. Yes, I checked.


---




[GitHub] spark pull request #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release ...

2017-09-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19312#discussion_r140410448
  
--- Diff: dev/create-release/release-build.sh ---
@@ -95,6 +95,28 @@ if [ -z "$SPARK_VERSION" ]; then
 | grep -v INFO | grep -v WARNING | grep -v Download)
 fi
 
+# Verify we have the right java version set
+java_version=$("${JAVA_HOME}"/bin/javac -version 2>&1 | cut -d " " -f 2)
--- End diff --

@holdenk, should we maybe catch the case when `JAVA_HOME` is missing too?

If so, I think we could do something like ...

```bash
if [ -z "$JAVA_HOME" ]; then
  echo "Please set JAVA_HOME."
  exit 1
fi
...
```

Or maybe...

```bash
if [[ -x "$JAVA_HOME/bin/javac" ]]; then
  javac_cmd="$JAVA_HOME/bin/javac"
else
  javac_cmd=javac
fi

java_version=$("$javac_cmd" -version 2>&1 | cut -d " " -f 2)
...
```

I tested both locally.


---




[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19318
  
**[Test build #82059 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82059/testReport)**
 for PR 19318 at commit 
[`efb0fe9`](https://github.com/apache/spark/commit/efb0fe9c0544d8666c423ba9bde533735961ea75).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19318
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82059/
Test FAILed.


---




[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19318
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...

2017-09-21 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/19290
  
btw, could you check (if you haven't already) whether roxygen handles the `http` link correctly with `nolint` around it?


---




[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19318
  
**[Test build #82059 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82059/testReport)**
 for PR 19318 at commit 
[`efb0fe9`](https://github.com/apache/spark/commit/efb0fe9c0544d8666c423ba9bde533735961ea75).


---




[GitHub] spark pull request #19318: [SPARK-22096][ML] use aggregateByKeyLocally in fe...

2017-09-21 Thread VinceShieh
GitHub user VinceShieh opened a pull request:

https://github.com/apache/spark/pull/19318

[SPARK-22096][ML] use aggregateByKeyLocally in feature frequency calc…

## What changes were proposed in this pull request?

NaiveBayes currently uses aggregateByKey followed by a collect to calculate the frequency of each feature/label. We can implement a new function 'aggregateByKeyLocally' in RDD that merges locally on each mapper before sending results to a reducer, saving one stage.

We tested this on NaiveBayes and saw a ~20% performance gain with these changes.

Signed-off-by: Vincent Xie 
## How was this patch tested?
existing test
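
For readers following along, a rough PySpark sketch of the idea (illustrative only; the PR implements this in Scala on RDD, and these names are hypothetical):

```python
def aggregate_by_key_locally(rdd, zero_value, seq_op, comb_op):
    # Aggregate each partition into a local dict on the mapper, then merge
    # the per-partition dicts, avoiding the shuffle stage that
    # aggregateByKey().collect() would need.
    def aggregate_partition(iterator):
        acc = {}
        for key, value in iterator:
            acc[key] = seq_op(acc.get(key, zero_value), value)
        yield acc  # one dict per partition

    def merge_dicts(m1, m2):
        for key, value in m2.items():
            m1[key] = comb_op(m1[key], value) if key in m1 else value
        return m1

    return rdd.mapPartitions(aggregate_partition).reduce(merge_dicts)

# e.g. per-label counts without a shuffle:
# counts = aggregate_by_key_locally(pairs, 0, lambda acc, _: acc + 1,
#                                   lambda a, b: a + b)
```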

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VinceShieh/spark SPARK-22096

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19318.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19318


commit efb0fe9c0544d8666c423ba9bde533735961ea75
Author: Vincent Xie 
Date:   2017-09-22T03:57:08Z

[SPARK-22096][ML] use aggregateByKeyLocally in feature frequency calculation

NaiveBayes currently uses aggregateByKey followed by a collect to calculate the frequency of each feature/label. We can implement a new function 'aggregateByKeyLocally' in RDD that merges locally on each mapper before sending results to a reducer, saving one stage.

We tested this on NaiveBayes and saw a ~20% performance gain with these changes.

Signed-off-by: Vincent Xie 




---




[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19122
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19122
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82058/
Test PASSed.


---




[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19122
  
**[Test build #82058 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82058/testReport)**
 for PR 19122 at commit 
[`3464dfe`](https://github.com/apache/spark/commit/3464dfea1f008e945a5e608b593877d1cbdf0e35).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19317
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-21 Thread ConeyLiu
Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/19317
  
cc @VinceShieh 


---




[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...

2017-09-21 Thread ConeyLiu
GitHub user ConeyLiu opened a pull request:

https://github.com/apache/spark/pull/19317

[SPARK-22098][CORE] Add new method aggregateByKeyLocally in RDD

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-22096

NaiveBayes currently uses aggregateByKey followed by a collect to calculate the frequency of each feature/label. We can implement a new function 'aggregateByKeyLocally' in RDD that merges locally on each mapper before sending results to a reducer, saving one stage.
We tested this on NaiveBayes and saw a ~20% performance gain with these changes.

This is a subtask of our improvement.

## How was this patch tested?

New UT.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ConeyLiu/spark aggregatebykeylocally

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19317.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19317


commit 73a85dc5963ac46f181a9499deabb18da4ccc308
Author: Xianyang Liu 
Date:   2017-08-31T05:16:09Z

add new method 'aggregateByKeyLocally'




---




[GitHub] spark issue #19316: [SPARK-22097][CORE]Call serializationStream.close after ...

2017-09-21 Thread ConeyLiu
Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/19316
  
@cloud-fan Please take a look. Thanks a lot.


---




[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19312
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82056/
Test PASSed.


---




[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19312
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...

2017-09-21 Thread ConeyLiu
Github user ConeyLiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19316#discussion_r140408246
  
--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -387,11 +387,18 @@ private[spark] class MemoryStore(
     // the block's actual memory usage has exceeded the unroll memory by a small amount, so we
     // perform one final call to attempt to allocate additional memory if necessary.
     if (keepUnrolling) {
-      serializationStream.close()
-      reserveAdditionalMemoryIfNecessary()
+      serializationStream.flush()
+      if (bbos.size > unrollMemoryUsedByThisBlock) {
+        val amountToRequest = bbos.size - unrollMemoryUsedByThisBlock
+        keepUnrolling = reserveUnrollMemoryForThisTask(blockId, amountToRequest, memoryMode)
+        if (keepUnrolling) {
+          unrollMemoryUsedByThisBlock += amountToRequest
+        }
+      }
     }
 
     if (keepUnrolling) {
+      serializationStream.close()
--- End diff --

Here, we should close the `serializationStream` after we check the memory again. The previous code closed it first and then requested the excess memory, so there is a potential problem: we may fail to request enough memory while the `serializationStream` is already closed.


---




[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19312
  
**[Test build #82056 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82056/testReport)**
 for PR 19312 at commit 
[`aa4cbf6`](https://github.com/apache/spark/commit/aa4cbf69b080435bc836dc9820307fba6588).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...

2017-09-21 Thread ConeyLiu
Github user ConeyLiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19316#discussion_r140408116
  
--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -387,11 +387,18 @@ private[spark] class MemoryStore(
     // the block's actual memory usage has exceeded the unroll memory by a small amount, so we
     // perform one final call to attempt to allocate additional memory if necessary.
     if (keepUnrolling) {
-      serializationStream.close()
-      reserveAdditionalMemoryIfNecessary()
+      serializationStream.flush()
+      if (bbos.size > unrollMemoryUsedByThisBlock) {
+        val amountToRequest = bbos.size - unrollMemoryUsedByThisBlock
--- End diff --

Here, we only need to request `bbos.size - unrollMemoryUsedByThisBlock`. I'm sorry, this mistake may have been introduced by my previous patch [SPARK-21923](https://github.com/apache/spark/pull/19135).


---




[GitHub] spark issue #19316: [SPARK-22097][CORE]Call serializationStream.close after ...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19316
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...

2017-09-21 Thread ConeyLiu
GitHub user ConeyLiu opened a pull request:

https://github.com/apache/spark/pull/19316

[SPARK-22097][CORE]Call serializationStream.close after we requested enough 
memory

## What changes were proposed in this pull request?


In the current code, we close the `serializationStream` after we have unrolled the block. However, there is a potential problem: the size of the underlying vector or stream may be larger than the memory we requested. So here, we need to check it again carefully.

## How was this patch tested?

Existing UT.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ConeyLiu/spark putIteratorAsBytes

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19316.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19316


commit bfe162e3aad300414dcc3fe25a3d70025e1795dd
Author: Xianyang Liu 
Date:   2017-09-22T03:29:39Z

close the serializationStream after checking the memory request




---




[GitHub] spark issue #19168: [SPARK-21956][CORE] Fetch up to max bytes when buf reall...

2017-09-21 Thread caneGuy
Github user caneGuy commented on the issue:

https://github.com/apache/spark/pull/19168
  
Sorry for replying so late.
I added some benchmark tests for this PR, @kiszk.
And @jerryshao, could you help review this PR? Thanks.
```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 10 iterations, 2423 ms
  Running case: Testing fetch after releasing!
  Stopped after 18 iterations, 2036 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------
Testing fetch before releasing!                           46 /  242          345.0           2.9       1.0X
Testing fetch after releasing!                            73 /  113          215.7           4.6       0.6X
```
```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 10 iterations, 3888 ms
  Running case: Testing fetch after releasing!
  Stopped after 10 iterations, 3970 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------
Testing fetch before releasing!                          100 /  389          157.8           6.3       1.0X
Testing fetch after releasing!                           151 /  397          104.3           9.6       0.7X
```
```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 15 iterations, 2016 ms
  Running case: Testing fetch after releasing!
  Stopped after 14 iterations, 2110 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------
Testing fetch before releasing!                           43 /  134          363.8           2.7       1.0X
Testing fetch after releasing!                            99 /  151          158.1           6.3       0.4X
```


---




[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19278
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19278
  
**[Test build #82057 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82057/testReport)**
 for PR 19278 at commit 
[`8f78f59`](https://github.com/apache/spark/commit/8f78f596473877f3e8a0169f998f16a6bf1a8f5a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19278
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82057/
Test PASSed.


---




[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19278
  
@jkbradley Sure, I tested the backwards compatibility. Part of the reason I changed to `DefaultParamReader.getAndSetParams` is backwards compatibility.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19122
  
**[Test build #82058 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82058/testReport)**
 for PR 19122 at commit 
[`3464dfe`](https://github.com/apache/spark/commit/3464dfea1f008e945a5e608b593877d1cbdf0e35).


---




[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19122#discussion_r140402700
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -836,6 +836,27 @@ def test_save_load_simple_estimator(self):
         loadedModel = CrossValidatorModel.load(cvModelPath)
         self.assertEqual(loadedModel.bestModel.uid, cvModel.bestModel.uid)
 
+    def test_parallel_evaluation(self):
+        dataset = self.spark.createDataFrame(
+            [(Vectors.dense([0.0]), 0.0),
+             (Vectors.dense([0.4]), 1.0),
+             (Vectors.dense([0.5]), 0.0),
+             (Vectors.dense([0.6]), 1.0),
+             (Vectors.dense([1.0]), 1.0)] * 10,
+            ["features", "label"])
+
+        lr = LogisticRegression()
+        grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
+        evaluator = BinaryClassificationEvaluator()
+
+        # test save/load of CrossValidator
+        cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
+        cv.setParallelism(1)
+        cvSerialModel = cv.fit(dataset)
+        cv.setParallelism(2)
+        cvParallelModel = cv.fit(dataset)
+        self.assertEqual(sorted(cvSerialModel.avgMetrics), sorted(cvParallelModel.avgMetrics))
--- End diff --

hmm... I tried, but how do I get the model parents?


---




[GitHub] spark issue #19315: Updated english.txt word ordering

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19315
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19315: Updated english.txt word ordering

2017-09-21 Thread animenon
GitHub user animenon opened a pull request:

https://github.com/apache/spark/pull/19315

Updated english.txt word ordering

Ordered alphabetically, for better readability.

## What changes were proposed in this pull request?

Alphabetical ordering of the stop words.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/animenon/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19315.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19315


commit 57c282721c63487a82bdd6959c6ff5f6ce9f66ad
Author: Anirudh 
Date:   2017-09-22T02:40:30Z

Updated english.txt word ordering

Ordered alphabetically, for better readability.




---




[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19314
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82055/
Test PASSed.


---




[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19314
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19314
  
**[Test build #82055 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82055/testReport)**
 for PR 19314 at commit 
[`a4a02a6`](https://github.com/apache/spark/commit/a4a02a69bf41906c03e46c50d0eca75d6844465a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #13794: [SPARK-15574][ML][PySpark] Python meta-algorithms in Sca...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13794
  
cc @srowen Can you help close this? We won't need this feature for now.


---




[GitHub] spark issue #19279: [SPARK-22061] [ML]add pipeline model of SVM

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19279
  
cc @srowen Can you help close this?


---




[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19278
  
**[Test build #82057 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82057/testReport)**
 for PR 19278 at commit 
[`8f78f59`](https://github.com/apache/spark/commit/8f78f596473877f3e8a0169f998f16a6bf1a8f5a).


---




[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19278#discussion_r140398933
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -159,12 +159,15 @@ class CrossValidatorSuite
   .setEvaluator(evaluator)
   .setNumFolds(20)
   .setEstimatorParamMaps(paramMaps)
+  .setSeed(42L)
+  .setParallelism(2)
--- End diff --

ditto.


---




[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19278#discussion_r140398857
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala 
---
@@ -160,11 +160,13 @@ class TrainValidationSplitSuite
   .setTrainRatio(0.5)
   .setEstimatorParamMaps(paramMaps)
   .setSeed(42L)
+  .setParallelism(2)
--- End diff --

No. The model does not own the `parallel` parameter. This was discussed before.


---




[GitHub] spark issue #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calculate i...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19281
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calculate i...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19281
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82054/
Test PASSed.


---




[GitHub] spark issue #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calculate i...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19281
  
**[Test build #82054 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82054/testReport)**
 for PR 19281 at commit 
[`78f41e0`](https://github.com/apache/spark/commit/78f41e046ea4e307e43315b23ed3212e1f4c3f1b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19293: [SPARK-22079][SQL] Serializer in HiveOutputWriter miss l...

2017-09-21 Thread LantaoJin
Github user LantaoJin commented on the issue:

https://github.com/apache/spark/pull/19293
  
@gatorsmile @cloud-fan Please help review this.


---




[GitHub] spark pull request #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDF...

2017-09-21 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18659#discussion_r140396700
  
--- Diff: python/pyspark/serializers.py ---
@@ -199,6 +211,55 @@ def __repr__(self):
 return "ArrowSerializer"
 
 
+class ArrowPandasSerializer(ArrowSerializer):
+"""
+Serializes Pandas.Series as Arrow data.
+"""
+
+def __init__(self):
+super(ArrowPandasSerializer, self).__init__()
--- End diff --

Do we need this?


---




[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...

2017-09-21 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/19301
  
@cenyuhai This is an optimization of the physical plan, and your case can be optimized.

```SQL
select dt,
geohash_of_latlng,
sum(mt_cnt),
sum(ele_cnt),
round(sum(mt_cnt) * 1.0 * 100 / sum(mt_cnt_all), 2),
round(sum(ele_cnt) * 1.0 * 100 / sum(ele_cnt_all), 2)
from values(1, 2, 3, 4, 5, 6) as (dt, geohash_of_latlng, mt_cnt, ele_cnt, 
mt_cnt_all, ele_cnt_all)
group by dt, geohash_of_latlng
order by dt, geohash_of_latlng limit 10
```

Before:

```
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[dt#26 ASC NULLS 
FIRST,geohash_of_latlng#27 ASC NULLS FIRST], 
output=[dt#26,geohash_of_latlng#27,sum(mt_cnt)#38L,sum(ele_cnt)#39L,round((CAST((CAST((CAST(CAST(sum(CAST(mt_cnt
 AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) 
AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS 
DECIMAL(38,2)) / CAST(CAST(sum(CAST(mt_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS 
DECIMAL(38,2))), 2)#40,round((CAST((CAST((CAST(CAST(sum(CAST(ele_cnt AS 
BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS 
DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS 
DECIMAL(38,2)) / CAST(CAST(sum(CAST(ele_cnt_all AS BIGINT)) AS DECIMAL(20,0)) 
AS DECIMAL(38,2))), 2)#41])
+- *HashAggregate(keys=[dt#26, geohash_of_latlng#27], 
functions=[sum(cast(mt_cnt#28 as bigint)), sum(cast(ele_cnt#29 as bigint)), 
sum(cast(mt_cnt#28 as bigint)), sum(cast(mt_cnt_all#30 as bigint)), 
sum(cast(ele_cnt#29 as bigint)), sum(cast(ele_cnt_all#31 as bigint))])
   +- Exchange hashpartitioning(dt#26, geohash_of_latlng#27, 200)
  +- *HashAggregate(keys=[dt#26, geohash_of_latlng#27], 
functions=[partial_sum(cast(mt_cnt#28 as bigint)), partial_sum(cast(ele_cnt#29 
as bigint)), partial_sum(cast(mt_cnt#28 as bigint)), 
partial_sum(cast(mt_cnt_all#30 as bigint)), partial_sum(cast(ele_cnt#29 as 
bigint)), partial_sum(cast(ele_cnt_all#31 as bigint))])
 +- LocalTableScan [dt#26, geohash_of_latlng#27, mt_cnt#28, 
ele_cnt#29, mt_cnt_all#30, ele_cnt_all#31]
```

After:

```
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[dt#28 ASC NULLS 
FIRST,geohash_of_latlng#29 ASC NULLS FIRST], 
output=[dt#28,geohash_of_latlng#29,sum(mt_cnt)#34L,sum(ele_cnt)#35L,round((CAST((CAST((CAST(CAST(sum(CAST(mt_cnt
 AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) 
AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS 
DECIMAL(38,2)) / CAST(CAST(sum(CAST(mt_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS 
DECIMAL(38,2))), 2)#36,round((CAST((CAST((CAST(CAST(sum(CAST(ele_cnt AS 
BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS 
DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS 
DECIMAL(38,2)) / CAST(CAST(sum(CAST(ele_cnt_all AS BIGINT)) AS DECIMAL(20,0)) 
AS DECIMAL(38,2))), 2)#37])
+- *HashAggregate(keys=[dt#28, geohash_of_latlng#29], 
functions=[sum(cast(mt_cnt#30 as bigint)), sum(cast(ele_cnt#31 as bigint)), 
sum(cast(mt_cnt_all#32 as bigint)), sum(cast(ele_cnt_all#33 as bigint))])
   +- Exchange hashpartitioning(dt#28, geohash_of_latlng#29, 200)
  +- *HashAggregate(keys=[dt#28, geohash_of_latlng#29], 
functions=[partial_sum(cast(mt_cnt#30 as bigint)), partial_sum(cast(ele_cnt#31 
as bigint)), partial_sum(cast(mt_cnt_all#32 as bigint)), 
partial_sum(cast(ele_cnt_all#33 as bigint))])
 +- LocalTableScan [dt#28, geohash_of_latlng#29, mt_cnt#30, 
ele_cnt#31, mt_cnt_all#32, ele_cnt_all#33]
```


---




[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82053/
Test PASSed.


---




[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #82053 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82053/testReport)**
 for PR 18659 at commit 
[`b8ffa50`](https://github.com/apache/spark/commit/b8ffa50132d0290c0796fb99eb37fe010f56a182).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19312
  
**[Test build #82056 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82056/testReport)**
 for PR 19312 at commit 
[`aa4cbf6`](https://github.com/apache/spark/commit/aa4cbf69b080435bc836dc9820307fba6588).


---




[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...

2017-09-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19312
  
retest this please


---




[GitHub] spark issue #19194: [SPARK-20589] Allow limiting task concurrency per stage

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19194
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19194: [SPARK-20589] Allow limiting task concurrency per stage

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19194
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82052/
Test PASSed.


---




[GitHub] spark issue #19194: [SPARK-20589] Allow limiting task concurrency per stage

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19194
  
**[Test build #82052 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82052/testReport)**
 for PR 19194 at commit 
[`dcfb861`](https://github.com/apache/spark/commit/dcfb86152c36032e9ec4f1545fb84c6f24287423).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19269: [SPARK-22026][SQL][WIP] data source v2 write path

2017-09-21 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/19269#discussion_r140389681
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
 ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.writer;
+
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.sources.v2.DataSourceV2Options;
+import org.apache.spark.sql.sources.v2.WriteSupport;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A data source writer that is returned by
+ * {@link WriteSupport#createWriter(StructType, SaveMode, 
DataSourceV2Options)}.
+ * It can mix in various writing optimization interfaces to speed up the 
data saving. The actual
+ * writing logic is delegated to {@link WriteTask} that is returned by 
{@link #createWriteTask()}.
+ *
+ * The writing procedure is:
+ *   1. Create a write task by {@link #createWriteTask()}, serialize and 
send it to all the
+ *  partitions of the input data(RDD).
+ *   2. For each partition, create a data writer with the write task, and 
write the data of the
--- End diff --

Maybe you're right. If a source supports rollback for sequential tasks, that might mean it has to support rollback for concurrent tasks. I was originally thinking of a case like JDBC without transactions: an insert actually creates the rows, so rolling back concurrent tasks would delete rows from the other task. But in that case inserts are idempotent, so it wouldn't matter. I'm not sure there's a case where you can (or would) implement rollback but can't handle concurrency. Let's just leave it until someone has a use case for it.


---




[GitHub] spark pull request #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDF...

2017-09-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18659#discussion_r140389432
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2142,18 +2159,26 @@ def udf(f=None, returnType=StringType()):
 | 8|  JOHN DOE|  22|
 +--+--++
 """
-def _udf(f, returnType=StringType()):
-udf_obj = UserDefinedFunction(f, returnType)
-return udf_obj._wrapped()
+return _create_udf(f, returnType=returnType, vectorized=False)
 
-# decorator @udf, @udf() or @udf(dataType())
-if f is None or isinstance(f, (str, DataType)):
-# If DataType has been passed as a positional argument
-# for decorator use it as a returnType
-return_type = f or returnType
-return functools.partial(_udf, returnType=return_type)
+
+@since(2.3)
+def pandas_udf(f=None, returnType=StringType()):
+"""
+Creates a :class:`Column` expression representing a user defined 
function (UDF) that accepts
+`Pandas.Series` as input arguments and outputs a `Pandas.Series` of 
the same length.
+
+:param f: python function if used as a standalone function
+:param returnType: a :class:`pyspark.sql.types.DataType` object
+
+# TODO: doctest
+"""
+import inspect
+# If function "f" does not define the optional kwargs, then wrap with 
a kwargs placeholder
+if inspect.getargspec(f).keywords is None:
+return _create_udf(lambda *a, **kwargs: f(*a), 
returnType=returnType, vectorized=True)
--- End diff --

I'm fine with disallowing 0-parameter pandas UDFs, as it's not a common case. We can add it when people request it.
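
For reference, a minimal sketch of the common one-parameter case under the API proposed here (the `pandas_udf` name and signature are taken from the diff above and may still change while this PR is WIP):

```python
# Sketch only: assumes the pandas_udf API proposed in this WIP PR and an
# existing SparkSession named `spark`; the final API may differ.
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# A one-parameter vectorized UDF: it receives a pandas.Series and returns
# a pandas.Series of the same length, so the 0-parameter case never arises.
plus_one = pandas_udf(lambda s: s + 1.0, returnType=DoubleType())

df = spark.range(10).toDF("x")
df.select(plus_one(df["x"]).alias("x_plus_one")).show()
```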


---




[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...

2017-09-21 Thread marmbrus
Github user marmbrus commented on the issue:

https://github.com/apache/spark/pull/19314
  
LGTM!


---




[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19314
  
**[Test build #82055 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82055/testReport)**
 for PR 19314 at commit 
[`a4a02a6`](https://github.com/apache/spark/commit/a4a02a69bf41906c03e46c50d0eca75d6844465a).


---




[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...

2017-09-21 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19314
  
cc @marmbrus 


---




[GitHub] spark pull request #19314: [SPARK-22094][SS]processAllAvailable should check...

2017-09-21 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/19314

[SPARK-22094][SS]processAllAvailable should check the query state

## What changes were proposed in this pull request?

`processAllAvailable` should also check the query state, and if the query is stopped, it should return.
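
A minimal sketch of the intended behavior (hypothetical class and field names; the actual implementation lives in Scala's StreamExecution, so this only illustrates the shape of the check):

```python
# Sketch only: hypothetical names illustrating the behavior this PR adds,
# not the actual Spark implementation.
import threading

class QueryLoop(object):
    def __init__(self):
        self.is_active = True      # flipped to False when the query stops
        self.no_new_data = False   # set once all available data is processed
        self._cond = threading.Condition()

    def process_all_available(self):
        with self._cond:
            while True:
                # The fix: return once the query is stopped, instead of
                # blocking forever on a query that will never make progress.
                if not self.is_active:
                    return
                if self.no_new_data:
                    return
                self._cond.wait(timeout=10)
```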

## How was this patch tested?

The new unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-22094

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19314.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19314


commit b222aef2179a36a8e2b0c3556ec60b2bfb53e228
Author: Shixiong Zhu 
Date:   2017-09-22T00:02:02Z

processAllAvailable should check the query state




---




[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-21 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19294
  
+CC @weiqingy
Can you try this PR with SHC and see if it works? That is, remove your current workaround for SPARK-21549 from SHC and try writing to HBase with a Spark version patched with this PR [1]. That will allow us to have a real-world test and possibly surface issues.

@szhem, incorporating a test for the SQL part will also help here.

[1] SHC will need the workaround even if this issue is resolved, since 2.2 has already been released with this bug.


---




[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19122#discussion_r140375759
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -836,6 +836,27 @@ def test_save_load_simple_estimator(self):
 loadedModel = CrossValidatorModel.load(cvModelPath)
 self.assertEqual(loadedModel.bestModel.uid, cvModel.bestModel.uid)
 
+def test_parallel_evaluation(self):
+dataset = self.spark.createDataFrame(
+[(Vectors.dense([0.0]), 0.0),
+ (Vectors.dense([0.4]), 1.0),
+ (Vectors.dense([0.5]), 0.0),
+ (Vectors.dense([0.6]), 1.0),
+ (Vectors.dense([1.0]), 1.0)] * 10,
+["features", "label"])
+
+lr = LogisticRegression()
+grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
+evaluator = BinaryClassificationEvaluator()
+
+# test save/load of CrossValidator
+cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
evaluator=evaluator)
+cv.setParallelism(1)
+cvSerialModel = cv.fit(dataset)
+cv.setParallelism(2)
+cvParallelModel = cv.fit(dataset)
+self.assertEqual(sorted(cvSerialModel.avgMetrics), 
sorted(cvParallelModel.avgMetrics))
--- End diff --

Don't sort the metrics.  The metrics are guaranteed to be returned in the 
same order as the estimatorParamMaps, so they should match up already.
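
Applied to the test above, the suggested assertion would simply be (a sketch, using the variable names from the quoted diff):

```python
# Compare positionally instead of sorting: avgMetrics follows the order
# of estimatorParamMaps, so the two lists should already line up.
self.assertEqual(cvSerialModel.avgMetrics, cvParallelModel.avgMetrics)
```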


---




[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19122#discussion_r140375921
  
--- Diff: python/pyspark/ml/tuning.py ---
@@ -14,15 +14,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-
 import itertools
 import numpy as np
 
+from multiprocessing.pool import ThreadPool
--- End diff --

style: This should be grouped with the other 3rd-party library imports


---



