[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82065/ Test FAILed.
[GitHub] spark issue #19315: Updated english.txt word ordering
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19315 @animenon can you please fix the PR title like other PRs did? Also, is this only for better readability, or does it fix any other issue? IMO the previous txt was more readable than your change, since the words were grouped by kind.
[GitHub] spark issue #18015: [SPARK-20785][WEB-UI][SQL]Spark should provide jump link...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18015 **[Test build #82064 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82064/testReport)** for PR 18015 at commit [`21e2c31`](https://github.com/apache/spark/commit/21e2c31369b2223d0bee16b9bc98373ab0ec59a9).
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user logannc commented on the issue: https://github.com/apache/spark/pull/18945 I've continued to use @HyukjinKwon 's suggestion because it should be more performant and is capable of handling it without loss of precision. I believe I've addressed your concerns by only changing the type when we encounter a null (duh).
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82063 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82063/testReport)** for PR 18945 at commit [`bd25923`](https://github.com/apache/spark/commit/bd259239c550b0b19311968aff9a69da29a6a05e).
[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19301#discussion_r140416279

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala ---

@@ -72,11 +74,19 @@ object AggregateExpression {
       aggregateFunction: AggregateFunction,
       mode: AggregateMode,
       isDistinct: Boolean): AggregateExpression = {
+    val state = if (aggregateFunction.resolved) {
+      Seq(aggregateFunction.toString, aggregateFunction.dataType,
+        aggregateFunction.nullable, mode, isDistinct)
+    } else {
+      Seq(aggregateFunction.toString, mode, isDistinct)
+    }
+    val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
+
     AggregateExpression(
       aggregateFunction,
       mode,
       isDistinct,
-      NamedExpression.newExprId)
+      ExprId(hashCode))

--- End diff --

I don't think this is the right fix. Semantically, the `b0` and `b1` in `SELECT SUM(b) AS b0, SUM(b) AS b1` are different aggregate functions, so they should have different `resultId`s. It's kind of an optimization in the aggregate planner: we should detect these semantically different but duplicated aggregate functions and only plan one aggregate function.
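A minimal Python sketch of the alternative cloud-fan describes, not Spark's actual planner code (the names `plan_aggregates` and `new_expr_id` and the tuple encoding of an aggregate are illustrative assumptions): every occurrence of an aggregate keeps its own globally unique result id, while the planner deduplicates semantically equal aggregates and computes each one only once.

```python
import itertools

_ids = itertools.count()

def new_expr_id():
    # globally unique per occurrence, in the spirit of NamedExpression.newExprId
    return next(_ids)

def plan_aggregates(agg_exprs):
    planned = {}   # semantic key -> id of the single planned computation
    bindings = []  # (per-occurrence result id, shared computation id)
    for func, mode, is_distinct in agg_exprs:
        key = (func, mode, is_distinct)   # semantic identity, not object identity
        if key not in planned:
            planned[key] = new_expr_id()  # plan SUM(b) only once
        bindings.append((new_expr_id(), planned[key]))
    return planned, bindings

# SELECT SUM(b) AS b0, SUM(b) AS b1: two distinct result ids, one planned aggregate.
print(plan_aggregates([("sum(b)", "Complete", False),
                       ("sum(b)", "Complete", False)]))
```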
[GitHub] spark issue #18015: [SPARK-20785][WEB-UI][SQL]Spark should provide jump link...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/18015 Yes, I'm fine with it. @ajbozarth would you please take another look at this PR? Thanks.
[GitHub] spark issue #18015: [SPARK-20785][WEB-UI][SQL]Spark should provide jump link...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/18015 Jenkins, retest this please.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578 @mallman how about adding a comment explaining why such a workaround was done, plus the bug number in parquet-mr? That way, once that bug is fixed, the code can be cleaned up. Also, maybe it's time to remove "DO NOT MERGE" from the title? As I understand it, most of the comments were addressed :) Thank you very much for your work on this feature. I must admit that we are looking forward to having this merged. For us this will be the most important improvement in Spark 2.3.0 (I hope it will be part of 2.3.0 :) )
[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL]Spark should provide ju...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r140416046

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---

@@ -61,7 +59,37 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
         details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
       }}

-    UIUtils.headerSparkPage("SQL", content, parent, Some(5000))
+
+    val summary: NodeSeq =
+      {
+        if (listener.getRunningExecutions.nonEmpty) {
+          Running Queries:
+          {listener.getRunningExecutions.size}
+        }
+      }
+      {
+        if (listener.getCompletedExecutions.nonEmpty) {
+          Completed Queries:
+          {listener.getCompletedExecutions.size}
+        }
+      }
+      {
+        if (listener.getFailedExecutions.nonEmpty) {
+          Failed Queries:
+          {listener.getFailedExecutions.size}
+        }
+      }

--- End diff --

Is the indentation here correct? This seems a little weird to me.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82062/ Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82062/testReport)** for PR 18945 at commit [`6e248dd`](https://github.com/apache/spark/commit/6e248ddf96122910468a3f20125ff4fc9f32f299).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140415073

--- Diff: python/pyspark/sql/dataframe.py ---

@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32

--- End diff --

I will take my suggestion back. I think their suggestions are better than mine.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140414783

--- Diff: python/pyspark/sql/dataframe.py ---

@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32

--- End diff --

Ah, I see where I got confused. I had started with @ueshin 's suggestion but abandoned it because I didn't want to create the DataFrame before the type correction, since I was also looking at @HyukjinKwon 's suggestion. I somehow ended up combining them incorrectly.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82062/testReport)** for PR 18945 at commit [`6e248dd`](https://github.com/apache/spark/commit/6e248ddf96122910468a3f20125ff4fc9f32f299).
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140414202

--- Diff: python/pyspark/sql/dataframe.py ---

@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32

--- End diff --

A simple problem with this line: even if this condition is met, it doesn't necessarily mean there are null values in the column. But you forcibly set the type to np.float32.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140414042

--- Diff: python/pyspark/sql/dataframe.py ---

@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32

--- End diff --

Have you carefully read the comments in https://github.com/apache/spark/pull/18945#discussion_r134033952 and https://github.com/apache/spark/pull/18945#discussion_r134925269? They are good suggestions for this issue. I don't know why you don't want to follow them and check for null values with Pandas...
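A small sketch of the behavior those suggestions rely on, assuming only stock pandas/NumPy (the column names here are made up): pandas already widens a nullable int column to float64, so columns that actually contain nulls can be detected after the DataFrame is built, instead of rewriting every row and forcing np.float32 up front.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, None], "b": [1, 2, 3]})
print(pdf.dtypes)               # a: float64 (contains a null), b: int64 (no nulls)
print(pdf["a"].isnull().any())  # True -> only column "a" actually needs a float dtype
# A column with no nulls can keep (or be cast back to) an integer dtype:
pdf["b"] = pdf["b"].astype(np.int32)
print(pdf.dtypes)
```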
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140413579

--- Diff: python/pyspark/sql/dataframe.py ---

@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32

--- End diff --

Can you elaborate? I believe it is, per my reply to your comment in the `null_handler`.
[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19204
[GitHub] spark issue #19204: [SPARK-21981][PYTHON][ML] Added Python interface for Clu...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/19204 Merged into master, thanks.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82061 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82061/testReport)** for PR 18945 at commit [`14f36c3`](https://github.com/apache/spark/commit/14f36c354f65a34e3e06cd4d35029e5f8f2b79f0).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82061/ Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82061 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82061/testReport)** for PR 18945 at commit [`14f36c3`](https://github.com/apache/spark/commit/14f36c354f65a34e3e06cd4d35029e5f8f2b79f0).
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140412857

--- Diff: python/pyspark/sql/dataframe.py ---

@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = {}
+
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:

--- End diff --

If `pandas_type in (np.int8, np.int16, np.int32) and field.nullable` and there are ANY non-null values, the dtype of the column is changed to `np.float32` or `np.float64`, both of which properly handle `None` values. That said, if the entire column was `None`, it would fail. Therefore I have preemptively changed the type on line 1787 to `np.float32`. Per `null_handler`, it may still change to `np.float64` if needed.
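For reference, the 16777216 threshold in the quoted diff is 2**24, the last integer magnitude that np.float32 (24-bit significand) always represents exactly; a quick NumPy check:

```python
import numpy as np

print(np.float32(16777216) == np.float32(16777217))  # True -> float32 lost precision
print(np.float64(16777216) == np.float64(16777217))  # False -> float64 is still exact
print(int(np.float32(16777217)))                     # 16777216: 2**24 + 1 rounds down
```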
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140412632

--- Diff: python/pyspark/sql/dataframe.py ---

@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32

--- End diff --

I don't think this is a correct fix.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82060 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82060/testReport)** for PR 18945 at commit [`b313a3b`](https://github.com/apache/spark/commit/b313a3b8fc88898423940f195ab16bd3a57c0061).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82060/ Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82060 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82060/testReport)** for PR 18945 at commit [`b313a3b`](https://github.com/apache/spark/commit/b313a3b8fc88898423940f195ab16bd3a57c0061).
[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19229#discussion_r140412254

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---

@@ -2102,6 +2102,55 @@ class Dataset[T] private[sql](
   }

   /**
+   * Returns a new Dataset by adding columns or replacing the existing columns that has
+   * the same names.
+   */
+  private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {

--- End diff --

@cloud-fan looked at this `withColumns` before, in #17819. cc @cloud-fan to see if you have more comments.
[GitHub] spark pull request #19314: [SPARK-22094][SS]processAllAvailable should check...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19314
[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19229 ping @zhengruifeng @WeichenXu123 Any more comments on this? Thanks.
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19314 Thanks! Merging to master and branch-2.2
[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 I initially did this, for example,

```
\href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{Spark Data Types} for available data types.
```

This passes the lint and the doc is fine, but the CRAN check fails. I actually tried to find a way around it but ended up with `nolint`. ... will try to read the doc one more time and try out a few more cases locally.
[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 Doh, you mean the current status. Yes, I checked.
[GitHub] spark pull request #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19312#discussion_r140410448

--- Diff: dev/create-release/release-build.sh ---

@@ -95,6 +95,28 @@ if [ -z "$SPARK_VERSION" ]; then
     | grep -v INFO | grep -v WARNING | grep -v Download)
 fi

+# Verify we have the right java version set
+java_version=$("${JAVA_HOME}"/bin/javac -version 2>&1 | cut -d " " -f 2)

--- End diff --

@holdenk, should we maybe catch the case when `JAVA_HOME` is missing too? If so, I think we could do something like ...

```bash
if [ -z "$JAVA_HOME" ]; then
  echo "Please set JAVA_HOME."
  exit 1
fi
...
```

Or maybe...

```bash
if [[ -x "$JAVA_HOME/bin/javac" ]]; then
  javac_cmd="$JAVA_HOME/bin/javac"
else
  javac_cmd=javac
fi

java_version=$("$javac_cmd" -version 2>&1 | cut -d " " -f 2)
...
```

I tested both locally.
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19318 **[Test build #82059 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82059/testReport)** for PR 19318 at commit [`efb0fe9`](https://github.com/apache/spark/commit/efb0fe9c0544d8666c423ba9bde533735961ea75).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19318 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82059/ Test FAILed.
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19318 Merged build finished. Test FAILed.
[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19290 btw, could you check (if you haven't already) whether roxygen is going to handle the `nolint` around the `http` link correctly?
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19318 **[Test build #82059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82059/testReport)** for PR 19318 at commit [`efb0fe9`](https://github.com/apache/spark/commit/efb0fe9c0544d8666c423ba9bde533735961ea75).
[GitHub] spark pull request #19318: [SPARK-22096][ML] use aggregateByKeyLocally in fe...
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/19318 [SPARK-22096][ML] use aggregateByKeyLocally in feature frequency calc…

## What changes were proposed in this pull request?

NaiveBayes currently takes aggregateByKey followed by a collect to calculate the frequency for each feature/label. We can implement a new function 'aggregateByKeyLocally' in RDD that merges locally on each mapper before sending results to a reducer, to save one stage. We tested on NaiveBayes and see ~20% performance gain with these changes.

Signed-off-by: Vincent Xie

## How was this patch tested?

existing test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VinceShieh/spark SPARK-22096

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19318.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19318

commit efb0fe9c0544d8666c423ba9bde533735961ea75
Author: Vincent Xie
Date: 2017-09-22T03:57:08Z

[SPARK-22096][ML] use aggregateByKeyLocally in feature frequency calculation

NaiveBayes currently takes aggregateByKey followed by a collect to calculate the frequency for each feature/label. We can implement a new function 'aggregateByKeyLocally' in RDD that merges locally on each mapper before sending results to a reducer, to save one stage. We tested on NaiveBayes and see ~20% performance gain with these changes.

Signed-off-by: Vincent Xie
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19122 Merged build finished. Test PASSed.
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19122 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82058/ Test PASSed.
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19122 **[Test build #82058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82058/testReport)** for PR 19122 at commit [`3464dfe`](https://github.com/apache/spark/commit/3464dfea1f008e945a5e608b593877d1cbdf0e35).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19317 Can one of the admins verify this patch?
[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19317 cc @VinceShieh
[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...
GitHub user ConeyLiu opened a pull request: https://github.com/apache/spark/pull/19317 [SPARK-22098][CORE] Add new method aggregateByKeyLocally in RDD

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-22096

NaiveBayes currently takes aggregateByKey followed by a collect to calculate the frequency for each feature/label. We can implement a new function 'aggregateByKeyLocally' in RDD that merges locally on each mapper before sending results to a reducer, to save one stage. We tested on NaiveBayes and see ~20% performance gain with these changes. This is a subtask of our improvement.

## How was this patch tested?

New UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ConeyLiu/spark aggregatebykeylocally

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19317.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19317

commit 73a85dc5963ac46f181a9499deabb18da4ccc308
Author: Xianyang Liu
Date: 2017-08-31T05:16:09Z

add new method 'aggregateByKeyLocally'
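A rough PySpark sketch of the idea in both PRs, assuming an immutable zero value (this is illustrative Python, not the Scala RDD method being proposed, and `aggregate_by_key_locally` is a hypothetical name): merge values per key inside each partition first, so each mapper ships back one small map and the separate reduce stage is avoided.

```python
from collections import defaultdict
from pyspark import SparkContext

def aggregate_by_key_locally(rdd, zero_value, seq_op, comb_op):
    # Merge within each partition; one dict per partition is collected,
    # instead of shuffling every (key, value) pair through a reduce stage.
    def merge_partition(iterator):
        acc = defaultdict(lambda: zero_value)  # assumes zero_value is immutable
        for k, v in iterator:
            acc[k] = seq_op(acc[k], v)
        yield dict(acc)

    result = {}
    for partial in rdd.mapPartitions(merge_partition).collect():
        for k, v in partial.items():
            result[k] = comb_op(result[k], v) if k in result else v
    return result

if __name__ == "__main__":
    sc = SparkContext("local[2]", "aggregateByKeyLocally-sketch")
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)
    # feature/label frequency-style counting, as in the NaiveBayes use case
    print(aggregate_by_key_locally(pairs, 0, lambda acc, v: acc + v,
                                   lambda x, y: x + y))  # {'a': 4, 'b': 2}
    sc.stop()
```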
[GitHub] spark issue #19316: [SPARK-22097][CORE]Call serializationStream.close after ...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19316 @cloud-fan Pls take a look. Thanks a lot.
[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19312 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82056/ Test PASSed.
[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19312 Merged build finished. Test PASSed.
[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...
Github user ConeyLiu commented on a diff in the pull request: https://github.com/apache/spark/pull/19316#discussion_r140408246

--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---

@@ -387,11 +387,18 @@ private[spark] class MemoryStore(
     // the block's actual memory usage has exceeded the unroll memory by a small amount, so we
     // perform one final call to attempt to allocate additional memory if necessary.
     if (keepUnrolling) {
-      serializationStream.close()
-      reserveAdditionalMemoryIfNecessary()
+      serializationStream.flush()
+      if (bbos.size > unrollMemoryUsedByThisBlock) {
+        val amountToRequest = bbos.size - unrollMemoryUsedByThisBlock
+        keepUnrolling = reserveUnrollMemoryForThisTask(blockId, amountToRequest, memoryMode)
+        if (keepUnrolling) {
+          unrollMemoryUsedByThisBlock += amountToRequest
+        }
+      }
     }

     if (keepUnrolling) {
+      serializationStream.close()

--- End diff --

Here, we should close the `serializationStream` after we check it again. In the previous code we closed it first and then requested the extra memory, so there was a potential problem that we couldn't request enough memory while the `serializationStream` was already closed.
[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19312 **[Test build #82056 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82056/testReport)** for PR 19312 at commit [`aa4cbf6`](https://github.com/apache/spark/commit/aa4cbf69b080435bc836dc9820307fba6588).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...
Github user ConeyLiu commented on a diff in the pull request: https://github.com/apache/spark/pull/19316#discussion_r140408116

--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---

@@ -387,11 +387,18 @@ private[spark] class MemoryStore(
     // the block's actual memory usage has exceeded the unroll memory by a small amount, so we
     // perform one final call to attempt to allocate additional memory if necessary.
     if (keepUnrolling) {
-      serializationStream.close()
-      reserveAdditionalMemoryIfNecessary()
+      serializationStream.flush()
+      if (bbos.size > unrollMemoryUsedByThisBlock) {
+        val amountToRequest = bbos.size - unrollMemoryUsedByThisBlock

--- End diff --

Here, we only need to request `bbos.size - unrollMemoryUsedByThisBlock`. I'm sorry, this mistake may have been introduced by my previous patch [SPARK-21923](https://github.com/apache/spark/pull/19135).
[GitHub] spark issue #19316: [SPARK-22097][CORE]Call serializationStream.close after ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19316 Can one of the admins verify this patch?
[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...
GitHub user ConeyLiu opened a pull request: https://github.com/apache/spark/pull/19316 [SPARK-22097][CORE]Call serializationStream.close after we requested enough memory

## What changes were proposed in this pull request?

In the current code, we close the `serializationStream` after we have unrolled the block. However, there is a potential problem: the size of the underlying vector or stream may be larger than the memory we requested. So here, we need to check it again carefully.

## How was this patch tested?

Existing UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ConeyLiu/spark putIteratorAsBytes

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19316.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19316

commit bfe162e3aad300414dcc3fe25a3d70025e1795dd
Author: Xianyang Liu
Date: 2017-09-22T03:29:39Z

close the serializationStream after check the memory request
[GitHub] spark issue #19168: [SPARK-21956][CORE] Fetch up to max bytes when buf reall...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19168 Sorry for replying so late. I added some benchmark testing for this PR, @kiszk. And @jerryshao, could you help review this PR? Thanks

```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 10 iterations, 2423 ms
  Running case: Testing fetch after releasing!
  Stopped after 18 iterations, 2036 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
Testing fetch before releasing!                           46 /  242      345.0          2.9      1.0X
Testing fetch after releasing!                            73 /  113      215.7          4.6      0.6X
```

```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 10 iterations, 3888 ms
  Running case: Testing fetch after releasing!
  Stopped after 10 iterations, 3970 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
Testing fetch before releasing!                          100 /  389      157.8          6.3      1.0X
Testing fetch after releasing!                           151 /  397      104.3          9.6      0.7X
```

```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 15 iterations, 2016 ms
  Running case: Testing fetch after releasing!
  Stopped after 14 iterations, 2110 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
Testing fetch before releasing!                           43 /  134      363.8          2.7      1.0X
Testing fetch after releasing!                            99 /  151      158.1          6.3      0.4X
```
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19278 Merged build finished. Test PASSed.
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19278 **[Test build #82057 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82057/testReport)** for PR 19278 at commit [`8f78f59`](https://github.com/apache/spark/commit/8f78f596473877f3e8a0169f998f16a6bf1a8f5a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19278 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82057/ Test PASSed.
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19278 @jkbradley Sure, I tested the backwards compatibility. Part of the reason I changed to `DefaultParamReader.getAndSetParams` is backwards compatibility.
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19122 **[Test build #82058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82058/testReport)** for PR 19122 at commit [`3464dfe`](https://github.com/apache/spark/commit/3464dfea1f008e945a5e608b593877d1cbdf0e35).
[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r140402700

--- Diff: python/pyspark/ml/tests.py ---

@@ -836,6 +836,27 @@ def test_save_load_simple_estimator(self):
         loadedModel = CrossValidatorModel.load(cvModelPath)
         self.assertEqual(loadedModel.bestModel.uid, cvModel.bestModel.uid)

+    def test_parallel_evaluation(self):
+        dataset = self.spark.createDataFrame(
+            [(Vectors.dense([0.0]), 0.0),
+             (Vectors.dense([0.4]), 1.0),
+             (Vectors.dense([0.5]), 0.0),
+             (Vectors.dense([0.6]), 1.0),
+             (Vectors.dense([1.0]), 1.0)] * 10,
+            ["features", "label"])
+
+        lr = LogisticRegression()
+        grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
+        evaluator = BinaryClassificationEvaluator()
+
+        # test save/load of CrossValidator
+        cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
+        cv.setParallelism(1)
+        cvSerialModel = cv.fit(dataset)
+        cv.setParallelism(2)
+        cvParallelModel = cv.fit(dataset)
+        self.assertEqual(sorted(cvSerialModel.avgMetrics), sorted(cvParallelModel.avgMetrics))

--- End diff --

hmm... I tried. But how do I get the model parents?
[GitHub] spark issue #19315: Updated english.txt word ordering
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19315 Can one of the admins verify this patch?
[GitHub] spark pull request #19315: Updated english.txt word ordering
GitHub user animenon opened a pull request: https://github.com/apache/spark/pull/19315 Updated english.txt word ordering

Ordered alphabetically, for better readability.

## What changes were proposed in this pull request?

Alphabetical ordering of the stop words.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/animenon/spark patch-1

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19315.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19315

commit 57c282721c63487a82bdd6959c6ff5f6ce9f66ad
Author: Anirudh
Date: 2017-09-22T02:40:30Z

Updated english.txt word ordering

Ordered alphabetically, for better readability.
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19314 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82055/ Test PASSed.
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19314 Merged build finished. Test PASSed.
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19314 **[Test build #82055 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82055/testReport)** for PR 19314 at commit [`a4a02a6`](https://github.com/apache/spark/commit/a4a02a69bf41906c03e46c50d0eca75d6844465a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13794: [SPARK-15574][ML][PySpark] Python meta-algorithms in Sca...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13794 cc @srowen Can you help close this? We won't need this feature for now.
[GitHub] spark issue #19279: [SPARK-22061] [ML]add pipeline model of SVM
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19279 cc @srowen Can you help close this?
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19278 **[Test build #82057 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82057/testReport)** for PR 19278 at commit [`8f78f59`](https://github.com/apache/spark/commit/8f78f596473877f3e8a0169f998f16a6bf1a8f5a).
[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r140398933

--- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---

@@ -159,12 +159,15 @@ class CrossValidatorSuite
       .setEvaluator(evaluator)
       .setNumFolds(20)
       .setEstimatorParamMaps(paramMaps)
+      .setSeed(42L)
+      .setParallelism(2)

--- End diff --

ditto.
[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r140398857

--- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala ---

@@ -160,11 +160,13 @@ class TrainValidationSplitSuite
       .setTrainRatio(0.5)
       .setEstimatorParamMaps(paramMaps)
       .setSeed(42L)
+      .setParallelism(2)

--- End diff --

No. The model does not own a `parallel` parameter. This was discussed before.
[GitHub] spark issue #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calculate i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19281 Merged build finished. Test PASSed.
[GitHub] spark issue #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calculate i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19281 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82054/ Test PASSed.
[GitHub] spark issue #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calculate i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19281 **[Test build #82054 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82054/testReport)** for PR 19281 at commit [`78f41e0`](https://github.com/apache/spark/commit/78f41e046ea4e307e43315b23ed3212e1f4c3f1b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19293: [SPARK-22079][SQL] Serializer in HiveOutputWriter miss l...
Github user LantaoJin commented on the issue: https://github.com/apache/spark/pull/19293 @gatorsmile @cloud-fan Please help to review
[GitHub] spark pull request #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDF...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18659#discussion_r140396700

--- Diff: python/pyspark/serializers.py ---

@@ -199,6 +211,55 @@ def __repr__(self):
         return "ArrowSerializer"


+class ArrowPandasSerializer(ArrowSerializer):
+    """
+    Serializes Pandas.Series as Arrow data.
+    """
+
+    def __init__(self):
+        super(ArrowPandasSerializer, self).__init__()

--- End diff --

Do we need this?
[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/19301

@cenyuhai This is an optimization of the physical plan, and your case can be optimized:

```sql
select dt, geohash_of_latlng, sum(mt_cnt), sum(ele_cnt),
       round(sum(mt_cnt) * 1.0 * 100 / sum(mt_cnt_all), 2),
       round(sum(ele_cnt) * 1.0 * 100 / sum(ele_cnt_all), 2)
from values(1, 2, 3, 4, 5, 6) as (dt, geohash_of_latlng, mt_cnt, ele_cnt, mt_cnt_all, ele_cnt_all)
group by dt, geohash_of_latlng
order by dt, geohash_of_latlng
limit 10
```

Before (note that `sum(cast(mt_cnt ...))` and `sum(cast(ele_cnt ...))` each appear twice in the `HashAggregate` function lists):

```
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[dt#26 ASC NULLS FIRST,geohash_of_latlng#27 ASC NULLS FIRST], output=[dt#26,geohash_of_latlng#27,sum(mt_cnt)#38L,sum(ele_cnt)#39L,round((CAST((CAST((CAST(CAST(sum(CAST(mt_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(mt_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#40,round((CAST((CAST((CAST(CAST(sum(CAST(ele_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(ele_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#41])
+- *HashAggregate(keys=[dt#26, geohash_of_latlng#27], functions=[sum(cast(mt_cnt#28 as bigint)), sum(cast(ele_cnt#29 as bigint)), sum(cast(mt_cnt#28 as bigint)), sum(cast(mt_cnt_all#30 as bigint)), sum(cast(ele_cnt#29 as bigint)), sum(cast(ele_cnt_all#31 as bigint))])
   +- Exchange hashpartitioning(dt#26, geohash_of_latlng#27, 200)
      +- *HashAggregate(keys=[dt#26, geohash_of_latlng#27], functions=[partial_sum(cast(mt_cnt#28 as bigint)), partial_sum(cast(ele_cnt#29 as bigint)), partial_sum(cast(mt_cnt#28 as bigint)), partial_sum(cast(mt_cnt_all#30 as bigint)), partial_sum(cast(ele_cnt#29 as bigint)), partial_sum(cast(ele_cnt_all#31 as bigint))])
         +- LocalTableScan [dt#26, geohash_of_latlng#27, mt_cnt#28, ele_cnt#29, mt_cnt_all#30, ele_cnt_all#31]
```

After:

```
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[dt#28 ASC NULLS FIRST,geohash_of_latlng#29 ASC NULLS FIRST], output=[dt#28,geohash_of_latlng#29,sum(mt_cnt)#34L,sum(ele_cnt)#35L,round((CAST((CAST((CAST(CAST(sum(CAST(mt_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(mt_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#36,round((CAST((CAST((CAST(CAST(sum(CAST(ele_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(ele_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#37])
+- *HashAggregate(keys=[dt#28, geohash_of_latlng#29], functions=[sum(cast(mt_cnt#30 as bigint)), sum(cast(ele_cnt#31 as bigint)), sum(cast(mt_cnt_all#32 as bigint)), sum(cast(ele_cnt_all#33 as bigint))])
   +- Exchange hashpartitioning(dt#28, geohash_of_latlng#29, 200)
      +- *HashAggregate(keys=[dt#28, geohash_of_latlng#29], functions=[partial_sum(cast(mt_cnt#30 as bigint)), partial_sum(cast(ele_cnt#31 as bigint)), partial_sum(cast(mt_cnt_all#32 as bigint)), partial_sum(cast(ele_cnt_all#33 as bigint))])
         +- LocalTableScan [dt#28, geohash_of_latlng#29, mt_cnt#30, ele_cnt#31, mt_cnt_all#32, ele_cnt_all#33]
```
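To observe the deduplication locally, the query above can be run in a PySpark shell and the physical plan inspected with `explain()`. This is an illustrative sketch (the session setup is assumed, not part of the PR); the thing to look for is how many `sum(...)` entries appear in the `functions=[...]` list of each `HashAggregate` node:

```python
# Run the query from the comment above and print its physical plan.
# Before the patch, sum(mt_cnt) and sum(ele_cnt) each appear twice in the
# HashAggregate functions list; after the patch, each appears only once.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SPARK-22084-check").getOrCreate()

query = """
select dt, geohash_of_latlng, sum(mt_cnt), sum(ele_cnt),
       round(sum(mt_cnt) * 1.0 * 100 / sum(mt_cnt_all), 2),
       round(sum(ele_cnt) * 1.0 * 100 / sum(ele_cnt_all), 2)
from values(1, 2, 3, 4, 5, 6) as (dt, geohash_of_latlng, mt_cnt, ele_cnt, mt_cnt_all, ele_cnt_all)
group by dt, geohash_of_latlng
order by dt, geohash_of_latlng
limit 10
"""

spark.sql(query).explain()  # prints the physical plan shown above
```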
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82053/ Test PASSed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test PASSed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #82053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82053/testReport)** for PR 18659 at commit [`b8ffa50`](https://github.com/apache/spark/commit/b8ffa50132d0290c0796fb99eb37fe010f56a182).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19312 **[Test build #82056 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82056/testReport)** for PR 19312 at commit [`aa4cbf6`](https://github.com/apache/spark/commit/aa4cbf69b080435bc836dc9820307fba6588).
[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19312 retest this please
[GitHub] spark issue #19194: [SPARK-20589] Allow limiting task concurrency per stage
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19194 Merged build finished. Test PASSed.
[GitHub] spark issue #19194: [SPARK-20589] Allow limiting task concurrency per stage
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19194 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82052/ Test PASSed.
[GitHub] spark issue #19194: [SPARK-20589] Allow limiting task concurrency per stage
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19194 **[Test build #82052 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82052/testReport)** for PR 19194 at commit [`dcfb861`](https://github.com/apache/spark/commit/dcfb86152c36032e9ec4f1545fb84c6f24287423).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19269: [SPARK-22026][SQL][WIP] data source v2 write path
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r140389681

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---

```
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.writer;
+
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.sources.v2.DataSourceV2Options;
+import org.apache.spark.sql.sources.v2.WriteSupport;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A data source writer that is returned by
+ * {@link WriteSupport#createWriter(StructType, SaveMode, DataSourceV2Options)}.
+ * It can mix in various writing optimization interfaces to speed up the data saving. The actual
+ * writing logic is delegated to {@link WriteTask} that is returned by {@link #createWriteTask()}.
+ *
+ * The writing procedure is:
+ * 1. Create a write task by {@link #createWriteTask()}, serialize and send it to all the
+ *    partitions of the input data(RDD).
+ * 2. For each partition, create a data writer with the write task, and write the data of the
```

--- End diff --

Maybe you're right. If a source supports rollback for sequential tasks, then that might mean it has to support rollback for concurrent tasks. I was originally thinking of a case like JDBC without transactions: an insert actually creates the rows, so rolling back concurrent tasks would delete rows from the other task. But in that case, inserts are idempotent, so it wouldn't matter. I'm not sure there's a case where you can (or would) implement rollback but can't handle concurrency. Let's just leave it until someone has a use case for it.
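To make the idempotency point concrete, here is a small illustrative sketch; it is plain Python standing in for a keyed store, not the DataSourceV2 API or real JDBC. When every row is written under a deterministic key, a retried task overwrites its own earlier rows, so no explicit rollback is needed and concurrent attempts cannot corrupt each other:

```python
# Illustration of the "inserts are idempotent" argument: rows are keyed by
# (partition id, row position), so replaying a failed write attempt simply
# overwrites the rows that attempt already produced, instead of duplicating
# them. A dict stands in for a keyed table without transactions.
table = {}

def write_partition(partition_id, rows):
    for pos, row in enumerate(rows):
        table[(partition_id, pos)] = row  # upsert keyed by (partition, position)

write_partition(0, ["a", "b"])       # first attempt dies after two rows
write_partition(0, ["a", "b", "c"])  # the retry rewrites the same keys
assert sorted(table.values()) == ["a", "b", "c"]  # no duplicates, no rollback
```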
[GitHub] spark pull request #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDF...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18659#discussion_r140389432

--- Diff: python/pyspark/sql/functions.py ---

```
@@ -2142,18 +2159,26 @@ def udf(f=None, returnType=StringType()):
     | 8| JOHN DOE| 22|
     +--+--++
     """
-    def _udf(f, returnType=StringType()):
-        udf_obj = UserDefinedFunction(f, returnType)
-        return udf_obj._wrapped()
+    return _create_udf(f, returnType=returnType, vectorized=False)
 
-    # decorator @udf, @udf() or @udf(dataType())
-    if f is None or isinstance(f, (str, DataType)):
-        # If DataType has been passed as a positional argument
-        # for decorator use it as a returnType
-        return_type = f or returnType
-        return functools.partial(_udf, returnType=return_type)
+
+@since(2.3)
+def pandas_udf(f=None, returnType=StringType()):
+    """
+    Creates a :class:`Column` expression representing a user defined function (UDF) that accepts
+    `Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length.
+
+    :param f: python function if used as a standalone function
+    :param returnType: a :class:`pyspark.sql.types.DataType` object
+
+    # TODO: doctest
+    """
+    import inspect
+    # If function "f" does not define the optional kwargs, then wrap with a kwargs placeholder
+    if inspect.getargspec(f).keywords is None:
+        return _create_udf(lambda *a, **kwargs: f(*a), returnType=returnType, vectorized=True)
```

--- End diff --

I'm fine with disallowing 0-parameter pandas udfs, as it's not a common case. We can add it when people request it.
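For readers following along, here is a rough usage sketch of the vectorized UDF under discussion, based on the signature quoted in the diff. The session, data, and column names are illustrative, the API is still WIP, and pandas must be available on the workers:

```python
# Sketch of the proposed pandas_udf: the wrapped function receives its
# arguments as pandas.Series and must return a pandas.Series of the same
# length, so the arithmetic runs vectorized instead of one row at a time.
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

plus_one = pandas_udf(lambda v: v + 1.0, returnType=DoubleType())

df = spark.range(0, 4).selectExpr("cast(id as double) as x")
df.select(plus_one(df["x"])).show()  # each batch of x arrives as one Series
```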
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/19314 LGTM!
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19314 **[Test build #82055 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82055/testReport)** for PR 19314 at commit [`a4a02a6`](https://github.com/apache/spark/commit/a4a02a69bf41906c03e46c50d0eca75d6844465a).
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19314 cc @marmbrus
[GitHub] spark pull request #19314: [SPARK-22094][SS]processAllAvailable should check...
GitHub user zsxwing opened a pull request:

    https://github.com/apache/spark/pull/19314

    [SPARK-22094][SS]processAllAvailable should check the query state

## What changes were proposed in this pull request?

`processAllAvailable` should also check the query state, and if the query is stopped, it should return.

## How was this patch tested?

The new unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zsxwing/spark SPARK-22094

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19314.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19314

commit b222aef2179a36a8e2b0c3556ec60b2bfb53e228
Author: Shixiong Zhu
Date: 2017-09-22T00:02:02Z

    processAllAvailable should check the query state
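The scenario being fixed can be pictured with a small PySpark sketch (illustrative only; the PR's actual test is a Scala unit test, and the source/sink choices here are arbitrary): stop the query from another thread while `processAllAvailable()` is blocked, and the call should return instead of hanging.

```python
# Without the fix, processAllAvailable() can block forever once the query is
# stopped from another thread; with the fix it notices the state change and
# returns. Requires a Spark build containing this patch to behave as described.
import threading
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SPARK-22094-demo").getOrCreate()

query = (spark.readStream.format("rate").load()   # toy always-on source
         .writeStream.format("memory").queryName("demo").start())

threading.Timer(2.0, query.stop).start()  # stop the query from another thread
query.processAllAvailable()               # should return once the query stops
```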
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19294

+CC @weiqingy Can you try this PR with SHC and see if it works? That is, remove your current workaround for SPARK-21549 from SHC and try writing to HBase with a Spark version patched with this PR [1]. That will give us a real-world test and possibly surface issues. @szhem, incorporating a test for the SQL part will also help in this matter.

[1] SHC will need the workaround even if this issue is resolved, since 2.2 has already been released with this bug.
[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r140375759

--- Diff: python/pyspark/ml/tests.py ---

```
@@ -836,6 +836,27 @@ def test_save_load_simple_estimator(self):
         loadedModel = CrossValidatorModel.load(cvModelPath)
         self.assertEqual(loadedModel.bestModel.uid, cvModel.bestModel.uid)
 
+    def test_parallel_evaluation(self):
+        dataset = self.spark.createDataFrame(
+            [(Vectors.dense([0.0]), 0.0),
+             (Vectors.dense([0.4]), 1.0),
+             (Vectors.dense([0.5]), 0.0),
+             (Vectors.dense([0.6]), 1.0),
+             (Vectors.dense([1.0]), 1.0)] * 10,
+            ["features", "label"])
+
+        lr = LogisticRegression()
+        grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
+        evaluator = BinaryClassificationEvaluator()
+
+        # test save/load of CrossValidator
+        cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
+        cv.setParallelism(1)
+        cvSerialModel = cv.fit(dataset)
+        cv.setParallelism(2)
+        cvParallelModel = cv.fit(dataset)
+        self.assertEqual(sorted(cvSerialModel.avgMetrics), sorted(cvParallelModel.avgMetrics))
```

--- End diff --

Don't sort the metrics. The metrics are guaranteed to be returned in the same order as the estimatorParamMaps, so they should match up already.
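Rendered as a diff against the quoted test (my rendering of the suggestion, not a commit from the PR), the assertion would become a positional comparison:

```
-        self.assertEqual(sorted(cvSerialModel.avgMetrics), sorted(cvParallelModel.avgMetrics))
+        self.assertEqual(cvSerialModel.avgMetrics, cvParallelModel.avgMetrics)
```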
[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r140375921

--- Diff: python/pyspark/ml/tuning.py ---

```
@@ -14,15 +14,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-
 import itertools
 import numpy as np
+from multiprocessing.pool import ThreadPool
```

--- End diff --

style: This should be grouped with the other 3rd-party library imports.
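One possible regrouping that satisfies the comment (illustrative; strictly speaking `multiprocessing` is standard library, so it could equally be grouped with `itertools`):

```python
# Standard-library imports first, then the non-stdlib imports together,
# so the new ThreadPool import no longer splits the numpy group.
import itertools

import numpy as np
from multiprocessing.pool import ThreadPool
```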