[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82067/testReport)** for PR 18945 at commit [`6e16cd8`](https://github.com/apache/spark/commit/6e16cd82434c82cd7213ae8ef2b52e1c42e607cf).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82067/
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Merged build finished. Test PASSed.
[GitHub] spark pull request #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calc...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19281
[GitHub] spark issue #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calculate i...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19281

Thanks! Merged to master.
[GitHub] spark issue #19281: [SPARK-21998][SQL] SortMergeJoinExec did not calculate i...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19281

LGTM
[GitHub] spark issue #19261: [SPARK-22040] Add current_date function with timezone id
Github user jaceklaskowski commented on the issue: https://github.com/apache/spark/pull/19261

@rxin @gatorsmile Let me ask you a very similar question then: why does the `CurrentDate` operator have the optional timezone parameter? What's its purpose? Wouldn't that answer your questions? I don't mind not having the change, but I am curious what the reason for the "mismatch" is.
[GitHub] spark issue #19319: [SPARK-21766][PySpark][SQL] DataFrame toPandas() raises ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19319

I'd go with this PR / approach. This approach and PR look pretty good. Let me help double check this tonight.
[GitHub] spark issue #19320: [SPARK-22099] The 'job ids' list style needs to be chang...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19320

Can one of the admins verify this patch?
[GitHub] spark issue #19319: [SPARK-21766][PySpark][SQL] DataFrame toPandas() raises ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19319

**[Test build #82069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82069/testReport)** for PR 19319 at commit [`779eb40`](https://github.com/apache/spark/commit/779eb400790cb04ff4d62d7701a2af1d3d58175f).
[GitHub] spark pull request #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide ju...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r140421501

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---
```
@@ -61,7 +59,37 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
       details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
     }}

-    UIUtils.headerSparkPage("SQL", content, parent, Some(5000))
+
+    val summary: NodeSeq =
+      {
+        if (listener.getRunningExecutions.nonEmpty) {
+          Running Queries:
+          {listener.getRunningExecutions.size}
+        }
+      }
+      {
+        if (listener.getCompletedExecutions.nonEmpty) {
+          Completed Queries:
+          {listener.getCompletedExecutions.size}
+        }
+      }
+      {
+        if (listener.getFailedExecutions.nonEmpty) {
+          Failed Queries:
+          {listener.getFailedExecutions.size}
+        }
+      }
```
--- End diff --

Please follow the style in the other files of the package `org.apache.spark.sql.execution.ui`.
[GitHub] spark pull request #19320: [SPARK-22099] The 'job ids' list style needs to b...
GitHub user guoxiaolongzte opened a pull request: https://github.com/apache/spark/pull/19320

[SPARK-22099] The 'job ids' list style needs to be changed in the SQL page.

## What changes were proposed in this pull request?

The 'job ids' list style needs to be changed in the SQL page, for two reasons:

1. If each job id is on its own line and there are a lot of job ids, the table row becomes very tall. As shown below:
2. It should be consistent with the 'JDBC / ODBC Server' page style; I modified the style to match it. As shown below:

My changes are as follows:

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/guoxiaolongzte/spark SPARK-22099

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19320.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19320

commit f6331c4b9922c3fce4bb2a8b0fedb66c16017b75
Author: guoxiaolong
Date: 2017-09-22T06:33:52Z

    [SPARK-22099] The 'job ids' list style needs to be changed in the SQL page
[GitHub] spark issue #19319: [SPARK-21766][PySpark][SQL] DataFrame toPandas() raises ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19319

**[Test build #82068 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82068/testReport)** for PR 19319 at commit [`e12f576`](https://github.com/apache/spark/commit/e12f5768436543bfbb78fd0bb39b48c96e04286c).
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140420898

--- Diff: python/pyspark/sql/dataframe.py ---
```
@@ -1760,13 +1760,39 @@ def toPandas(self):
                     "if using spark.sql.execution.arrow.enable=true"
                 raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 import numpy as np
                 dtype = {}
                 nullable_int_columns = set()

                 def null_handler(rows, nullable_int_columns):
                     requires_double_precision = set()
                     for row in rows:
                         row = row.asDict()
                         for column in nullable_int_columns:
                             val = row[column]
                             dt = dtype[column]
                             if val is None and dt not in (np.float32, np.float64):
                                 dt = np.float64 if column in requires_double_precision else np.float32
                                 dtype[column] = dt
                             elif val is not None:
                                 if abs(val) > 16777216:  # Max value before np.float32 loses precision.
```
--- End diff --

I think they are represented as np.float64. I added a test in #19319.
[GitHub] spark pull request #19319: [SPARK-21766][PySpark][SQL] DataFrame toPandas() ...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/19319

[SPARK-21766][PySpark][SQL] DataFrame toPandas() raises ValueError with nullable int columns

## What changes were proposed in this pull request?

When calling `DataFrame.toPandas()` (without Arrow enabled), if there is an `IntegralType` column (`IntegerType`, `ShortType`, `ByteType`) that has null values, the following exception is thrown:

    ValueError: Cannot convert non-finite values (NA or inf) to integer

This is because the null values are first converted to float NaN during the construction of the Pandas DataFrame in `from_records`, and the subsequent attempt to convert them back to integer fails. The fix checks whether the conversion would cause such a failure in the Pandas DataFrame; if so, we skip the conversion and use the type inferred by Pandas.

## How was this patch tested?

Added a pyspark test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-21766

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19319.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19319
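The failure mode described in this PR is easy to see with pandas alone. The sketch below is my own illustration (not code from the PR): a nullable "int" column is promoted to float64 with NaN when the DataFrame is built, and casting it back to the schema's integer type then raises the quoted ValueError.

```python
import numpy as np
import pandas as pd

# A nullable integer column: pandas promotes it to float64 and turns None into NaN.
pdf = pd.DataFrame({"x": [1, None]})
print(pdf["x"].dtype)  # float64

# Casting back to the schema's integer type then fails, as described above.
try:
    pdf["x"] = pdf["x"].astype(np.int32)
except ValueError as e:
    print(e)
```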
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82067/testReport)** for PR 18945 at commit [`6e16cd8`](https://github.com/apache/spark/commit/6e16cd82434c82cd7213ae8ef2b52e1c42e607cf).
[GitHub] spark issue #19303: [SPARK-22085][CORE]When the application has no core left...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19303

IIUC, if there's no core left, requesting new executors should be a no-op, am I right? So there should be no problem even without your fix? From your patch, it looks like you're putting standalone-specific logic into this general `ExecutorAllocationManager`; personally I would suggest not doing that.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140419964

--- Diff: python/pyspark/sql/dataframe.py --- (same hunk as quoted above)
```
if abs(val) > 16777216:  # Max value before np.float32 loses precision.
```
--- End diff --

Values above this cannot be represented losslessly as a `np.float32`.
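The 16777216 cutoff discussed here is 2**24: `np.float32` has a 24-bit significand, so every integer up to that magnitude is exact, but beyond it gaps appear. A quick illustration (my own, not part of the patch):

```python
import numpy as np

# float32 holds 2**24 = 16777216 exactly, but 2**24 + 1 is not
# representable and rounds back down to 16777216.
print(int(np.float32(16777216)))  # 16777216
print(int(np.float32(16777217)))  # 16777216  <- precision lost

# float64, with a 53-bit significand, still holds the value exactly.
print(int(np.float64(16777217)))  # 16777217
```

This is why the patch escalates such columns from `np.float32` to `np.float64`.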
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18945

Hey @logannc, let's not make it complicated for now and go with their suggestions first - https://github.com/apache/spark/pull/18945#discussion_r134033952 and https://github.com/apache/spark/pull/18945#discussion_r134925269. Maybe we can make a followup later with some small benchmark results for the performance one and the precision concern (I guess this one is not a regression BTW?). I think we should first match it with the behaviour when `spark.sql.execution.arrow.enable` is enabled.
[GitHub] spark pull request #19279: [SPARK-22061] [ML]add pipeline model of SVM
Github user daweicheng closed the pull request at: https://github.com/apache/spark/pull/19279
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82066/
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82066 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82066/testReport)** for PR 18945 at commit [`d93a203`](https://github.com/apache/spark/commit/d93a2030d366bf1eb5ae2d6cc335894eddbc48dd).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140419255

--- Diff: python/pyspark/sql/dataframe.py --- (same hunk as quoted above)
```
if abs(val) > 16777216:  # Max value before np.float32 loses precision.
```
--- End diff --

Why do we need this?
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82063/
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82063 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82063/testReport)** for PR 18945 at commit [`bd25923`](https://github.com/apache/spark/commit/bd259239c550b0b19311968aff9a69da29a6a05e).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82065/
[GitHub] spark issue #19315: Updated english.txt word ordering
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19315

@animenon can you please fix the PR title like other PRs did. Also, is this only for better readability, or do you fix any other issue? IMO the previous txt was more readable than your change, since the words were grouped by kind.
[GitHub] spark issue #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide jump link...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18015

**[Test build #82064 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82064/testReport)** for PR 18015 at commit [`21e2c31`](https://github.com/apache/spark/commit/21e2c31369b2223d0bee16b9bc98373ab0ec59a9).
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user logannc commented on the issue: https://github.com/apache/spark/pull/18945

I've continued to use @HyukjinKwon 's suggestion because it should be more performant and is capable of handling it without loss of precision. I believe I've addressed your concerns by only changing the type when we encounter a null (duh).
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82063 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82063/testReport)** for PR 18945 at commit [`bd25923`](https://github.com/apache/spark/commit/bd259239c550b0b19311968aff9a69da29a6a05e).
[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19301#discussion_r140416279

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala ---
```
@@ -72,11 +74,19 @@ object AggregateExpression {
       aggregateFunction: AggregateFunction,
       mode: AggregateMode,
       isDistinct: Boolean): AggregateExpression = {
+    val state = if (aggregateFunction.resolved) {
+      Seq(aggregateFunction.toString, aggregateFunction.dataType,
+        aggregateFunction.nullable, mode, isDistinct)
+    } else {
+      Seq(aggregateFunction.toString, mode, isDistinct)
+    }
+    val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
+
     AggregateExpression(
       aggregateFunction,
       mode,
       isDistinct,
-      NamedExpression.newExprId)
+      ExprId(hashCode))
```
--- End diff --

I don't think this is the right fix. Semantically, the `b0` and `b1` in `SELECT SUM(b) AS b0, SUM(b) AS b1` are different aggregate functions, so they should have different `resultId`s. It's kind of an optimization in the aggregate planner: we should detect these semantically different but duplicated aggregate functions and only plan one aggregate function.
[GitHub] spark issue #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide jump link...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/18015

Yes, I'm fine with it. @ajbozarth would you please take another look at this PR? Thanks.
[GitHub] spark issue #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide jump link...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/18015

Jenkins, retest this please.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578

@mallman how about adding a comment explaining why such a workaround was done, plus the bug number in parquet-mr? That way, once the bug is fixed, the code can be cleaned up. Also, maybe it's time to remove "DO NOT MERGE" from the title? As I understand, most of the comments were addressed :) Thank you very much for the work on this feature. I must admit that we are looking forward to having this merged. For us this will be the most important improvement in Spark 2.3.0 (I hope it will be part of 2.3.0 :) )
[GitHub] spark pull request #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide ju...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r140416046

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala --- (same hunk as quoted above, the `val summary: NodeSeq` block)
--- End diff --

Is the indentation here correct? This seems a little weird to me.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Merged build finished. Test FAILed.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82062/
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82062/testReport)** for PR 18945 at commit [`6e248dd`](https://github.com/apache/spark/commit/6e248ddf96122910468a3f20125ff4fc9f32f299).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140415073

--- Diff: python/pyspark/sql/dataframe.py ---
```
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
                 columns_with_null_int = set()

                 def null_handler(rows, columns_with_null_int):
                     for row in rows:
                         row = row.asDict()
                         for column in columns_with_null_int:
                             val = row[column]
                             dt = dtype[column]
                             if val is not None:
                                 if abs(val) > 16777216:  # Max value before np.float32 loses precision.
                                     val = np.float64(val)
                                     dt = np.float64
                                     dtype[column] = np.float64
                                 else:
                                     val = np.float32(val)
                                     if dt not in (np.float32, np.float64):
                                         dt = np.float32
                                         dtype[column] = np.float32
                             row[column] = val
                         row = Row(**row)
                         yield row

                 row_handler = lambda x, y: x
                 for field in self.schema:
                     pandas_type = _to_corrected_pandas_type(field.dataType)
                     if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
                         columns_with_null_int.add(field.name)
                         row_handler = null_handler
                         pandas_type = np.float32
```
--- End diff --

I will take my suggestion back. I think their suggestions are better than mine.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140414783

--- Diff: python/pyspark/sql/dataframe.py --- (same hunk as quoted above, ending at `pandas_type = np.float32`)
--- End diff --

Ah, I see where I got confused. I had started with @ueshin 's suggestion but abandoned it because I didn't want to create the DataFrame before the type correction, because I was also looking at @HyukjinKwon 's suggestion. I somehow ended up combining them incorrectly.
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945

**[Test build #82062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82062/testReport)** for PR 18945 at commit [`6e248dd`](https://github.com/apache/spark/commit/6e248ddf96122910468a3f20125ff4fc9f32f299).
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140414202

--- Diff: python/pyspark/sql/dataframe.py --- (same hunk as quoted above)
```
pandas_type = np.float32
```
--- End diff --

A simple problem with this line: even when this condition is met, it doesn't necessarily mean there are null values in the column, but you forcibly set the type to np.float32.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140414042

--- Diff: python/pyspark/sql/dataframe.py --- (same hunk as quoted above)
```
pandas_type = np.float32
```
--- End diff --

Have you carefully read the comments in https://github.com/apache/spark/pull/18945#discussion_r134033952 and https://github.com/apache/spark/pull/18945#discussion_r134925269? They are good suggestions for this issue. I don't know why you don't want to follow them and check null values with Pandas...
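The alternative the linked comments point toward (check for nulls on the pandas side and only then downcast) could look roughly like the sketch below. This is a hedged illustration, not the PR's code: the `dtype` map stands in for what `_to_corrected_pandas_type` would derive from the Spark schema, and the column names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical dtype map derived from the schema; "a" and "b" are nullable ints.
dtype = {"a": np.int32, "b": np.int32}

pdf = pd.DataFrame({"a": [1, None], "b": [1, 2]})

for column, dt in dtype.items():
    # Only cast back to the integer type when the column holds no nulls;
    # otherwise keep the float dtype pandas inferred (NaN needs a float).
    if not pdf[column].isnull().any():
        pdf[column] = pdf[column].astype(dt, copy=False)

print(pdf.dtypes)  # a -> float64, b -> int32
```

This avoids iterating over every row in Python, since the null check is done column-wise by pandas.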
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140413579

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

Can you elaborate? I believe it is correct, per my reply to your comment in the `null_handler`.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19204 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19204: [SPARK-21981][PYTHON][ML] Added Python interface for Clu...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/19204 Merged into master, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82061 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82061/testReport)** for PR 18945 at commit [`14f36c3`](https://github.com/apache/spark/commit/14f36c354f65a34e3e06cd4d35029e5f8f2b79f0). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82061/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82061 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82061/testReport)** for PR 18945 at commit [`14f36c3`](https://github.com/apache/spark/commit/14f36c354f65a34e3e06cd4d35029e5f8f2b79f0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140412857

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = {}
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
--- End diff --

If `pandas_type in (np.int8, np.int16, np.int32) and field.nullable` and there are ANY non-null values, the dtype of the column is changed to `np.float32` or `np.float64`, both of which properly handle `None` values. That said, if the entire column were `None`, it would fail. Therefore I have preemptively changed the type on line 1787 to `np.float32`. Per `null_handler`, it may still change to `np.float64` if needed.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
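The float32/float64 split in this patch hinges on 16777216 = 2**24, the largest magnitude up to which np.float32 can represent every integer exactly. A standalone NumPy illustration (not Spark code):

```python
import numpy as np

# 2**24 == 16777216 is the last point where float32 still represents every
# integer exactly; 16777217 has no float32 representation and rounds away.
assert np.float32(16777216) == np.float32(16777217)  # precision lost
assert np.float64(16777216) != np.float64(16777217)  # float64 still exact
```

This is why values beyond that magnitude have to be widened to np.float64 rather than kept as np.float32.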
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18945#discussion_r140412632

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x,y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

I don't think this is a correct fix.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82060 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82060/testReport)** for PR 18945 at commit [`b313a3b`](https://github.com/apache/spark/commit/b313a3b8fc88898423940f195ab16bd3a57c0061). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82060/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18945 **[Test build #82060 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82060/testReport)** for PR 18945 at commit [`b313a3b`](https://github.com/apache/spark/commit/b313a3b8fc88898423940f195ab16bd3a57c0061). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19229#discussion_r140412254

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2102,6 +2102,55 @@ class Dataset[T] private[sql](
   }

   /**
+   * Returns a new Dataset by adding columns or replacing the existing columns that has
+   * the same names.
+   */
+  private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
--- End diff --

@cloud-fan looked at this `withColumns` before in #17819. cc @cloud-fan in case you have more comments.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19314: [SPARK-22094][SS]processAllAvailable should check...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19314 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19229 ping @zhengruifeng @WeichenXu123 Any more comments on this? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19314 Thanks! Merging to master and branch-2.2 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290

I initially did this, for example,

```
\href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{Spark Data Types} for available data types.
```

This passes the lint check and the doc is fine, but the CRAN check fails. I tried to find a way around it but ended up with `nolint`. ... I will read the doc once more and try out a few more cases locally.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 Doh, you mean the current status. Yes, I checked. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19312#discussion_r140410448

--- Diff: dev/create-release/release-build.sh ---
@@ -95,6 +95,28 @@ if [ -z "$SPARK_VERSION" ]; then
   | grep -v INFO | grep -v WARNING | grep -v Download)
 fi

+# Verify we have the right java version set
+java_version=$("${JAVA_HOME}"/bin/javac -version 2>&1 | cut -d " " -f 2)
--- End diff --

@holdenk, should we maybe catch the case when `JAVA_HOME` is missing too? If so, I think we could do something like ...

```bash
if [ -z "$JAVA_HOME" ]; then
  echo "Please set JAVA_HOME."
  exit 1
fi
...
```

Or maybe ...

```bash
if [[ -x "$JAVA_HOME/bin/javac" ]]; then
  javac_cmd="$JAVA_HOME/bin/javac"
else
  javac_cmd=javac
fi
java_version=$("$javac_cmd" -version 2>&1 | cut -d " " -f 2)
...
```

I tested both locally.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19318 **[Test build #82059 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82059/testReport)** for PR 19318 at commit [`efb0fe9`](https://github.com/apache/spark/commit/efb0fe9c0544d8666c423ba9bde533735961ea75). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19318 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82059/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19318 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19290

btw, could you check, if you haven't already, whether roxygen is going to handle the `nolint` around the `http` link correctly?

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19318 **[Test build #82059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82059/testReport)** for PR 19318 at commit [`efb0fe9`](https://github.com/apache/spark/commit/efb0fe9c0544d8666c423ba9bde533735961ea75). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19318: [SPARK-22096][ML] use aggregateByKeyLocally in fe...
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/19318

[SPARK-22096][ML] use aggregateByKeyLocally in feature frequency calc…

## What changes were proposed in this pull request?

NaiveBayes currently uses aggregateByKey followed by a collect to calculate the frequency of each feature/label. We can implement a new function, 'aggregateByKeyLocally', in RDD that merges locally on each mapper before sending results to a reducer, saving one stage. We tested NaiveBayes and saw a ~20% performance gain with these changes.

Signed-off-by: Vincent Xie

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VinceShieh/spark SPARK-22096

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19318.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19318

commit efb0fe9c0544d8666c423ba9bde533735961ea75
Author: Vincent Xie
Date: 2017-09-22T03:57:08Z

[SPARK-22096][ML] use aggregateByKeyLocally in feature frequency calculation

NaiveBayes currently uses aggregateByKey followed by a collect to calculate the frequency of each feature/label. We can implement a new function 'aggregateByKeyLocally' in RDD that merges locally on each mapper before sending results to a reducer to save one stage. We tested NaiveBayes and saw a ~20% performance gain with these changes.

Signed-off-by: Vincent Xie

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19122 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19122 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82058/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19122 **[Test build #82058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82058/testReport)** for PR 19122 at commit [`3464dfe`](https://github.com/apache/spark/commit/3464dfea1f008e945a5e608b593877d1cbdf0e35). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19317 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19317 cc @VinceShieh --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...
GitHub user ConeyLiu opened a pull request: https://github.com/apache/spark/pull/19317

[SPARK-22098][CORE] Add new method aggregateByKeyLocally in RDD

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-22096

NaiveBayes currently uses aggregateByKey followed by a collect to calculate the frequency of each feature/label. We can implement a new function, 'aggregateByKeyLocally', in RDD that merges locally on each mapper before sending results to a reducer, saving one stage. We tested NaiveBayes and saw a ~20% performance gain with these changes. This is a subtask of our improvement.

## How was this patch tested?

New UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ConeyLiu/spark aggregatebykeylocally

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19317.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19317

commit 73a85dc5963ac46f181a9499deabb18da4ccc308
Author: Xianyang Liu
Date: 2017-08-31T05:16:09Z

add new method 'aggregateByKeyLocally'

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
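A rough Python sketch of what such an `aggregateByKeyLocally` could do (hypothetical, for illustration only — the actual PR implements this in Scala): merge values per key inside each partition, then combine the small per-partition maps on the driver, so no shuffle stage is needed.

```python
def aggregate_by_key_locally(rdd, zero_value, seq_op, comb_op):
    """rdd: (key, value) pairs exposing mapPartitions()/collect()."""
    def merge_partition(iterator):
        acc = {}
        for key, value in iterator:
            # Fold each value into the per-partition accumulator for its key.
            acc[key] = seq_op(acc.get(key, zero_value), value)
        yield acc  # one partial map per partition

    result = {}
    # Collect one small dict per partition and merge them on the driver.
    for partial in rdd.mapPartitions(merge_partition).collect():
        for key, value in partial.items():
            result[key] = comb_op(result[key], value) if key in result else value
    return result
```

For NaiveBayes-style frequency counting, `seq_op` and `comb_op` would both be addition; the driver receives one map per partition instead of a shuffled RDD.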
[GitHub] spark issue #19316: [SPARK-22097][CORE]Call serializationStream.close after ...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19316 @cloud-fan Pls take a look. Thanks a lot. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19312 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82056/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19312 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19312: [SPARK-22072][SPARK-22071][BUILD]Improve release build s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19312 **[Test build #82056 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82056/testReport)** for PR 19312 at commit [`aa4cbf6`](https://github.com/apache/spark/commit/aa4cbf69b080435bc836dc9820307fba6588). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...
Github user ConeyLiu commented on a diff in the pull request: https://github.com/apache/spark/pull/19316#discussion_r140408246

--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -387,11 +387,18 @@ private[spark] class MemoryStore(
       // the block's actual memory usage has exceeded the unroll memory by a small amount, so we
       // perform one final call to attempt to allocate additional memory if necessary.
       if (keepUnrolling) {
-        serializationStream.close()
-        reserveAdditionalMemoryIfNecessary()
+        serializationStream.flush()
+        if (bbos.size > unrollMemoryUsedByThisBlock) {
+          val amountToRequest = bbos.size - unrollMemoryUsedByThisBlock
+          keepUnrolling = reserveUnrollMemoryForThisTask(blockId, amountToRequest, memoryMode)
+          if (keepUnrolling) {
+            unrollMemoryUsedByThisBlock += amountToRequest
+          }
+        }
       }

       if (keepUnrolling) {
+        serializationStream.close()
--- End diff --

Here we should close the `serializationStream` only after we have checked the size again. The previous code closed it first and then requested the excess memory, so there was a potential problem: we might fail to request enough memory while the `serializationStream` was already closed.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...
Github user ConeyLiu commented on a diff in the pull request: https://github.com/apache/spark/pull/19316#discussion_r140408116

--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -387,11 +387,18 @@ private[spark] class MemoryStore(
       // the block's actual memory usage has exceeded the unroll memory by a small amount, so we
       // perform one final call to attempt to allocate additional memory if necessary.
       if (keepUnrolling) {
-        serializationStream.close()
-        reserveAdditionalMemoryIfNecessary()
+        serializationStream.flush()
+        if (bbos.size > unrollMemoryUsedByThisBlock) {
+          val amountToRequest = bbos.size - unrollMemoryUsedByThisBlock
--- End diff --

Here we only need to request `bbos.size - unrollMemoryUsedByThisBlock`. I'm sorry, this mistake may have been introduced by my previous patch [SPARK-21923](https://github.com/apache/spark/pull/19135).

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19316: [SPARK-22097][CORE]Call serializationStream.close after ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19316 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19316: [SPARK-22097][CORE]Call serializationStream.close...
GitHub user ConeyLiu opened a pull request: https://github.com/apache/spark/pull/19316

[SPARK-22097][CORE] Call serializationStream.close after we requested enough memory

## What changes were proposed in this pull request?

In the current code, we close the `serializationStream` after we have unrolled the block. However, there is a potential problem: the size of the underlying vector or stream may be larger than the memory we requested, so we need to check it again carefully.

## How was this patch tested?

Existing UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ConeyLiu/spark putIteratorAsBytes

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19316.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19316

commit bfe162e3aad300414dcc3fe25a3d70025e1795dd
Author: Xianyang Liu
Date: 2017-09-22T03:29:39Z

close the serializationStream after check the memory request

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19168: [SPARK-21956][CORE] Fetch up to max bytes when buf reall...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19168

Sorry for replying so late. I added some benchmark tests for this PR, @kiszk. And @jerryshao, could you help review this PR? Thanks.

```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 10 iterations, 2423 ms
  Running case: Testing fetch after releasing!
  Stopped after 18 iterations, 2036 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
Testing fetch before releasing!                           46 /  242      345.0          2.9      1.0X
Testing fetch after releasing!                            73 /  113      215.7          4.6      0.6X
```

```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 10 iterations, 3888 ms
  Running case: Testing fetch after releasing!
  Stopped after 10 iterations, 3970 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
Testing fetch before releasing!                          100 /  389      157.8          6.3      1.0X
Testing fetch after releasing!                           151 /  397      104.3          9.6      0.7X
```

```
Running benchmark: Benchmark fetch before vs after releasing buffer
  Running case: Testing fetch before releasing!
  Stopped after 15 iterations, 2016 ms
  Running case: Testing fetch after releasing!
  Stopped after 14 iterations, 2110 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 4.4.0-64-generic
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Benchmark fetch before vs after releasing buffer:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
Testing fetch before releasing!                           43 /  134      363.8          2.7      1.0X
Testing fetch after releasing!                            99 /  151      158.1          6.3      0.4X
```

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19278 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19278 **[Test build #82057 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82057/testReport)** for PR 19278 at commit [`8f78f59`](https://github.com/apache/spark/commit/8f78f596473877f3e8a0169f998f16a6bf1a8f5a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19278 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82057/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19278

@jkbradley Sure, I tested the backwards compatibility. Part of the reason I changed to `DefaultParamReader.getAndSetParams` was backwards compatibility.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19122 **[Test build #82058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82058/testReport)** for PR 19122 at commit [`3464dfe`](https://github.com/apache/spark/commit/3464dfea1f008e945a5e608b593877d1cbdf0e35). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r140402700

--- Diff: python/pyspark/ml/tests.py ---
@@ -836,6 +836,27 @@ def test_save_load_simple_estimator(self):
         loadedModel = CrossValidatorModel.load(cvModelPath)
         self.assertEqual(loadedModel.bestModel.uid, cvModel.bestModel.uid)

+    def test_parallel_evaluation(self):
+        dataset = self.spark.createDataFrame(
+            [(Vectors.dense([0.0]), 0.0),
+             (Vectors.dense([0.4]), 1.0),
+             (Vectors.dense([0.5]), 0.0),
+             (Vectors.dense([0.6]), 1.0),
+             (Vectors.dense([1.0]), 1.0)] * 10,
+            ["features", "label"])
+
+        lr = LogisticRegression()
+        grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
+        evaluator = BinaryClassificationEvaluator()
+
+        # test save/load of CrossValidator
+        cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
+        cv.setParallelism(1)
+        cvSerialModel = cv.fit(dataset)
+        cv.setParallelism(2)
+        cvParallelModel = cv.fit(dataset)
+        self.assertEqual(sorted(cvSerialModel.avgMetrics), sorted(cvParallelModel.avgMetrics))
--- End diff --

hmm... I tried. But how to get model parents?
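The assertion at the end of that diff compares serial and parallel runs by their sorted metric lists. The underlying idea can be sketched with the standard library alone; the `evaluate` function below is a hypothetical, deterministic stand-in for fitting and scoring one parameter combination, not Spark's actual CrossValidator:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(params):
    # Hypothetical stand-in for fitting one param combination and
    # scoring it; deterministic so serial and parallel runs agree.
    return params["maxIter"] * 0.1 + 0.5

grid = [{"maxIter": m} for m in (0, 1, 5, 10)]

# Serial pass: one evaluation at a time, in grid order.
serial_metrics = [evaluate(p) for p in grid]

# Parallel pass: the same evaluations fanned out over a thread pool.
# executor.map preserves input order, so the lists line up directly;
# sorting (as the test above does) also guards against reordering.
with ThreadPoolExecutor(max_workers=2) as executor:
    parallel_metrics = list(executor.map(evaluate, grid))

assert sorted(serial_metrics) == sorted(parallel_metrics)
```

Because the per-combination evaluation is deterministic, parallelism changes only the completion order, never the set of metrics, which is what the test's sorted comparison checks.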
[GitHub] spark issue #19315: Updated english.txt word ordering
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19315

Can one of the admins verify this patch?
[GitHub] spark pull request #19315: Updated english.txt word ordering
GitHub user animenon opened a pull request: https://github.com/apache/spark/pull/19315

Updated english.txt word ordering

Ordered alphabetically, for better readability.

## What changes were proposed in this pull request?

Alphabetical ordering of the stop words.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/animenon/spark patch-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19315.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19315

commit 57c282721c63487a82bdd6959c6ff5f6ce9f66ad
Author: Anirudh
Date: 2017-09-22T02:40:30Z

    Updated english.txt word ordering

    Ordered alphabetically, for better readability.
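The change itself is mechanical; a minimal sketch of alphabetically sorting a stop-word list, one word per line, is shown below (sample words only, not the real contents of english.txt):

```python
# Alphabetically sort a stop-word list, one word per line,
# as proposed for english.txt (sample words only).
words = ["the", "a", "and", "about", "zero", "because"]
sorted_words = sorted(words)

# For a real file, the same idea reads, sorts, and rewrites the lines:
#   lines = open("english.txt").read().splitlines()
#   open("english.txt", "w").write("\n".join(sorted(lines)) + "\n")
```

Python's `sorted` compares strings lexicographically by code point, so for a plain ASCII word list this matches the alphabetical order the PR describes.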
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19314

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82055/
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19314

Merged build finished. Test PASSed.
[GitHub] spark issue #19314: [SPARK-22094][SS]processAllAvailable should check the qu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19314

**[Test build #82055 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82055/testReport)** for PR 19314 at commit [`a4a02a6`](https://github.com/apache/spark/commit/a4a02a69bf41906c03e46c50d0eca75d6844465a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13794: [SPARK-15574][ML][PySpark] Python meta-algorithms in Sca...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13794

cc @srowen Can you help close this? We won't need this feature for now.