[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r234432181 --- Diff: R/pkg/R/mllib_clustering.R --- @@ -610,3 +616,57 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + +#' PowerIterationClustering +#' +#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to +#' return a cluster assignment for each input vertex. +#' +# Run the PIC algorithm and returns a cluster assignment for each input vertex. +#' @param data A SparkDataFrame. +#' @param k The number of clusters to create. +#' @param initMode Param for the initialization algorithm. +#' @param maxIter Param for maximum number of iterations. +#' @param srcCol Param for the name of the input column for source vertex IDs. +#' @param dstCol Name of the input column for destination vertex IDs. +#' @param weightCol Param for weight column name. If this is not set or \code{NULL}, +#' we treat all instance weights as 1.0. +#' @param ... additional argument(s) passed to the method. +#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id. 
+#' The schema of it will be: +#' \code{id: Long} +#' \code{cluster: Int} +#' @rdname spark.powerIterationClustering +#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0), +#' list(1L, 2L, 1.0), list(3L, 4L, 1.0), +#' list(4L, 0L, 0.1)), schema = c("src", "dst", "weight")) +#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight") +#' showDF(clusters) +#' } +#' @note spark.assignClusters(SparkDataFrame) since 3.0.0 +setMethod("spark.assignClusters", + signature(data = "SparkDataFrame"), + function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src", +dstCol = "dst", weightCol = NULL) { --- End diff -- I think we try to avoid srcCol dstCol in R (I think there are other R ml APIs like that) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r234432019 --- Diff: R/pkg/R/mllib_clustering.R --- @@ -610,3 +616,57 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + +#' PowerIterationClustering +#' +#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to +#' return a cluster assignment for each input vertex. +#' +# Run the PIC algorithm and returns a cluster assignment for each input vertex. +#' @param data A SparkDataFrame. +#' @param k The number of clusters to create. +#' @param initMode Param for the initialization algorithm. +#' @param maxIter Param for maximum number of iterations. +#' @param srcCol Param for the name of the input column for source vertex IDs. +#' @param dstCol Name of the input column for destination vertex IDs. +#' @param weightCol Param for weight column name. If this is not set or \code{NULL}, +#' we treat all instance weights as 1.0. +#' @param ... additional argument(s) passed to the method. +#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id. 
+#' The schema of it will be: +#' \code{id: Long} +#' \code{cluster: Int} +#' @rdname spark.powerIterationClustering +#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0), +#' list(1L, 2L, 1.0), list(3L, 4L, 1.0), +#' list(4L, 0L, 0.1)), schema = c("src", "dst", "weight")) +#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight") +#' showDF(clusters) +#' } +#' @note spark.assignClusters(SparkDataFrame) since 3.0.0 +setMethod("spark.assignClusters", + signature(data = "SparkDataFrame"), + function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src", --- End diff -- set valid values for initMode and check for it - eg. https://github.com/apache/spark/pull/23072/files#diff-d9f92e07db6424e2527a7f9d7caa9013R355 and `match.arg(initMode)`
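The reviewer's suggestion is R's `match.arg()`, which restricts an argument to a fixed set of valid values and fails fast otherwise. A minimal sketch of the same idea, written in Python for illustration (the function name and the set of allowed values are assumptions, not the final API):

```python
# Hypothetical sketch of match.arg-style validation for initMode.
# "random" and "degree" are assumed to be the only valid values.
VALID_INIT_MODES = ("random", "degree")

def check_init_mode(init_mode="random"):
    # Reject anything outside the allowed set, like R's match.arg(initMode)
    if init_mode not in VALID_INIT_MODES:
        raise ValueError(
            "initMode should be one of %s, got %r" % (list(VALID_INIT_MODES), init_mode)
        )
    return init_mode
```

The point of the check is that a misspelled mode fails immediately on the driver side instead of surfacing later as an opaque JVM error.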
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r234432049 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -968,6 +970,17 @@ predicted <- predict(model, df) head(predicted) ``` + Power Iteration Clustering + +Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. `spark.assignClusters` method runs the PIC algorithm and returns a cluster assignment for each input vertex. + +```{r} +df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0), + list(1L, 2L, 1.0), list(3L, 4L, 1.0), + list(4L, 0L, 0.1)), schema = c("src", "dst", "weight")) +head(spark.assignClusters(df, initMode="degree", weightCol="weight")) --- End diff -- spacing: `initMode = "degree", weightCol = "weight"`
[GitHub] spark pull request #23073: [SPARK-26104] expose pci info to task scheduler
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23073#discussion_r234431864 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/ExecutorData.scala --- @@ -27,12 +27,14 @@ import org.apache.spark.rpc.{RpcAddress, RpcEndpointRef} * @param executorHost The hostname that this executor is running on * @param freeCores The current number of cores available for work on the executor * @param totalCores The total number of cores available to the executor + * @param pcis The external devices avaliable to the executor --- End diff -- available
[GitHub] spark issue #23073: [SPARK-26104] expose pci info to task scheduler
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23073 please put ^ comment into PR description (because comment is not included in commit message once the PR is merged)
[GitHub] spark pull request #23042: [SPARK-26070][SQL] add rule for implicit type coe...
Github user uzadude commented on a diff in the pull request: https://github.com/apache/spark/pull/23042#discussion_r234431689 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -138,6 +138,11 @@ object TypeCoercion { case (DateType, TimestampType) => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType) +// to support a popular use case of tables using Decimal(X, 0) for long IDs instead of strings +// see SPARK-26070 for more details +case (n: DecimalType, s: StringType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale)) --- End diff -- I personally agree with @cloud-fan that there are a few types that are "definitely safe", and since the user is not always responsible for his input tables, I believe convenience is more important than schema definitions. Also, even count() returns a bigint, so you'd have to filter 'count(*) > 100L', which would be a huge regression. I believe the "definitely safe" list is very short and we should use it. @mgaido91, in your examples I do agree that Double to Decimal is not safe, and so is String to almost anything. The trivially safe pairs are something like (Long, Int), (Int, Double), (Decimal, Decimal) - where both can be expanded to the same precision and scale - and maybe (Date, Timestamp).
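The "definitely safe" list discussed above can be thought of as a small explicit widening table: a coercion is allowed only if the type pair is listed. A rough sketch of that idea (type names and the pairs chosen are illustrative assumptions, not Spark's actual coercion rules):

```python
# Illustrative "definitely safe" implicit widenings; lookup is order-insensitive,
# so (int, long) and (long, int) both widen to long.
SAFE_WIDENINGS = {
    frozenset({"int", "long"}): "long",
    frozenset({"int", "double"}): "double",
    frozenset({"date", "timestamp"}): "timestamp",
}

def widen(a, b):
    """Return the common wider type if the pair is definitely safe, else None."""
    if a == b:
        return a
    return SAFE_WIDENINGS.get(frozenset({a, b}))
```

Anything not in the table (e.g. double vs. string) gets no implicit coercion and would require an explicit cast, which is the conservative behavior the thread argues over.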
[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...
Github user DaveDeCaprio commented on the issue: https://github.com/apache/spark/pull/23076 This contribution is my original work and I license the work to the project under the project's open source license.
[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23076 Can one of the admins verify this patch?
[GitHub] spark pull request #23076: [SPARK-26103][SQL] Added maxDepth to limit the le...
GitHub user DaveDeCaprio opened a pull request: https://github.com/apache/spark/pull/23076 [SPARK-26103][SQL] Added maxDepth to limit the length of text plans Nested query plans can get extremely large (hundreds of megabytes). ## What changes were proposed in this pull request? The PR puts in a limit on the nesting depth of trees to be printed when writing a plan string. * The default limit is 15, which allows for reasonably nested plans. * A new configuration parameter called spark.debug.maxToStringTreeDepth was added to control the depth. * When plans are truncated, "..." is printed to indicate that tree elements were removed. * A warning is printed out the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit. ## How was this patch tested? A new unit test in QueryExecutionSuite which creates a highly nested plan and then ensures that the printed plan is not too long. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/DaveDeCaprio/spark max-log-tree-depth Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23076.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23076 commit adc4f8efd4b51b77d3600bcba8f331e92f7ea3c6 Author: Dave DeCaprio Date: 2018-11-18T06:29:16Z Added maxDepth to treeString which limits the depth of the printed string. commit 3a9743fbc89358055c37cc45437f191fc5f15957 Author: Dave DeCaprio Date: 2018-11-18T06:34:42Z Added maxDepth to treeString which limits the depth of the printed string.
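The truncation described in the PR can be sketched as a recursive printer that stops descending past a depth limit and emits "..." in place of the elided subtrees. This is a toy `(name, children)` tree representation, not Spark's actual TreeNode API:

```python
def tree_string(node, max_depth=15, depth=1):
    """Render a (name, children) tree; below max_depth, print '...' instead of subtrees."""
    name, children = node
    lines = ["  " * (depth - 1) + name]
    if children:
        if depth >= max_depth:
            lines.append("  " * depth + "...")  # subtree elided past the limit
        else:
            lines += [tree_string(c, max_depth, depth + 1) for c in children]
    return "\n".join(lines)
```

For a deeply nested plan, the output size is now bounded by the depth limit rather than by the plan itself, which is the point of the change.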
[GitHub] spark issue #23069: [SPARK-26026][BUILD] Published Scaladoc jars missing fro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23069 **[Test build #4432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4432/testReport)** for PR 23069 at commit [`25a311b`](https://github.com/apache/spark/commit/25a311beb9da709b61931dca12a7c443f43efa65). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23065 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98972/ Test FAILed.
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23065 Merged build finished. Test FAILed.
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23065 **[Test build #98972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98972/testReport)** for PR 23065 at commit [`4665696`](https://github.com/apache/spark/commit/4665696f2b28e56b2aa15a2e1b85ce3ff11b3178). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/23054 Ok. I will add a flag. Thanks @rxin
[GitHub] spark issue #23075: [SPARK-26084][SQL] Fixes unresolved AggregateExpression....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23075 Can one of the admins verify this patch?
[GitHub] spark pull request #23075: [SPARK-26084][SQL] Fixes unresolved AggregateExpr...
GitHub user ssimeonov opened a pull request: https://github.com/apache/spark/pull/23075 [SPARK-26084][SQL] Fixes unresolved AggregateExpression.references exception ## What changes were proposed in this pull request? This PR fixes an exception in `AggregateExpression.references` called on unresolved expressions. It implements the solution proposed in [SPARK-26084](https://issues.apache.org/jira/browse/SPARK-26084), a minor refactoring that removes the unnecessary dependence on `AttributeSet.toSeq`, which requires expression IDs and, therefore, can only execute successfully for resolved expressions. The refactored implementation is both simpler and faster, eliminating the conversion of a `Set` to a `Seq` and back to `Set`. ## How was this patch tested? Local tests pass. I added no new tests as (a) the new behavior has no failing case and (b) this is a simple refactoring. @hvanhovell You can merge this pull request into a Git repository by running: $ git pull https://github.com/swoop-inc/spark ss_SPARK-26084 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23075.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23075 commit 178f0a5dff9f7eb8887ed711727b2f83af40ae8a Author: Simeon Simeonov Date: 2018-11-18T01:05:07Z [SPARK-26084][SQL] Fixes unresolved AggregateExpression.references exception Implements the solution proposed in [SPARK-26084](https://issues.apache.org/jira/browse/SPARK-26084), a minor refactoring that removes the unnecessary dependence on `AttributeSet.toSeq`, which requires expression IDs and, therefore, can only execute successfully for resolved expressions. The refactored implementation is both simpler and faster, eliminating the conversion of a `Set` to a `Seq` and back to `Set`. I added no new tests as (a) the new behavior has no failing case and (b) this is a simple refactoring. 
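The refactoring idea in the PR above - build the reference set directly from the children instead of materializing an AttributeSet, converting it to a Seq, and rebuilding a Set - can be sketched as follows. The `Expr` class here is a hypothetical stand-in, not Spark's AggregateExpression:

```python
class Expr:
    """Toy expression node; a leaf references itself by name."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.children = tuple(children)

    @property
    def references(self):
        if not self.children:
            return {self.name}
        refs = set()
        for child in self.children:
            # Direct set union: no Set -> Seq -> Set round trip, and no
            # dependence on expression IDs, so it works on unresolved nodes too.
            refs |= child.references
        return refs
```

Because the union never needs a stable ordering, nothing in the computation requires the expression to be resolved first, which is why the exception disappears.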
[GitHub] spark issue #23069: [SPARK-26026][BUILD] Published Scaladoc jars missing fro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23069 **[Test build #4432 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4432/testReport)** for PR 23069 at commit [`25a311b`](https://github.com/apache/spark/commit/25a311beb9da709b61931dca12a7c443f43efa65).
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23065 **[Test build #98972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98972/testReport)** for PR 23065 at commit [`4665696`](https://github.com/apache/spark/commit/4665696f2b28e56b2aa15a2e1b85ce3ff11b3178).
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23065 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5115/ Test PASSed.
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23065 Merged build finished. Test PASSed.
[GitHub] spark issue #23074: [SPARK-19798] Refresh table does not have effect on othe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23074 Can one of the admins verify this patch?
[GitHub] spark pull request #23074: [SPARK-19798] Refresh table does not have effect ...
GitHub user gbloisi opened a pull request: https://github.com/apache/spark/pull/23074 [SPARK-19798] Refresh table does not have effect on other sessions than the issuing one ## What changes were proposed in this pull request? Refresh table command does not have effect on other sessions than the issuing one. Move table relation cache from session catalog to session shared state so that different sessions can synchronize when a table is modified and refreshed. ## How was this patch tested? New test in HiveMetadataCacheSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/gbloisi/spark shared-session-cache Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23074.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23074 commit bdd677c94c4e198d1d012c3c66a06ba791dc95bb Author: Giambattista Bloisi Date: 2018-11-17T22:35:19Z Refresh table command do not have effect on other sessions than the issuing one. Move table relation cache from session catalog to session sharedstate so that different sessions can synchronize when refresh table command is issued. New test in HiveMetadataCacheSuite demonstrates the need.
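The core of the fix above - moving the relation cache from per-session state into shared state so a refresh issued in one session is visible to all sessions - can be sketched like this (class and method names are hypothetical, not Spark's actual SessionCatalog/SharedState API):

```python
class SharedState:
    """State shared by every session of the same application."""

    def __init__(self):
        self.relation_cache = {}


class Session:
    def __init__(self, shared):
        self.shared = shared  # every session points at the same SharedState

    def cache_relation(self, table, relation):
        self.shared.relation_cache[table] = relation

    def lookup_relation(self, table):
        return self.shared.relation_cache.get(table)

    def refresh_table(self, table):
        # Invalidation happens in shared state, so other sessions see it too.
        self.shared.relation_cache.pop(table, None)
```

With a per-session cache, `refresh_table` in one session would leave stale entries in every other session; sharing the cache makes the invalidation global, which is what the new HiveMetadataCacheSuite test checks.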
[GitHub] spark issue #23073: [SPARK-26104] expose pci info to task scheduler
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23073 Can one of the admins verify this patch?
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23072 Merged build finished. Test PASSed.
[GitHub] spark pull request #23073: [SPARK-26104] expose pci info to task scheduler
GitHub user chenqin opened a pull request: https://github.com/apache/spark/pull/23073 [SPARK-26104] expose pci info to task scheduler ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chenqin/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23073.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23073 commit 096ce4c1d85a9fad9a5601bd438f9bee86cad2c1 Author: Chen Qin Date: 2018-11-17T22:29:37Z expose pci info to task scheduler
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23072 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98971/ Test PASSed.
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23072 **[Test build #98971 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98971/testReport)** for PR 23072 at commit [`9e2b0f9`](https://github.com/apache/spark/commit/9e2b0f9ffe0866fa328bc677500e4f3a49ff384b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23071 Merged build finished. Test PASSed.
[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23071 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98970/ Test PASSed.
[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23071 **[Test build #98970 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98970/testReport)** for PR 23071 at commit [`3884aa3`](https://github.com/apache/spark/commit/3884aa39824914d1f710589b8c1a691780b04cc8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23072 Merged build finished. Test PASSed.
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23072 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5114/ Test PASSed.
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23072 **[Test build #98971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98971/testReport)** for PR 23072 at commit [`9e2b0f9`](https://github.com/apache/spark/commit/9e2b0f9ffe0866fa328bc677500e4f3a49ff384b).
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/23072 [SPARK-19827][R]spark.ml R API for PIC ## What changes were proposed in this pull request? Add PowerIterationClustering (PIC) in R ## How was this patch tested? Add test case You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-19827 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23072.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23072 commit 9e2b0f9ffe0866fa328bc677500e4f3a49ff384b Author: Huaxin Gao Date: 2018-11-17T21:25:46Z [SPARK-19827][R]spark.ml R API for PIC
[GitHub] spark issue #23066: [SPARK-26043][CORE] Make SparkHadoopUtil private to Spar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23066 **[Test build #4431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4431/testReport)** for PR 23066 at commit [`5a79b2e`](https://github.com/apache/spark/commit/5a79b2e73a658b5fffd6b605b109b63cd1c887e2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23071 Can one of the admins verify this patch?
[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23057 Merged build finished. Test PASSed.
[GitHub] spark pull request #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JS...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23071 [SPARK-26102][SQL][TEST] Extracting common CSV/JSON functions tests ## What changes were proposed in this pull request? Extracted common tests from `CsvFunctionsSuite` and `JsonFunctionsSuite` to the `FunctionsTests` trait. ## How was this patch tested? by `CsvFunctionsSuite` and `JsonFunctionsSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 common-functions-tests Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23071.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23071 commit 4b5b7fb7801d8d72be4db3781d8ca439b6eb102d Author: Maxim Gekk Date: 2018-11-17T16:26:29Z Extract common test to FunctionsTests commit 4b417ac974f67e6d61086a3c41abbc25854e150c Author: Maxim Gekk Date: 2018-11-17T17:33:56Z Extracted additional tests 1 commit 974aa8da4c7399205306baf2b34d1e8cb37d75c2 Author: Maxim Gekk Date: 2018-11-17T17:34:36Z Removing unused imports commit 3884aa39824914d1f710589b8c1a691780b04cc8 Author: Maxim Gekk Date: 2018-11-17T18:18:02Z Extracting the rest tests from CsvFunctionsSuite
[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23071 @dongjoon-hyun May I ask you to review the PR.
[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23071 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5113/ Test PASSed.
[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23071 Merged build finished. Test PASSed.
[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23057 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98969/ Test PASSed.
[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23057 **[Test build #98969 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98969/testReport)** for PR 23057 at commit [`86106fa`](https://github.com/apache/spark/commit/86106fadcaed6c1a4768138b3d72e8c892b7cd7f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23071 **[Test build #98970 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98970/testReport)** for PR 23071 at commit [`3884aa3`](https://github.com/apache/spark/commit/3884aa39824914d1f710589b8c1a691780b04cc8).
[GitHub] spark issue #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23038 Merged build finished. Test PASSed.
[GitHub] spark issue #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23038 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98966/ Test PASSed.
[GitHub] spark issue #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23038 **[Test build #98966 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98966/testReport)** for PR 23038 at commit [`ad30c36`](https://github.com/apache/spark/commit/ad30c36f63a0b7f14b69d1699291ed9cec591af6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23054 We should add a "legacy" flag in case somebody's workload gets broken by this. We can remove the legacy flag in a future release.
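The "legacy flag" pattern rxin suggests can be sketched as a simple config gate (a hedged illustration only: the config key name below is made up, and Spark's real mechanism is its SQL configuration system, not this hand-rolled map):

```java
import java.util.Map;

final class LegacyFlagExample {
    // Gate the new behavior on a config key so users whose workloads break
    // can opt back into the old behavior until the flag is removed.
    // The key name is hypothetical.
    static String keyBehavior(Map<String, String> conf) {
        boolean legacy = Boolean.parseBoolean(
            conf.getOrDefault("spark.sql.legacy.keyAttributeNaming", "false"));
        return legacy ? "old-behavior" : "new-behavior";
    }
}
```

The flag defaults to the new behavior, which is what lets it be dropped in a later release once users have migrated.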
[GitHub] spark issue #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23038 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98967/ Test PASSed.
[GitHub] spark issue #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23038 Merged build finished. Test PASSed.
[GitHub] spark issue #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23038 **[Test build #98967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98967/testReport)** for PR 23038 at commit [`dca941d`](https://github.com/apache/spark/commit/dca941d316543526ea429c2b6a993c2252d09fd6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/23038 It is a random failure.
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23070 Looks good to me. I or someone else should take a closer look tho.
[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Unsaf...
Github user adoron commented on the issue: https://github.com/apache/spark/pull/23043 @cloud-fan changing writeDouble/writeFloat in UnsafeWriter indeed does the trick! I'll fix the PR. I was thinking about making the change in `Platform::putDouble` so all accesses get affected, in UnsafeRow and UnsafeWriter as well.
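The negative-zero normalization being discussed can be sketched as follows (an illustration of the idea, not the actual UnsafeWriter code): `-0.0` and `+0.0` compare equal under IEEE 754 but have different bit patterns, so a writer that hashes or compares values byte-wise must canonicalize `-0.0` to `+0.0` before writing.

```java
final class NegativeZero {
    // -0.0 == 0.0 is true under IEEE 754, so this maps -0.0 to +0.0 and
    // leaves every other value untouched (NaN != 0.0, so NaN passes through).
    static double normalize(double v) {
        return v == 0.0d ? 0.0d : v;
    }

    static float normalize(float v) {
        return v == 0.0f ? 0.0f : v;
    }
}
```

Doing this at write time means downstream consumers that operate on the raw bytes (grouping, hashing, equality) see a single canonical representation of zero.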
[GitHub] spark issue #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23038 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98968/ Test FAILed.
[GitHub] spark issue #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23038 Merged build finished. Test FAILed.
[GitHub] spark issue #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23038 **[Test build #98968 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98968/testReport)** for PR 23038 at commit [`a21bc0c`](https://github.com/apache/spark/commit/a21bc0c3a24a468bf8147c7ee6f7ef12e384c454). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23070 Merged build finished. Test PASSed.
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98965/ Test PASSed.
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23070 **[Test build #98965 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98965/testReport)** for PR 23070 at commit [`bd2debc`](https://github.com/apache/spark/commit/bd2debcc2237ad178ef00b762bcdc80b63d1ecb7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23069: [SPARK-26026][BUILD] Published Scaladoc jars missing fro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23069 **[Test build #4430 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4430/testReport)** for PR 23069 at commit [`25a311b`](https://github.com/apache/spark/commit/25a311beb9da709b61931dca12a7c443f43efa65). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/23016 Thank you @srowen
[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23057#discussion_r234412635 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala --- @@ -119,7 +139,7 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper { // (A.A1 = B.B1 OR ISNULL(A.A1 = B.B1)) AND (B.B2 = A.A2) AND B.B3 > 1 val finalJoinCond = (nullAwareJoinConds ++ conditions).reduceLeft(And) // Deduplicate conflicting attributes if any. - dedupJoin(Join(outerPlan, sub, LeftAnti, Option(finalJoinCond))) + dedupJoin(Join(outerPlan, newSub, LeftAnti, Option(finalJoinCond))) case (p, predicate) => val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p) Project(p.output, Filter(newCond.get, inputPlan)) --- End diff -- Can you try this test case? ```scala val df1 = spark.sql( """ |SELECT id,num,source FROM ( | SELECT id, num, 'a' as source FROM a | UNION ALL | SELECT id, num, 'b' as source FROM b |) AS c WHERE c.id IN (SELECT id FROM b WHERE num = 2) OR |c.id IN (SELECT id FROM b WHERE num = 3) """.stripMargin) ```
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23065 **[Test build #4429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4429/testReport)** for PR 23065 at commit [`e2e375b`](https://github.com/apache/spark/commit/e2e375b592ccbbf2e468736fb2ee00b33787c58e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22986: [SPARK-25959][ML] GBTClassifier picks wrong impur...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22986
[GitHub] spark pull request #22986: [SPARK-25959][ML] GBTClassifier picks wrong impur...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22986#discussion_r234412241 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -258,11 +258,7 @@ private[ml] object TreeClassifierParams { private[ml] trait DecisionTreeClassifierParams extends DecisionTreeParams with TreeClassifierParams -/** - * Parameters for Decision Tree-based regression algorithms. - */ -private[ml] trait TreeRegressorParams extends Params { --- End diff -- I see. I am not sure how to verify that.
[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23016
[GitHub] spark pull request #22986: [SPARK-25959][ML] GBTClassifier picks wrong impur...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22986#discussion_r234412163 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -258,11 +258,7 @@ private[ml] object TreeClassifierParams { private[ml] trait DecisionTreeClassifierParams extends DecisionTreeParams with TreeClassifierParams -/** - * Parameters for Decision Tree-based regression algorithms. - */ -private[ml] trait TreeRegressorParams extends Params { --- End diff -- That's true, but I don't know if we can back-port it because of the binary incompatibility, internal as it may be. I don't know. If it's not an issue, then yes, it can be back-ported.
[GitHub] spark issue #22986: [SPARK-25959][ML] GBTClassifier picks wrong impurity sta...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22986 Merged to master
[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23016 Merged to master
[GitHub] spark issue #23066: [SPARK-26043][CORE] Make SparkHadoopUtil private to Spar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23066 **[Test build #4431 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4431/testReport)** for PR 23066 at commit [`5a79b2e`](https://github.com/apache/spark/commit/5a79b2e73a658b5fffd6b605b109b63cd1c887e2).
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98963/ Test PASSed.
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23070 Merged build finished. Test PASSed.
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23070 **[Test build #98963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98963/testReport)** for PR 23070 at commit [`bd2debc`](https://github.com/apache/spark/commit/bd2debcc2237ad178ef00b762bcdc80b63d1ecb7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23055 Merged build finished. Test PASSed.
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23055 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98962/ Test PASSed.
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23055 **[Test build #98962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98962/testReport)** for PR 23055 at commit [`52a91cc`](https://github.com/apache/spark/commit/52a91cc887462227caf65eb85c0f01d5e8fd0485). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23057 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5112/ Test PASSed.
[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23057 Merged build finished. Test PASSed.
[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23057 **[Test build #98969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98969/testReport)** for PR 23057 at commit [`86106fa`](https://github.com/apache/spark/commit/86106fadcaed6c1a4768138b3d72e8c892b7cd7f).
[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/23057#discussion_r234410124 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala --- @@ -119,7 +139,7 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper { // (A.A1 = B.B1 OR ISNULL(A.A1 = B.B1)) AND (B.B2 = A.A2) AND B.B3 > 1 val finalJoinCond = (nullAwareJoinConds ++ conditions).reduceLeft(And) // Deduplicate conflicting attributes if any. - dedupJoin(Join(outerPlan, sub, LeftAnti, Option(finalJoinCond))) + dedupJoin(Join(outerPlan, newSub, LeftAnti, Option(finalJoinCond))) case (p, predicate) => val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p) Project(p.output, Filter(newCond.get, inputPlan)) --- End diff -- mmmh... `rewriteExistentialExpr` operates on the result of the `foldLeft`, so every `InSubquery` there was already transformed using `dedupSubqueryOnSelfJoin`, right? So I don't think it is needed.
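The deduplication being debated can be pictured with a simplified, assumed sketch (not Catalyst code; `Attr` and `dedup` below are illustrative): in a self-join rewrite, the subquery's output attributes can carry the same expression ids as attributes referenced on the outer side, so the colliding outputs are re-aliased under fresh ids before the join condition is built.

```java
import java.util.*;

final class DedupSketch {
    // A stand-in for a Catalyst attribute: a name plus an expression id.
    static final class Attr {
        final String name;
        final int id;
        Attr(String name, int id) { this.name = name; this.id = id; }
    }

    // Re-alias any subquery output whose id collides with an id referenced
    // on the outer side; non-colliding outputs pass through unchanged.
    static List<Attr> dedup(Set<Integer> outerIds, List<Attr> subOutput,
                            Iterator<Integer> freshIds) {
        List<Attr> out = new ArrayList<>();
        for (Attr a : subOutput) {
            out.add(outerIds.contains(a.id) ? new Attr(a.name, freshIds.next()) : a);
        }
        return out;
    }
}
```

With the collision removed, a join condition such as `outer.id = sub.id` refers to two distinct attributes instead of one ambiguous attribute on both sides.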
[GitHub] spark issue #23066: [SPARK-26043][CORE] Make SparkHadoopUtil private to Spar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23066 **[Test build #4428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4428/testReport)** for PR 23066 at commit [`5a79b2e`](https://github.com/apache/spark/commit/5a79b2e73a658b5fffd6b605b109b63cd1c887e2). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23068 Merged build finished. Test PASSed.
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98960/ Test PASSed.
[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23057#discussion_r234409196 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala --- @@ -119,7 +139,7 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper { // (A.A1 = B.B1 OR ISNULL(A.A1 = B.B1)) AND (B.B2 = A.A2) AND B.B3 > 1 val finalJoinCond = (nullAwareJoinConds ++ conditions).reduceLeft(And) // Deduplicate conflicting attributes if any. - dedupJoin(Join(outerPlan, sub, LeftAnti, Option(finalJoinCond))) + dedupJoin(Join(outerPlan, newSub, LeftAnti, Option(finalJoinCond))) case (p, predicate) => val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p) Project(p.output, Filter(newCond.get, inputPlan)) --- End diff -- In `rewriteExistentialExpr`, there is a similar logic for `InSubquery`. Should we also do `dedupSubqueryOnSelfJoin` for it?
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23068 **[Test build #98960 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98960/testReport)** for PR 23068 at commit [`e7c2ebb`](https://github.com/apache/spark/commit/e7c2ebbda949918034cb9cb92ac6ef30af17d943). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23057#discussion_r234409212 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala --- @@ -1280,4 +1281,34 @@ class SubquerySuite extends QueryTest with SharedSQLContext { assert(subqueries.length == 1) } } + + test("SPARK-26078: deduplicate fake self joins for IN subqueries") { + withTempView("a", "b") { + val a = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row("a", 2), Row("b", 1))), + StructType(Seq(StructField("id", StringType), StructField("num", IntegerType)))) + val b = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row("a", 2), Row("b", 1))), + StructType(Seq(StructField("id", StringType), StructField("num", IntegerType)))) --- End diff -- The two schemas are the same. Can we define the schema just once?
[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23057#discussion_r234409158 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala --- @@ -70,6 +67,27 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper { case _ => joinPlan } + private def rewriteDedupPlan(plan: LogicalPlan, rewrites: AttributeMap[Alias]): LogicalPlan = { +val aliasedExpressions = plan.output.map { ref => + rewrites.getOrElse(ref, ref) +} +Project(aliasedExpressions, plan) + } + + private def dedupSubqueryOnSelfJoin(values: Seq[Expression], sub: LogicalPlan): LogicalPlan = { +val leftRefs = AttributeSet.fromAttributeSets(values.map(_.references)) +val rightRefs = AttributeSet(sub.output) --- End diff -- This is just `outputSet`?
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98959/ Test PASSed.
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23068 Merged build finished. Test PASSed.
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23068 **[Test build #98959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98959/testReport)** for PR 23068 at commit [`e7c2ebb`](https://github.com/apache/spark/commit/e7c2ebbda949918034cb9cb92ac6ef30af17d943). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23038 **[Test build #98968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98968/testReport)** for PR 23038 at commit [`a21bc0c`](https://github.com/apache/spark/commit/a21bc0c3a24a468bf8147c7ee6f7ef12e384c454).