[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14702 Can you update the description to say more about what this pr includes, and what future todos are? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user angolon commented on the issue: https://github.com/apache/spark/pull/14710 Done, sorry!
[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14705 surely - we should have said `lint-r` as the baseline. There's definitely more we could add though. It would be great if we have bandwidth to write more [linters](https://github.com/jimhester/lintr/blob/master/vignettes/creating_linters.Rmd) at some point.
[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/8880 Merged build finished. Test PASSed.
[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/8880 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64035/ Test PASSed.
[GitHub] spark issue #14709: [SPARK-17150][SQL] Support SQL generation for inline tab...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14709 I suspect array and struct literals will fail, looking at what Literal.sql does. That said, it's an existing problem and we can fix that later.
[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/8880 **[Test build #64035 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64035/consoleFull)** for PR 8880 at commit [`338210c`](https://github.com/apache/spark/commit/338210c21c1f9043bd58ee1bc4f84b32d4f65e7c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14700: [SPARK-17127]Make unaligned access in unsafe available f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14700 **[Test build #3226 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3226/consoleFull)** for PR 14700 at commit [`24bcf05`](https://github.com/apache/spark/commit/24bcf057311a387460af2f2fcb110a434aa53d9a).
[GitHub] spark issue #14711: [SPARK-16822] [DOC] [Support latex in scaladoc with Math...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14711 **[Test build #64044 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64044/consoleFull)** for PR 14711 at commit [`7cacb11`](https://github.com/apache/spark/commit/7cacb111390b2ca9531053444c631f914b864bc4).
[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14118 Also LGTM other than that major question.
[GitHub] spark issue #14699: [SPARK-17125][SPARKR] Allow to specify spark config usin...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14699 It's hard to say. Right now it is being converted on the [JVM side](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L63) - so it is possible to have `1` -> `1.0` -> `"1.0"`. Also, `convertNamedListToEnv` is being used in several other cases that seem to expect numeric types - could you check that?
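The `1` -> `1.0` -> `"1.0"` chain above can be sketched in plain Python (a hypothetical illustration, not Spark's actual conversion code; both helper names are made up): if a numeric config value is coerced to a double before being stringified, an integer-valued setting silently changes its textual form.

```python
# Hypothetical sketch of the coercion pitfall described above;
# neither helper exists in Spark.

def naive_to_string(value):
    # Coerce to a double first (as R does for a bare `1`), then format:
    # integer-valued settings pick up a trailing ".0".
    return str(float(value))

def careful_to_string(value):
    # Keep integer-valued settings in integer form.
    f = float(value)
    return str(int(f)) if f.is_integer() else str(f)

print(naive_to_string(1))      # 1.0
print(careful_to_string(1))    # 1
print(careful_to_string(0.5))  # 0.5
```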
[GitHub] spark pull request #14711: [SPARK-16822] [DOC] [Support latex in scaladoc wi...
GitHub user jagadeesanas2 opened a pull request: https://github.com/apache/spark/pull/14711 [SPARK-16822] [DOC] [Support latex in scaladoc with MathJax]

## What changes were proposed in this pull request?

LaTeX is rendered as plain code in `LinearRegression.scala`:

```scala
{{{
L = 1/2n||\sum_i w_i(x_i - \bar{x_i}) / \hat{x_i} - (y - \bar{y}) / \hat{y}||^2 + regTerms
}}}
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ibmsoe/spark SPARK-16822

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14711.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14711

commit 7cacb111390b2ca9531053444c631f914b864bc4
Author: Jagadeesan
Date: 2016-08-19T05:45:32Z

[SPARK-16822] [DOC] [Support latex in scaladoc with MathJax]
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75430253

--- Diff: R/pkg/R/SQLContext.R ---
@@ -727,6 +730,7 @@ dropTempView <- function(viewName) {
 #' @param source The name of external data source
 #' @param schema The data schema defined in structType
 #' @param na.strings Default string value for NA when source is "csv"
+#' @param ... additional external data source specific named propertie(s).
--- End diff --

ah sorry, I've no idea why this happened - pretty sure I've already corrected that.
[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14118 With this change, do all empty (e.g. zero sized string) values become null values once they are read back?
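The question above touches a classic CSV ambiguity. A plain-Python sketch (using the standard `csv` module, unrelated to Spark's CSV reader) shows why empty strings and missing values are indistinguishable after a round trip:

```python
import csv
import io

# Write a row containing an empty string and a missing (None) value,
# then read it back: both fields come back as the empty string "".
buf = io.StringIO()
csv.writer(buf).writerow(["a", "", None])
buf.seek(0)
row = next(csv.reader(buf))
print(row)  # ['a', '', '']
```

Because the on-disk form carries no distinction, a reader must pick a convention (e.g. treat empty fields as null, or as empty strings), which is exactly the behavioral choice being reviewed here.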
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75430152

--- Diff: R/pkg/R/functions.R ---
@@ -1848,7 +1850,7 @@ setMethod("upper",
 #' @note var since 1.6.0
 setMethod("var",
           signature(x = "Column"),
-          function(x) {
+          function(x, y, na.rm, use) {
--- End diff --

This is done only for the purpose of documenting `y, na.rm, use`.
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75430048

--- Diff: R/pkg/R/functions.R ---
@@ -3115,6 +3166,11 @@ setMethod("dense_rank",
 #'
 #' This is equivalent to the LAG function in SQL.
 #'
+#' @param x the column as a character string or a Column to compute on.
+#' @param offset the number of rows back from the current row from which to obtain a value.
+#'               If not specified, the default is 1.
+#' @param defaultValue default to use when the offset row does not exist.
+#' @param ... further arguments to be passed to or from other methods.
--- End diff --

Since we can only give one description, I guess "further arguments" might sound more reasonable?
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r7543

--- Diff: R/pkg/R/functions.R ---
@@ -3115,6 +3166,11 @@ setMethod("dense_rank",
 #'
 #' This is equivalent to the LAG function in SQL.
 #'
+#' @param x the column as a character string or a Column to compute on.
+#' @param offset the number of rows back from the current row from which to obtain a value.
+#'               If not specified, the default is 1.
+#' @param defaultValue default to use when the offset row does not exist.
+#' @param ... further arguments to be passed to or from other methods.
--- End diff --

This is actually the issue I raised at the end of #14558. In the same doc, it also includes the generic definition, which also comes with `...`, and this has the actual meaning "further arguments to be passed".
[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...
Github user sun-rui commented on the issue: https://github.com/apache/spark/pull/14639 If in the future SparkConf is needed, instead of passing all spark conf to R via env variables, we can expose API for accessing SparkConf in the R backend, similar to that in Pyspark. https://github.com/apache/spark/blob/master/python/pyspark/conf.py
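A minimal sketch of what such a frontend-side accessor could look like, modeled loosely on `pyspark/conf.py` (the class and method names here are hypothetical, and the plain dict stands in for a call into the JVM backend):

```python
class FrontendSparkConf:
    """Hypothetical read-only view of the driver-side SparkConf
    for a language frontend (names are illustrative, not Spark API)."""

    def __init__(self, backend_conf):
        # In a real implementation this would query the JVM backend;
        # a plain dict stands in for it here.
        self._conf = dict(backend_conf)

    def get(self, key, default=None):
        return self._conf.get(key, default)

    def contains(self, key):
        return key in self._conf

    def get_all(self):
        # Sorted key/value pairs, mirroring SparkConf.getAll's spirit.
        return sorted(self._conf.items())

conf = FrontendSparkConf({"spark.master": "yarn",
                          "spark.submit.deployMode": "cluster"})
print(conf.get("spark.submit.deployMode"))     # cluster
print(conf.contains("spark.executor.memory"))  # False
```

Such an accessor would let the frontend ask for any config value on demand instead of relying on a fixed set of environment variables.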
[GitHub] spark issue #14710: [SPARK-16533][CORE]
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14710 Can you put a more descriptive title for the change?
[GitHub] spark issue #14708: [SPARK-17149][SQL] array.sql for testing array related f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14708 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64034/ Test PASSed.
[GitHub] spark issue #14708: [SPARK-17149][SQL] array.sql for testing array related f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14708 Merged build finished. Test PASSed.
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user keypointt commented on the issue: https://github.com/apache/spark/pull/14447 @felixcheung sure, no problem
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75429772

--- Diff: R/pkg/R/mllib.R ---
@@ -620,11 +625,12 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models.
 #' Only categorical data is supported.
 #'
-#' @param data A \code{SparkDataFrame} of observations and labels for model fitting
-#' @param formula A symbolic description of the model to be fitted. Currently only a few formula
+#' @param data a \code{SparkDataFrame} of observations and labels for model fitting.
+#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
 #'operators are supported, including '~', '.', ':', '+', and '-'.
-#' @param smoothing Smoothing parameter
-#' @return \code{spark.naiveBayes} returns a fitted naive Bayes model
+#' @param ... additional argument(s) passed to the method. Currently only \code{smoothing}.
--- End diff --

Thanks. I guess this might be caused by the merging...
[GitHub] spark issue #14708: [SPARK-17149][SQL] array.sql for testing array related f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14708 **[Test build #64034 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64034/consoleFull)** for PR 14708 at commit [`1e89cc3`](https://github.com/apache/spark/commit/1e89cc3c35a22b7d42fe8f9ed23f16b66e92fa20).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14705 @felixcheung Thanks for kind explanation. BTW, it'd be great too if it just has a sentence, for example, `"For R code, Apache Spark follows lint-r"` in the wiki just like Python has `"For Python code, Apache Spark follows PEP 8 with one exception: lines can be up to 100 characters in length, not 79."` for just correctness and references for new contributors if it makes any sense maybe :).
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75429664

--- Diff: R/pkg/R/DataFrame.R ---
@@ -3187,6 +3221,7 @@ setMethod("histogram",
 #' @param x A SparkDataFrame
 #' @param url JDBC database url of the form `jdbc:subprotocol:subname`
 #' @param tableName The name of the table in the external database
+#' @param ... additional JDBC database connection properties.
--- End diff --

Done. Thanks!
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75429658

--- Diff: R/pkg/R/DataFrame.R ---
@@ -3003,9 +3036,10 @@ setMethod("str",
 #' Returns a new SparkDataFrame with columns dropped.
 #' This is a no-op if schema doesn't contain column name(s).
 #'
-#' @param x A SparkDataFrame.
-#' @param cols A character vector of column names or a Column.
-#' @return A SparkDataFrame
+#' @param x a SparkDataFrame.
+#' @param ... further arguments to be passed to or from other methods.
--- End diff --

Done.
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75429536

--- Diff: R/pkg/R/DataFrame.R ---
@@ -2464,8 +2489,10 @@ setMethod("unionAll",
 #' Union two or more SparkDataFrames. This is equivalent to `UNION ALL` in SQL.
 #' Note that this does not remove duplicate rows across the two SparkDataFrames.
 #'
-#' @param x A SparkDataFrame
-#' @param ... Additional SparkDataFrame
+#' @param x a SparkDataFrame.
+#' @param ... additional SparkDataFrame(s).
+#' @param deparse.level currently not used (put here to match the signature of
--- End diff --

The order here follows the order of arguments in the function declaration.
[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/14639 Thanks @sun-rui, `EXISTING_SPARKR_BACKEND_PORT` does indicate cluster mode indirectly for now. But it is not only deployMode that is unknown on the R side - master and other Spark configurations are too. For now R only uses master & deployMode, but in the future it may use other configurations, so I think propagating SparkConf from the JVM to R is a better long-term solution.
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75429434

--- Diff: R/pkg/R/generics.R ---
@@ -735,6 +752,8 @@ setGeneric("between", function(x, bounds) { standardGeneric("between") })
 setGeneric("cast", function(x, dataType) { standardGeneric("cast") })

 #' @rdname columnfunctions
+#' @param x a Column object.
+#' @param ... additional argument(s).
--- End diff --

It seems the functions are defined in the following way:

    column_functions2 <- c("like", "rlike", "getField", "getItem", "contains")
    createColumnFunction2 <- function(name) {
      setMethod(name, signature(x = "Column"), function(x, data) {
        if (class(data) == "Column") {
          data <- data@jc
        }
        jc <- callJMethod(x@jc, name, data)
        column(jc)
      })
    }

It seems the functions are not exported. Perhaps we still leave it there?
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14447 we are looking at establishing some guidelines in PR 14705. Let's hold on for another day or 2.
[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14639 **[Test build #64043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64043/consoleFull)** for PR 14639 at commit [`fef88cd`](https://github.com/apache/spark/commit/fef88cd5e62c65e7b5606f7baf00ecc4290ba12d).
[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14705 looking good - looks like we are very close.
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75429158

--- Diff: R/pkg/R/mllib.R ---
@@ -917,14 +922,14 @@ setMethod("spark.lda", signature(data = "SparkDataFrame"),
 # Returns a summary of the AFT survival regression model produced by spark.survreg,
 # similarly to R's summary().
-#' @param object A fitted AFT survival regression model
+#' @param object a fitted AFT survival regression model.
 #' @return \code{summary} returns a list containing the model's coefficients,
 #'         intercept and log(scale)
 #' @rdname spark.survreg
 #' @export
 #' @note summary(AFTSurvivalRegressionModel) since 2.0.0
 setMethod("summary", signature(object = "AFTSurvivalRegressionModel"),
-          function(object, ...) {
+          function(object) {
--- End diff --

probably I was more wondering why CRAN check didn't flag this...
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75429017 --- Diff: R/pkg/R/mllib.R --- @@ -504,14 +504,15 @@ setMethod("summary", signature(object = "IsotonicRegressionModel"), #' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make #' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. #' -#' @param data SparkDataFrame for training -#' @param formula A symbolic description of the model to be fitted. Currently only a few formula +#' @param data a SparkDataFrame for training. +#' @param formula a symbolic description of the model to be fitted. Currently only a few formula #'operators are supported, including '~', '.', ':', '+', and '-'. #'Note that the response variable of formula is empty in spark.kmeans. -#' @param k Number of centers -#' @param maxIter Maximum iteration number -#' @param initMode The initialization algorithm choosen to fit the model -#' @return \code{spark.kmeans} returns a fitted k-means model +#' @param ... additional argument(s) passed to the method. --- End diff -- Yeah, didn't notice this. Done.
[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14705 @inheritParams would be the way to go.
[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14705 @HyukjinKwon - we don't have a coding style guide for R; we have some style checks via lint-r. Also, the documentation style you are looking at is a bit different from coding style - I'm planning to write up the documentation style guide after this is merged. A coding style guide could be good too, for things like what to do with a method that takes no parameters.
[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/13796 @dbtsai Thanks for all of your meticulous review. Very much appreciated! Glad we can have MLOR in Spark ML now.
[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...
Github user junyangq commented on the issue: https://github.com/apache/spark/pull/14705 @shivaram I found what is perhaps a neat way to document R's glm, if we don't want to remove it: use `@inheritParams stats::glm`. That will bring in all the parameters from `stats::glm` that are not listed in SparkR's glm. That also means we need a slight modification of the `data` description: something like "a SparkDataFrame or R's glm data for training."
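As a sketch of the suggestion above (the function body and extra formals here are illustrative, not the actual SparkR source), `@inheritParams stats::glm` makes roxygen2 copy the `@param` entries for any formals not documented locally, so only the SparkR-specific `data` description needs to be written by hand:

```r
#' Generalized Linear Models
#'
#' Hypothetical roxygen block, shown only to illustrate @inheritParams;
#' parameters of glm that are not documented below (formula, family, ...)
#' are pulled in from the stats::glm documentation.
#'
#' @inheritParams stats::glm
#' @param data a SparkDataFrame or R's glm data for training.
#' @export
glm <- function(formula, family = gaussian, data, epsilon = 1e-6, maxit = 25) {
  # body elided; only the documentation tags matter for this example
}
```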
[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13796#discussion_r75428713 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala --- @@ -0,0 +1,611 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.classification + +import scala.collection.mutable + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN} +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.internal.Logging +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.linalg._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.linalg.VectorImplicits._ +import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{Dataset, Row} +import org.apache.spark.sql.functions.{col, lit} +import org.apache.spark.sql.types.DoubleType +import org.apache.spark.storage.StorageLevel + +/** + * Params for multinomial logistic (softmax) regression. + */ +private[classification] trait MultinomialLogisticRegressionParams + extends ProbabilisticClassifierParams with HasRegParam with HasElasticNetParam with HasMaxIter +with HasFitIntercept with HasTol with HasStandardization with HasWeightCol { + + /** + * Set thresholds in multiclass (or binary) classification to adjust the probability of + * predicting each class. Array must have length equal to the number of classes, with values >= 0. + * The class with largest value p/t is predicted, where p is the original probability of that + * class and t is the class' threshold. + * + * @group setParam + */ + def setThresholds(value: Array[Double]): this.type = { +set(thresholds, value) + } + + /** + * Get thresholds for binary or multiclass classification. + * + * @group getParam + */ + override def getThresholds: Array[Double] = { +$(thresholds) + } +} + +/** + * :: Experimental :: + * Multinomial Logistic (softmax) regression. 
+ */ +@Since("2.1.0") +@Experimental +class MultinomialLogisticRegression @Since("2.1.0") ( +@Since("2.1.0") override val uid: String) + extends ProbabilisticClassifier[Vector, +MultinomialLogisticRegression, MultinomialLogisticRegressionModel] +with MultinomialLogisticRegressionParams with DefaultParamsWritable with Logging { + + @Since("2.1.0") + def this() = this(Identifiable.randomUID("mlogreg")) + + /** + * Set the regularization parameter. + * Default is 0.0. + * + * @group setParam + */ + @Since("2.1.0") + def setRegParam(value: Double): this.type = set(regParam, value) + setDefault(regParam -> 0.0) + + /** + * Set the ElasticNet mixing parameter. + * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. + * For 0 < alpha < 1, the penalty is a combination of L1 and L2. + * Default is 0.0 which is an L2 penalty. + * + * @group setParam + */ + @Since("2.1.0") + def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value) + setDefault(elasticNetParam -> 0.0) + + /** + * Set the maximum number of iterations. + * Default is 100. + * + * @group setParam + */ + @Since("2.1.0") + def setMaxIter(value: Int): this.type = set(maxIter, value) + setDefault(maxIter -> 100) + + /** + * Set the convergence tolerance of iterations. + * Smaller value will lead to higher accuracy with the cost of more iterations. + * Default is
[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/13796 @sethah Thank you for this great weighted MLOR work in Spark 2.1. I merged this PR into master, and let's discuss/work on the followups in separate JIRAs. Thanks.
[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13796
[GitHub] spark issue #14279: [SPARK-16216][SQL] Read/write timestamps and dates in IS...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14279 BTW I think this is pretty important for the 2.0.1 release.
[GitHub] spark issue #14279: [SPARK-16216][SQL] Read/write timestamps and dates in IS...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14279 If we are introducing breaking changes to fix the bugs here, let's fix it for real. (definitely a problem if we can't specify dateFormat and timestampFormat separately).
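The point above — that one shared format option cannot serve both column types — can be sketched outside Spark with plain Python (the option names `dateFormat`/`timestampFormat` come from the PR discussion; the patterns below are merely illustrative ISO 8601 examples, not Spark's actual defaults):

```python
from datetime import datetime

# An ISO 8601 date has no time component, so a single shared format
# string cannot parse both date and timestamp columns: two independent
# patterns are needed, mirroring separate dateFormat/timestampFormat options.
DATE_FORMAT = "%Y-%m-%d"
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"

def parse_date(value):
    """Parse a date-only column value."""
    return datetime.strptime(value, DATE_FORMAT).date()

def parse_timestamp(value):
    """Parse a timestamp column value."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)

print(parse_date("2016-08-19"))                # 2016-08-19
print(parse_timestamp("2016-08-19T14:30:00"))  # 2016-08-19 14:30:00
```

Trying to parse `"2016-08-19"` with `TIMESTAMP_FORMAT` raises `ValueError`, which is exactly why conflating the two options is a bug.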
[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/13796#discussion_r75428163 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala --- @@ -0,0 +1,611 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.classification + +import scala.collection.mutable + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN} +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.internal.Logging +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.linalg._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.linalg.VectorImplicits._ +import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{Dataset, Row} +import org.apache.spark.sql.functions.{col, lit} +import org.apache.spark.sql.types.DoubleType +import org.apache.spark.storage.StorageLevel + +/** + * Params for multinomial logistic (softmax) regression. + */ +private[classification] trait MultinomialLogisticRegressionParams + extends ProbabilisticClassifierParams with HasRegParam with HasElasticNetParam with HasMaxIter +with HasFitIntercept with HasTol with HasStandardization with HasWeightCol { + + /** + * Set thresholds in multiclass (or binary) classification to adjust the probability of + * predicting each class. Array must have length equal to the number of classes, with values >= 0. + * The class with largest value p/t is predicted, where p is the original probability of that + * class and t is the class' threshold. + * + * @group setParam + */ + def setThresholds(value: Array[Double]): this.type = { +set(thresholds, value) + } + + /** + * Get thresholds for binary or multiclass classification. + * + * @group getParam + */ + override def getThresholds: Array[Double] = { +$(thresholds) + } +} + +/** + * :: Experimental :: + * Multinomial Logistic (softmax) regression. 
+ */ +@Since("2.1.0") +@Experimental +class MultinomialLogisticRegression @Since("2.1.0") ( +@Since("2.1.0") override val uid: String) + extends ProbabilisticClassifier[Vector, +MultinomialLogisticRegression, MultinomialLogisticRegressionModel] +with MultinomialLogisticRegressionParams with DefaultParamsWritable with Logging { + + @Since("2.1.0") + def this() = this(Identifiable.randomUID("mlogreg")) + + /** + * Set the regularization parameter. + * Default is 0.0. + * + * @group setParam + */ + @Since("2.1.0") + def setRegParam(value: Double): this.type = set(regParam, value) + setDefault(regParam -> 0.0) + + /** + * Set the ElasticNet mixing parameter. + * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. + * For 0 < alpha < 1, the penalty is a combination of L1 and L2. + * Default is 0.0 which is an L2 penalty. + * + * @group setParam + */ + @Since("2.1.0") + def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value) + setDefault(elasticNetParam -> 0.0) + + /** + * Set the maximum number of iterations. + * Default is 100. + * + * @group setParam + */ + @Since("2.1.0") + def setMaxIter(value: Int): this.type = set(maxIter, value) + setDefault(maxIter -> 100) + + /** + * Set the convergence tolerance of iterations. + * Smaller value will lead to higher accuracy with the cost of more iterations. + * Default is
[GitHub] spark issue #14710: [SPARK-16533][CORE]
Github user petermaxlee commented on the issue: https://github.com/apache/spark/pull/14710 cc @vanzin and @kayousterhout
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75427819 --- Diff: R/pkg/R/DataFrame.R --- @@ -1719,12 +1732,13 @@ setMethod("[", signature(x = "SparkDataFrame"), #' Subset #' #' Return subsets of SparkDataFrame according to given conditions -#' @param x A SparkDataFrame -#' @param subset (Optional) A logical expression to filter on rows -#' @param select expression for the single Column or a list of columns to select from the SparkDataFrame +#' @param x a SparkDataFrame. +#' @param i,subset (Optional) a logical expression to filter on rows. +#' @param j,select expression for the single Column or a list of columns to select from the SparkDataFrame. --- End diff -- perhaps better to rename `i`->`subset` and `j`->`select`? I didn't find a reason for having `i` and `j`
[GitHub] spark issue #14222: [SPARK-16391][SQL] KeyValueGroupedDataset.reduceGroups s...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14222 Closing this now since PR #14576 is merged.
[GitHub] spark pull request #14222: [SPARK-16391][SQL] KeyValueGroupedDataset.reduceG...
Github user viirya closed the pull request at: https://github.com/apache/spark/pull/14222
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75427717 --- Diff: R/pkg/R/mllib.R --- @@ -917,14 +922,14 @@ setMethod("spark.lda", signature(data = "SparkDataFrame"), # Returns a summary of the AFT survival regression model produced by spark.survreg, # similarly to R's summary(). -#' @param object A fitted AFT survival regression model +#' @param object a fitted AFT survival regression model. #' @return \code{summary} returns a list containing the model's coefficients, #' intercept and log(scale) #' @rdname spark.survreg #' @export #' @note summary(AFTSurvivalRegressionModel) since 2.0.0 setMethod("summary", signature(object = "AFTSurvivalRegressionModel"), - function(object, ...) { + function(object) { --- End diff -- We have `...` for `summary`? That is used to match the `base::summary` signature. I am not completely sure about the exact reason, but I read from the [doc for Methods](https://stat.ethz.ch/R-manual/R-devel/library/methods/html/Methods.html) saying "By default, the signature of the generic consists of all the formal arguments except ..., in the order they appear in the function definition." Does that perhaps explain that behavior?
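The rule quoted from the Methods documentation can be sketched in a standalone R session (a toy generic, not SparkR code): because `...` is excluded from the dispatch signature, a method may legally declare fewer formals than the generic, which is why dropping `...` from the `summary` method above still dispatches correctly:

```r
# Toy example, not SparkR code: `...` in the generic's formals is excluded
# from the dispatch signature, so a method may omit it entirely.
setGeneric("describe", function(object, ...) standardGeneric("describe"))

setMethod("describe", signature(object = "numeric"),
          function(object) sprintf("numeric of length %d", length(object)))

describe(c(1, 2, 3))  # dispatches on `object` alone; `...` plays no part
```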
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75427619 --- Diff: R/pkg/R/functions.R --- @@ -1848,7 +1850,7 @@ setMethod("upper", #' @note var since 1.6.0 setMethod("var", signature(x = "Column"), - function(x) { + function(x, y, na.rm, use) { --- End diff -- please see the example for `sd`
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75427590 --- Diff: R/pkg/R/functions.R --- @@ -1335,7 +1336,7 @@ setMethod("rtrim", #' @note sd since 1.6.0 setMethod("sd", signature(x = "Column"), - function(x) { + function(x, na.rm) { --- End diff -- It seems to work: ``` > setGeneric("sd", function(x, na.rm = FALSE) { standardGeneric("sd") }) Creating a new generic function for 'sd' in the global environment [1] "sd" > setMethod("sd", signature(x = "character"), function(x) { print("blah") }) [1] "sd" > sd(1) [1] NA > sd(1:2) [1] 0.7071068 > sd("abc") [1] "blah" ```
[GitHub] spark issue #13950: [SPARK-15487] [Web UI] Spark Master UI to reverse proxy ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13950 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64031/ Test PASSed.
[GitHub] spark issue #13950: [SPARK-15487] [Web UI] Spark Master UI to reverse proxy ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13950 Merged build finished. Test PASSed.
[GitHub] spark issue #13950: [SPARK-15487] [Web UI] Spark Master UI to reverse proxy ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13950 **[Test build #64031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64031/consoleFull)** for PR 13950 at commit [`032ac0e`](https://github.com/apache/spark/commit/032ac0ecfb14fba2a0d2872b406993243d39ca8b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75427102 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1088,109 +1108,86 @@ private[spark] class BlockManager( } /** - * Replicate block to another node. Not that this is a blocking call that returns after + * Replicate block to another node. Note that this is a blocking call that returns after * the block has been replicated. + * + * @param blockId + * @param data + * @param level + * @param classTag */ private def replicate( - blockId: BlockId, - data: ChunkedByteBuffer, - level: StorageLevel, - classTag: ClassTag[_]): Unit = { +blockId: BlockId, +data: ChunkedByteBuffer, +level: StorageLevel, +classTag: ClassTag[_]): Unit = { + val maxReplicationFailures = conf.getInt("spark.storage.maxReplicationFailures", 1) -val numPeersToReplicateTo = level.replication - 1 -val peersForReplication = new ArrayBuffer[BlockManagerId] -val peersReplicatedTo = new ArrayBuffer[BlockManagerId] -val peersFailedToReplicateTo = new ArrayBuffer[BlockManagerId] val tLevel = StorageLevel( useDisk = level.useDisk, useMemory = level.useMemory, useOffHeap = level.useOffHeap, deserialized = level.deserialized, replication = 1) + +val numPeersToReplicateTo = level.replication - 1 + val startTime = System.currentTimeMillis -val random = new Random(blockId.hashCode) - -var replicationFailed = false -var failures = 0 -var done = false - -// Get cached list of peers -peersForReplication ++= getPeers(forceFetch = false) - -// Get a random peer. Note that this selection of a peer is deterministic on the block id. -// So assuming the list of peers does not change and no replication failures, -// if there are multiple attempts in the same node to replicate the same block, -// the same set of peers will be selected. 
-def getRandomPeer(): Option[BlockManagerId] = { - // If replication had failed, then force update the cached list of peers and remove the peers - // that have been already used - if (replicationFailed) { -peersForReplication.clear() -peersForReplication ++= getPeers(forceFetch = true) -peersForReplication --= peersReplicatedTo -peersForReplication --= peersFailedToReplicateTo - } - if (!peersForReplication.isEmpty) { -Some(peersForReplication(random.nextInt(peersForReplication.size))) - } else { -None - } -} -// One by one choose a random peer and try uploading the block to it -// If replication fails (e.g., target peer is down), force the list of cached peers -// to be re-fetched from driver and then pick another random peer for replication. Also -// temporarily black list the peer for which replication failed. -// -// This selection of a peer and replication is continued in a loop until one of the -// following 3 conditions is fulfilled: -// (i) specified number of peers have been replicated to -// (ii) too many failures in replicating to peers -// (iii) no peer left to replicate to -// -while (!done) { - getRandomPeer() match { -case Some(peer) => - try { -val onePeerStartTime = System.currentTimeMillis -logTrace(s"Trying to replicate $blockId of ${data.size} bytes to $peer") -blockTransferService.uploadBlockSync( - peer.host, - peer.port, - peer.executorId, - blockId, - new NettyManagedBuffer(data.toNetty), - tLevel, - classTag) -logTrace(s"Replicated $blockId of ${data.size} bytes to $peer in %s ms" - .format(System.currentTimeMillis - onePeerStartTime)) -peersReplicatedTo += peer -peersForReplication -= peer -replicationFailed = false -if (peersReplicatedTo.size == numPeersToReplicateTo) { - done = true // specified number of peers have been replicated to -} - } catch { -case e: Exception => - logWarning(s"Failed to replicate $blockId to $peer, failure #$failures", e) - failures += 1 - replicationFailed = true - peersFailedToReplicateTo += peer - if 
(failures > maxReplicationFailures) { //
[GitHub] spark issue #14710: [SPARK-16533][CORE]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14710 Can one of the admins verify this patch?
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75427019 --- Diff: R/pkg/R/DataFrame.R --- @@ -1202,6 +1215,7 @@ setMethod("toRDD", #' Groups the SparkDataFrame using the specified columns, so we can run aggregation on them. #' #' @param x a SparkDataFrame +#' @param ... variable(s) (character names(s) or Column(s)) to group on. #' @return a GroupedData --- End diff -- I basically follow the convention of the docs of many R base functions.
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426940

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -1088,109 +1108,86 @@ private[spark] class BlockManager(
   }

   /**
-   * Replicate block to another node. Not that this is a blocking call that returns after
+   * Replicate block to another node. Note that this is a blocking call that returns after
    * the block has been replicated.
+   *
+   * @param blockId
+   * @param data
+   * @param level
+   * @param classTag
    */
   private def replicate(
-      blockId: BlockId,
-      data: ChunkedByteBuffer,
-      level: StorageLevel,
-      classTag: ClassTag[_]): Unit = {
+    blockId: BlockId,
+    data: ChunkedByteBuffer,
+    level: StorageLevel,
+    classTag: ClassTag[_]): Unit = {
+
     val maxReplicationFailures = conf.getInt("spark.storage.maxReplicationFailures", 1)
-    val numPeersToReplicateTo = level.replication - 1
-    val peersForReplication = new ArrayBuffer[BlockManagerId]
-    val peersReplicatedTo = new ArrayBuffer[BlockManagerId]
-    val peersFailedToReplicateTo = new ArrayBuffer[BlockManagerId]
     val tLevel = StorageLevel(
       useDisk = level.useDisk,
       useMemory = level.useMemory,
       useOffHeap = level.useOffHeap,
       deserialized = level.deserialized,
       replication = 1)
+
+    val numPeersToReplicateTo = level.replication - 1
+
     val startTime = System.currentTimeMillis
-    val random = new Random(blockId.hashCode)
-
-    var replicationFailed = false
-    var failures = 0
-    var done = false
-
-    // Get cached list of peers
-    peersForReplication ++= getPeers(forceFetch = false)
-
-    // Get a random peer. Note that this selection of a peer is deterministic on the block id.
-    // So assuming the list of peers does not change and no replication failures,
-    // if there are multiple attempts in the same node to replicate the same block,
-    // the same set of peers will be selected.
-    def getRandomPeer(): Option[BlockManagerId] = {
-      // If replication had failed, then force update the cached list of peers and remove the peers
-      // that have been already used
-      if (replicationFailed) {
-        peersForReplication.clear()
-        peersForReplication ++= getPeers(forceFetch = true)
-        peersForReplication --= peersReplicatedTo
-        peersForReplication --= peersFailedToReplicateTo
-      }
-      if (!peersForReplication.isEmpty) {
-        Some(peersForReplication(random.nextInt(peersForReplication.size)))
-      } else {
-        None
-      }
-    }
-
-    // One by one choose a random peer and try uploading the block to it
-    // If replication fails (e.g., target peer is down), force the list of cached peers
-    // to be re-fetched from driver and then pick another random peer for replication. Also
-    // temporarily black list the peer for which replication failed.
-    //
-    // This selection of a peer and replication is continued in a loop until one of the
-    // following 3 conditions is fulfilled:
-    // (i) specified number of peers have been replicated to
-    // (ii) too many failures in replicating to peers
-    // (iii) no peer left to replicate to
-    //
-    while (!done) {
-      getRandomPeer() match {
-        case Some(peer) =>
-          try {
-            val onePeerStartTime = System.currentTimeMillis
-            logTrace(s"Trying to replicate $blockId of ${data.size} bytes to $peer")
-            blockTransferService.uploadBlockSync(
-              peer.host,
-              peer.port,
-              peer.executorId,
-              blockId,
-              new NettyManagedBuffer(data.toNetty),
-              tLevel,
-              classTag)
-            logTrace(s"Replicated $blockId of ${data.size} bytes to $peer in %s ms"
-              .format(System.currentTimeMillis - onePeerStartTime))
-            peersReplicatedTo += peer
-            peersForReplication -= peer
-            replicationFailed = false
-            if (peersReplicatedTo.size == numPeersToReplicateTo) {
-              done = true // specified number of peers have been replicated to
-            }
-          } catch {
-            case e: Exception =>
-              logWarning(s"Failed to replicate $blockId to $peer, failure #$failures", e)
-              failures += 1
-              replicationFailed = true
-              peersFailedToReplicateTo += peer
-              if (failures > maxReplicationFailures) { //
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426951

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
[quotes the same replicate() hunk shown in full above]
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75426918

--- Diff: R/pkg/R/DataFrame.R ---
@@ -1202,6 +1215,7 @@ setMethod("toRDD",
 #' Groups the SparkDataFrame using the specified columns, so we can run aggregation on them.
 #'
 #' @param x a SparkDataFrame
+#' @param ... variable(s) (character names(s) or Column(s)) to group on.
 #' @return a GroupedData
--- End diff --

Yeah, thanks for the catch. This makes perfect sense.
[GitHub] spark pull request #14710: [SPARK-16533][CORE]
GitHub user angolon opened a pull request: https://github.com/apache/spark/pull/14710 [SPARK-16533][CORE]

## What changes were proposed in this pull request?

This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:

* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside of message-handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us to queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process-local components, so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.

## How was this patch tested?

Existing tests, and manual tests under yarn-client mode.
You can merge this pull request into a Git repository by running:

$ git pull https://github.com/angolon/spark SPARK-16533

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14710.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14710

commit cef69bf470199c63b6638933756b1d057dc890d1
Author: Angus Gerry
Date: 2016-08-19T01:52:58Z
Revert "[SPARK-17022][YARN] Handle potential deadlock in driver handling messages"
This reverts commit ea0bf91b4a2ca3ef472906e50e31fd6268b6f53e.

commit 4970b3b0bcd834bbe5d5473a3065f04a48b12643
Author: Angus Gerry
Date: 2016-08-09T04:45:29Z
[SPARK-16533][CORE] Use scheduleWithFixedDelay when calling ExecutorAllocatorManager.schedule to ease contention on locks.

commit 920274a3ed0b8278d38d721587a24c9441fa5ff3
Author: Angus Gerry
Date: 2016-08-04T06:27:56Z
[SPARK-16533][CORE] Replace many calls to askWithRetry to plain old ask.
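The fixed-delay versus fixed-rate distinction is the crux of the first bullet in the PR description: with `scheduleWithFixedDelay`, the delay is measured from the end of the previous run, so an invocation slowed down by lock contention pushes the next one further out instead of letting invocations queue back-to-back. A minimal, self-contained sketch of that behavior using the plain JDK scheduler (not Spark's internals; the sleep below just stands in for a `schedule` call blocked on a contended lock):

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// With scheduleWithFixedDelay, the 100 ms delay is counted from the END of
// each run, so a slow run (200 ms here) pushes the next invocation out
// rather than letting invocations pile up back-to-back as a fixed-rate
// schedule at the same period would.
val starts = ArrayBuffer.empty[Long]
val done = new CountDownLatch(3)
val executor = Executors.newSingleThreadScheduledExecutor()

executor.scheduleWithFixedDelay(new Runnable {
  def run(): Unit = {
    starts.synchronized { starts += System.nanoTime() }
    Thread.sleep(200) // stand-in for a schedule() call stuck on a contended lock
    done.countDown()
  }
}, 0, 100, TimeUnit.MILLISECONDS)

done.await()
executor.shutdownNow()

// Each gap between consecutive starts is roughly run time + delay (~300 ms),
// and never less than the configured delay.
val gapsMs = starts.synchronized {
  starts.sliding(2).map(w => (w(1) - w(0)) / 1000000L).toList
}
```

Under a fixed-rate schedule the same task would instead fire as fast as the single thread allows, which is exactly the release-and-immediately-reacquire pattern the PR is trying to avoid.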
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426791

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
[quotes the same replicate() hunk shown in full above]
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426781

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
[quotes the same replicate() hunk shown in full above]
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426753

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
[quotes the same replicate() hunk shown in full above]
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426714

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
[quotes the same replicate() hunk shown in full above]
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14426 Merged build finished. Test FAILed.
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14426 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64042/ Test FAILed.
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14426 **[Test build #64042 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64042/consoleFull)** for PR 14426 at commit [`d722be2`](https://github.com/apache/spark/commit/d722be2c1660b86eb1cc23cfa1dad33095c839b7).

* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode`
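The new public class reported by the build bot is a logical-plan node that carries the hint name and parameters above its child. A self-contained sketch of the shape of such a node follows; the `LogicalPlan`, `UnaryNode`, and `Relation` stand-ins below are simplified placeholders for illustration, not Catalyst's real classes:

```scala
// Simplified stand-ins for Catalyst's plan hierarchy -- the real LogicalPlan
// and UnaryNode in org.apache.spark.sql.catalyst carry far more machinery
// (expressions, resolution state, tree transformations).
trait LogicalPlan { def children: Seq[LogicalPlan] }
trait UnaryNode extends LogicalPlan {
  def child: LogicalPlan
  def children: Seq[LogicalPlan] = Seq(child)
}
case class Relation(name: String) extends LogicalPlan {
  def children: Seq[LogicalPlan] = Nil
}

// The hint node wraps its child without changing query semantics; a planner
// rule can later pattern-match on the name (e.g. "BROADCAST"), act on the
// wrapped subtree, and strip the node from the plan.
case class Hint(name: String, parameters: Seq[String], child: LogicalPlan)
  extends UnaryNode

val hinted = Hint("BROADCAST", Seq("small_table"), Relation("small_table"))
```

Because the hint is a plain unary node, everything below it stays intact, which is what lets it survive analysis until an optimizer rule consumes it.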
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426587

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockReplicationPrioritization.scala ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.storage
+
+import scala.util.Random
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.internal.Logging
+
+/**
+ * ::DeveloperApi::
+ * BlockReplicationPrioritization provides logic for prioritizing a sequence of peers for
+ * replicating blocks. BlockManager will replicate to each peer returned in order until the
+ * desired replication order is reached. If a replication fails, prioritize() will be called
+ * again to get a fresh prioritization.
+ */
+@DeveloperApi
+trait BlockReplicationPrioritization {
--- End diff --

can we just name this BlockReplicationPolicy?
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426605 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -20,6 +20,7 @@ package org.apache.spark.storage import java.io._ import java.nio.ByteBuffer +import scala.annotation.tailrec --- End diff -- is this used anywhere?
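For context on the question above: importing `scala.annotation.tailrec` is only justified if some recursive method in the file carries the annotation. A minimal, self-contained sketch of what the annotation does (names here are illustrative, not from the PR):

```scala
import scala.annotation.tailrec

object TailrecDemo {
  // Sum 1..n with an accumulator. @tailrec makes the compiler reject this
  // definition if the recursive call were ever not in tail position, so the
  // loop is guaranteed to compile to constant stack space.
  @tailrec
  def sumTo(n: Int, acc: Long = 0L): Long =
    if (n <= 0) acc else sumTo(n - 1, acc + n)
}
```

If nothing in BlockManager.scala uses `@tailrec`, the import is dead and should be dropped, which is presumably the point of the review question.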
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426552 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1088,109 +1108,86 @@ private[spark] class BlockManager( } /** - * Replicate block to another node. Not that this is a blocking call that returns after + * Replicate block to another node. Note that this is a blocking call that returns after * the block has been replicated. + * + * @param blockId + * @param data + * @param level + * @param classTag */ private def replicate( - blockId: BlockId, - data: ChunkedByteBuffer, - level: StorageLevel, - classTag: ClassTag[_]): Unit = { +blockId: BlockId, --- End diff -- reset the change here - use 4 space indent
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426567 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1088,109 +1108,86 @@ private[spark] class BlockManager( } /** - * Replicate block to another node. Not that this is a blocking call that returns after + * Replicate block to another node. Note that this is a blocking call that returns after * the block has been replicated. + * + * @param blockId --- End diff -- remove these params unless you really are going to document them.
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426531 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala --- @@ -69,24 +72,37 @@ class BlockManagerId private ( out.writeUTF(executorId_) out.writeUTF(host_) out.writeInt(port_) +out.writeBoolean(topologyInfo_.isDefined) +// we only write topologyInfo if we have it +topologyInfo.foreach(out.writeUTF(_)) } override def readExternal(in: ObjectInput): Unit = Utils.tryOrIOException { executorId_ = in.readUTF() host_ = in.readUTF() port_ = in.readInt() +val isTopologyInfoAvailable = in.readBoolean() +topologyInfo_ = if (isTopologyInfoAvailable) { --- End diff -- it might be more clear to do ``` if (isTopologyInfoAvailable) { topologyInfo_ = Option(in.readUTF()) } else { topologyInfo_ = None } ``` or ``` topologyInfo_ = if (isTopologyInfoAvailable) Option(in.readUTF()) else None ```
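Both variants rxin suggests encode the same presence-flag-then-value wire layout for the optional field. A minimal round-trip sketch of that layout, using plain `java.io` data streams rather than the patch's `ObjectOutput`/`ObjectInput` (names and the helper object are illustrative, not from the PR):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object OptionalFieldCodec {
  // Write a boolean presence flag, then the UTF value only when it is
  // defined -- the same layout the patch uses for topologyInfo.
  def write(out: DataOutputStream, field: Option[String]): Unit = {
    out.writeBoolean(field.isDefined)
    field.foreach(out.writeUTF)
  }

  // Mirror image: read the flag first, then conditionally read the value.
  def read(in: DataInputStream): Option[String] =
    if (in.readBoolean()) Option(in.readUTF()) else None

  // Serialize then deserialize, to check the two sides agree.
  def roundTrip(field: Option[String]): Option[String] = {
    val buf = new ByteArrayOutputStream()
    val out = new DataOutputStream(buf)
    write(out, field)
    out.flush()
    read(new DataInputStream(new ByteArrayInputStream(buf.toByteArray)))
  }
}
```

The key property is that reader and writer must agree on the flag-before-value ordering; getting that wrong only fails at deserialization time, which is why a round-trip test is worth having.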
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426476 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala --- @@ -101,10 +117,18 @@ private[spark] object BlockManagerId { * @param execId ID of the executor. * @param host Host name of the block manager. * @param port Port of the block manager. + * @param topologyInfo topology information for the blockmanager, if available * This can be network topology information for use while choosing peers * while replicating data blocks. More information available here: * [[org.apache.spark.storage.TopologyMapper]] * @return A new [[org.apache.spark.storage.BlockManagerId]]. */ - def apply(execId: String, host: String, port: Int): BlockManagerId = -getCachedBlockManagerId(new BlockManagerId(execId, host, port)) + def apply( +execId: String, --- End diff -- 4 space indent here too
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426446 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala --- @@ -298,7 +310,17 @@ class BlockManagerMasterEndpoint( ).map(_.flatten.toSeq) } - private def register(id: BlockManagerId, maxMemSize: Long, slaveEndpoint: RpcEndpointRef) { + private def register(dummyId: BlockManagerId, +maxMemSize: Long, --- End diff -- Can you also add a method doc saying this returns the same id with topology information attached?
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426435 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala --- @@ -298,7 +310,17 @@ class BlockManagerMasterEndpoint( ).map(_.flatten.toSeq) } - private def register(id: BlockManagerId, maxMemSize: Long, slaveEndpoint: RpcEndpointRef) { + private def register(dummyId: BlockManagerId, +maxMemSize: Long, --- End diff -- also instead of dummyId, I'd call it "idWithoutTopologyInfo"
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14426 **[Test build #64042 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64042/consoleFull)** for PR 14426 at commit [`d722be2`](https://github.com/apache/spark/commit/d722be2c1660b86eb1cc23cfa1dad33095c839b7).
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426412 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala --- @@ -298,7 +310,17 @@ class BlockManagerMasterEndpoint( ).map(_.flatten.toSeq) } - private def register(id: BlockManagerId, maxMemSize: Long, slaveEndpoint: RpcEndpointRef) { + private def register(dummyId: BlockManagerId, +maxMemSize: Long, --- End diff -- 4 space indent, and put all the arguments on its own line, e.g. ``` private def register( dummyId: BlockManagerId, maxMemSize: Long, slaveEndpoint: RpcEndpointRef): BlockManagerId = { ... } ```
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426383 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala --- @@ -50,12 +50,20 @@ class BlockManagerMaster( logInfo("Removal of executor " + execId + " requested") } - /** Register the BlockManager's id with the driver. */ + /** + * Register the BlockManager's id with the driver. The input BlockManagerId does not contain + * topology information. This information is obtained from the master and we respond with an + * updated BlockManagerId fleshed out with this information. + */ def registerBlockManager( - blockManagerId: BlockManagerId, maxMemSize: Long, slaveEndpoint: RpcEndpointRef): Unit = { +blockManagerId: BlockManagerId, --- End diff -- indent 4 spaces
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426300 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -160,8 +163,25 @@ private[spark] class BlockManager( blockTransferService.init(this) shuffleClient.init(appId) -blockManagerId = BlockManagerId( - executorId, blockTransferService.hostName, blockTransferService.port) +blockReplicationPrioritizer = { + val priorityClass = conf.get( +"spark.replication.topologyawareness.prioritizer", +"org.apache.spark.storage.DefaultBlockReplicationPrioritization") + val clazz = Utils.classForName(priorityClass) + val ret = clazz.newInstance.asInstanceOf[BlockReplicationPrioritization] + logInfo(s"Using $priorityClass for prioritizing peers") --- End diff -- ``` Using $priorityClass for block replication policy ```
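The snippet under review instantiates the replication policy reflectively from a configuration key, falling back to a default class name. A hedged, standalone sketch of that pattern, with an illustrative stand-in trait instead of the PR's `BlockReplicationPrioritization` and plain JDK reflection instead of Spark's `Utils.classForName` (all names below are assumptions for the example):

```scala
// Stand-in for the pluggable trait the configuration key selects.
trait ReplicationPolicy { def name: String }

// Stand-in default implementation, selected when no class is configured.
class RandomReplicationPolicy extends ReplicationPolicy {
  def name: String = "random"
}

object PolicyLoader {
  // Resolve the configured class name, instantiate it via its no-arg
  // constructor, and cast to the expected trait -- the same shape as
  // Utils.classForName(priorityClass).newInstance in the patch.
  def load(className: String): ReplicationPolicy =
    Class.forName(className)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[ReplicationPolicy]
}
```

The cast fails at runtime with a `ClassCastException` if the configured class does not implement the trait, which is the usual trade-off of this plugin-loading style: misconfiguration surfaces at startup rather than at compile time.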
[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14639 Merged build finished. Test FAILed.
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426231 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockReplicationPrioritization.scala --- @@ -0,0 +1,80 @@ +package org.apache.spark.storage + +import scala.util.Random + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.internal.Logging + +/** + * ::DeveloperApi:: + * BlockReplicationPrioritization provides logic for prioritizing a sequence of peers for + * replicating blocks. BlockManager will replicate to each peer returned in order until the + * desired replication order is reached. If a replication fails, prioritize() will be called + * again to get a fresh prioritization. + */ +@DeveloperApi +trait BlockReplicationPrioritization { + + /** + * Method to prioritize a bunch of candidate peers of a block + * + * @param blockManagerId Id of the current BlockManager for self identification + * @param peers A list of peers of a BlockManager + * @param peersReplicatedTo Set of peers already replicated to + * @param blockId BlockId of the block being replicated. This can be used as a source of + *randomness if needed. + * @return A prioritized list of peers. Lower the index of a peer, higher its priority + */ + def prioritize( +blockManagerId: BlockManagerId, +peers: Seq[BlockManagerId], --- End diff -- also rather than a full prioritization, can we also pass in a number of replicas wanted and just return a list there?
[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14116 **[Test build #64041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64041/consoleFull)** for PR 14116 at commit [`bd85aa5`](https://github.com/apache/spark/commit/bd85aa545e1fcb7ff10c981ef940291092cfef80).
[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14639 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64039/ Test FAILed.
[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14639 **[Test build #64039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64039/consoleFull)** for PR 14639 at commit [`31ada09`](https://github.com/apache/spark/commit/31ada09b55ca34a4a6fa150037025afb831df69d). * This patch **fails R style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426199 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockReplicationPrioritization.scala --- @@ -0,0 +1,80 @@ +trait BlockReplicationPrioritization { + + def prioritize( +blockManagerId: BlockManagerId, +peers: Seq[BlockManagerId], --- End diff -- is passing in all the peers a performance concern?
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426189 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockReplicationPrioritization.scala --- @@ -0,0 +1,80 @@ + def prioritize( +blockManagerId: BlockManagerId, +peers: Seq[BlockManagerId], +peersReplicatedTo: Set[BlockManagerId], +blockId: BlockId): Seq[BlockManagerId] +} + +@DeveloperApi +class DefaultBlockReplicationPrioritization --- End diff -- instead of Default, I'd call this RandomBlockReplicationPrioritization to better reflect what it does.
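Putting the review suggestions in this thread together (random shuffling, block-id-seeded determinism so a retry reproduces the same ordering, and returning only as many peers as replicas are wanted), a sketch of what such a policy could look like. Plain strings stand in for `BlockManagerId`, and `numReplicasWanted` is the hypothetical extra parameter rxin proposes, not part of the PR's current signature:

```scala
import scala.util.Random

object RandomPrioritization {
  // Drop peers we already replicated to, shuffle the rest with randomness
  // seeded by the block id's hash (so repeated calls for the same block
  // agree), and keep only as many peers as replicas are still wanted.
  def prioritize(
      peers: Seq[String],
      peersReplicatedTo: Set[String],
      blockIdHash: Int,
      numReplicasWanted: Int): Seq[String] = {
    val random = new Random(blockIdHash)
    random.shuffle(peers.filterNot(peersReplicatedTo)).take(numReplicasWanted)
  }
}
```

Returning a truncated list also answers the performance question raised above: the caller never materializes a full prioritization of the whole cluster when it only needs a couple of targets.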
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426147 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockReplicationPrioritization.scala --- @@ -0,0 +1,80 @@ +@DeveloperApi +class DefaultBlockReplicationPrioritization + extends BlockReplicationPrioritization + with Logging { + + /** + * Method to prioritize a bunch of candidate peers of a block. This is a basic implementation, + * that just makes sure we put blocks on different hosts, if possible + * + * @param blockManagerId Id of the current BlockManager for self identification + * @param peers A list of peers of a BlockManager + * @param peersReplicatedTo Set of peers already replicated to + * @param blockId BlockId of the block being replicated. This can be used as a source of + *randomness if needed. + * @return A prioritized list of peers. Lower the index of a peer, higher its priority + */ + override def prioritize( +blockManagerId: BlockManagerId, --- End diff -- so the Spark style for indentation is to have 4 spaces for function arguments, i.e. ```scala override def prioritize( blockManagerId: BlockManagerId, peers: Seq[BlockManagerId], peersReplicatedTo: Set[BlockManagerId], blockId: BlockId): Seq[BlockManagerId] = { val random = new Random(blockId.hashCode) ... } ```
[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13152#discussion_r75426070 --- Diff: core/src/main/scala/org/apache/spark/storage/TopologyMapper.scala --- @@ -0,0 +1,81 @@ +package org.apache.spark.storage + +import java.io.{File, FileInputStream} +import java.util.Properties + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.internal.Logging +import org.apache.spark.SparkConf +import org.apache.spark.util.Utils + +/** + * ::DeveloperApi:: + * TopologyMapper provides topology information for a given host + * @param conf SparkConf to get required properties, if needed + */ +@DeveloperApi +abstract class TopologyMapper(conf: SparkConf) { + /** + * Gets the topology information given the host name + * + * @param hostname Hostname + * @return topology information for the given hostname. One can use a 'topology delimiter' + * to make this topology information nested. + * For example : '/myrack/myhost', where '/' is the topology delimiter, + * 'myrack' is the topology identifier, and 'myhost' is the individual host. + * This function only returns the topology information without the hostname. --- End diff -- can you document what an empty string means?
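One plausible answer to the documentation question above is to define the empty string as "no topology information available for this host". A hedged sketch of a `TopologyMapper`-style lookup backed by a `java.util.Properties` table; the class name is illustrative, and the PR's actual file-loading and `SparkConf` wiring are omitted:

```scala
import java.util.Properties

// Host-to-rack pairs come from a Properties table (e.g. "myhost=/myrack").
// An unknown host maps to the empty string, which callers should treat as
// "no topology information" rather than as a real rack.
class PropertiesTopologyMapper(props: Properties) {
  def getTopologyForHost(hostname: String): String =
    Option(props.getProperty(hostname)).getOrElse("")
}
```

Encoding the "unknown" case explicitly in the contract, as the reviewer asks, matters because a replication policy comparing topology strings would otherwise happily group all unknown hosts into one phantom rack.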
[GitHub] spark pull request #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/14118#discussion_r75426062 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -370,7 +370,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * from values being read should be skipped. * `ignoreTrailingWhiteSpace` (default `false`): defines whether or not trailing * whitespaces from values being read should be skipped. - * `nullValue` (default empty string): sets the string representation of a null value. + * `nullValue` (default empty string): sets the string representation of a null value. Since --- End diff -- Oh thanks! Indeed there are two occurrences (one in `readwriter.py`, one in `streaming.py`) that need fixing.
[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14639 **[Test build #64039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64039/consoleFull)** for PR 14639 at commit [`31ada09`](https://github.com/apache/spark/commit/31ada09b55ca34a4a6fa150037025afb831df69d).
[GitHub] spark issue #14680: [SPARK-17101][SQL] Provide consistent format identifiers...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14680 @jaceklaskowski It seems the test [here](https://github.com/apache/spark/blob/e50efd53f073890d789a8448f850cc219cca7708/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala#L715-L724) is related to this change; it will pass if we change `TextFileFormat` to `TEXT`. BTW, how about changing them to `Parquet` and `Text` instead? This might be a matter of personal taste, but I feel `shortName.toUpperCase` is not always the right string representation of a data source. If my understanding is correct, the proper name is `Parquet` rather than `PARQUET`, at least. `ORC`, `JSON` and `CSV` are correct names because they are abbreviations, but that is questionable for `PARQUET` and `TEXT`. If the purpose of this change is only to present plan information to humans via `explain(...)`, it might be better to use readable, correct names as the string representation. This is just my personal opinion; I think we need @rxin's sign-off here.
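The naming question above (blanket `shortName.toUpperCase` versus hand-curated display names) can be sketched in a few lines. This is a hypothetical illustration in Python, not Spark code; the mapping and function name are invented for the example:

```python
# Hypothetical sketch: derive a display name for a data source from its
# short name. upper() mirrors the PR's `shortName.toUpperCase`; the dict
# overrides it with curated names for formats that are not abbreviations.
PRETTY_NAMES = {"parquet": "Parquet", "text": "Text"}

def format_identifier(short_name):
    # Abbreviated formats (ORC, JSON, CSV) read naturally upper-cased;
    # non-abbreviations may read better with a curated casing.
    return PRETTY_NAMES.get(short_name, short_name.upper())

assert format_identifier("csv") == "CSV"        # abbreviation: upper-case
assert format_identifier("parquet") == "Parquet"  # curated name
```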
[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14118 **[Test build #64040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64040/consoleFull)** for PR 14118 at commit [`74b4dd8`](https://github.com/apache/spark/commit/74b4dd8ff2f79faaf9df50c5a54e6298234137e7).
[GitHub] spark issue #13152: [SPARK-15353] [CORE] Making peer selection for block rep...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13152 **[Test build #3225 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3225/consoleFull)** for PR 13152 at commit [`9b8ce32`](https://github.com/apache/spark/commit/9b8ce3229d0cff64e77d55563cec3cc3cda29182).
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75425834

--- Diff: R/pkg/R/functions.R ---
@@ -362,8 +357,8 @@ setMethod("cov", signature(x = "characterOrColumn"),
 #' @rdname cov
 #'
-#' @param col1 First column to compute cov_samp.
-#' @param col2 Second column to compute cov_samp.
+#' @param col1 the first Column object.
+#' @param col2 the second Column object.

--- End diff --

I'd say that applies to a couple of other cases in WindowSpec.R or column.R too, but I'm ok either way.
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75425765

--- Diff: R/pkg/R/functions.R ---
@@ -362,8 +357,8 @@ setMethod("cov", signature(x = "characterOrColumn"),
 #' @rdname cov
 #'
-#' @param col1 First column to compute cov_samp.
-#' @param col2 Second column to compute cov_samp.
+#' @param col1 the first Column object.
+#' @param col2 the second Column object.

--- End diff --

I'd just say "the first Column", "the second Column" (without "object") as in other places
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75425753

--- Diff: R/pkg/R/functions.R ---
@@ -319,7 +316,7 @@ setMethod("column",
 #'
 #' Computes the Pearson Correlation Coefficient for two Columns.
 #'
-#' @param x Column to compute on.
+#' @param col2 a (second) Column object.

--- End diff --

I'd just say "a (second) Column" (without "object") as in other places
[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14705#discussion_r75425616

--- Diff: R/pkg/R/functions.R ---
@@ -1273,12 +1271,15 @@ setMethod("round",
 #' bround
 #'
 #' Returns the value of the column `e` rounded to `scale` decimal places using HALF_EVEN rounding
-#' mode if `scale` >= 0 or at integral part when `scale` < 0.
+#' mode if `scale` >= 0 or at integer part when `scale` < 0.
 #' Also known as Gaussian rounding or bankers' rounding that rounds to the nearest even number.
 #' bround(2.5, 0) = 2, bround(3.5, 0) = 4.
 #'
 #' @param x Column to compute on.
-#'
+#' @param scale round to \code{scale} digits to the right of the decimal point when \code{scale} > 0,
+#'        the nearest even number when \code{scale} = 0, and `scale` digits to the left

--- End diff --

do you want `\code{scale}` in place of "`"?
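The HALF_EVEN ("bankers'") semantics described in the `bround` docs above can be checked against Python's built-in `round`, which uses the same rounding mode; this is an illustration of the rule, not SparkR code:

```python
# Python's round() also uses HALF_EVEN (bankers') rounding, so it can
# illustrate the bround semantics above: ties go to the nearest even
# number, and a negative scale rounds at the integer part.
assert round(2.5) == 2      # tie -> nearest even (matches bround(2.5, 0) = 2)
assert round(3.5) == 4      # tie -> nearest even (matches bround(3.5, 0) = 4)
assert round(25, -1) == 20  # scale < 0: round at tens place, tie -> even
assert round(35, -1) == 40
```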
[GitHub] spark issue #14709: [SPARK-17150][SQL] Support SQL generation for inline tab...
Github user petermaxlee commented on the issue: https://github.com/apache/spark/pull/14709 cc @cloud-fan and @hvanhovell