[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...

2016-08-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14702
  
Can you update the description to say more about what this PR includes, and 
what the future TODOs are?






[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...

2016-08-18 Thread angolon
Github user angolon commented on the issue:

https://github.com/apache/spark/pull/14710
  
Done, sorry!





[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...

2016-08-18 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14705
  
Sure - we should have said `lint-r` is the baseline. There's definitely 
more we could add, though. It would be great if we had the bandwidth to write more 
[linters](https://github.com/jimhester/lintr/blob/master/vignettes/creating_linters.Rmd)
 at some point.





[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark

2016-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/8880
  
Merged build finished. Test PASSed.





[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark

2016-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/8880
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64035/
Test PASSed.





[GitHub] spark issue #14709: [SPARK-17150][SQL] Support SQL generation for inline tab...

2016-08-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14709
  
I suspect array and struct literals will fail, looking at what Literal.sql 
does. That said, it's an existing problem and we can fix that later.






[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/8880
  
**[Test build #64035 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64035/consoleFull)**
 for PR 8880 at commit 
[`338210c`](https://github.com/apache/spark/commit/338210c21c1f9043bd58ee1bc4f84b32d4f65e7c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14700: [SPARK-17127]Make unaligned access in unsafe available f...

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14700
  
**[Test build #3226 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3226/consoleFull)**
 for PR 14700 at commit 
[`24bcf05`](https://github.com/apache/spark/commit/24bcf057311a387460af2f2fcb110a434aa53d9a).





[GitHub] spark issue #14711: [SPARK-16822] [DOC] [Support latex in scaladoc with Math...

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14711
  
**[Test build #64044 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64044/consoleFull)**
 for PR 14711 at commit 
[`7cacb11`](https://github.com/apache/spark/commit/7cacb111390b2ca9531053444c631f914b864bc4).





[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...

2016-08-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14118
  
Also LGTM other than that major question.






[GitHub] spark issue #14699: [SPARK-17125][SPARKR] Allow to specify spark config usin...

2016-08-18 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14699
  
It's hard to say. Right now it is being converted on the [JVM 
side](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L63),
 so it is possible to have `1` -> `1.0` -> `"1.0"`.
Also, `convertNamedListToEnv` is being used in several other cases that seem to 
expect a numeric type - could you check that?
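
For illustration, a minimal sketch of the coercion concern (the option names below are examples, not from this PR) and one way a caller could sidestep it by coercing values to character on the R side:

```r
# Because R numerics are doubles, a numeric option sent to the JVM and
# stringified there can come back as "1.0" instead of "1". Coercing on the
# R side keeps the user's literal form.
sparkConfig <- list(spark.executor.memory = "2g", spark.sql.shuffle.partitions = 1)
sparkConfig <- lapply(sparkConfig, as.character)
str(sparkConfig)
# List of 2
#  $ spark.executor.memory       : chr "2g"
#  $ spark.sql.shuffle.partitions: chr "1"
```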






[GitHub] spark pull request #14711: [SPARK-16822] [DOC] [Support latex in scaladoc wi...

2016-08-18 Thread jagadeesanas2
GitHub user jagadeesanas2 opened a pull request:

https://github.com/apache/spark/pull/14711

[SPARK-16822] [DOC] [Support latex in scaladoc with MathJax]

## What changes were proposed in this pull request?

LaTeX is rendered as plain code in `LinearRegression.scala`:
```scala
{{{
 L = 1/2n||\sum_i w_i(x_i - \bar{x_i}) / \hat{x_i} - (y - \bar{y}) / 
\hat{y}||^2 + regTerms
}}}
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ibmsoe/spark SPARK-16822

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14711.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14711


commit 7cacb111390b2ca9531053444c631f914b864bc4
Author: Jagadeesan 
Date:   2016-08-19T05:45:32Z

[SPARK-16822] [DOC] [Support latex in scaladoc with MathJax]







[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75430253
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -727,6 +730,7 @@ dropTempView <- function(viewName) {
 #' @param source The name of external data source
 #' @param schema The data schema defined in structType
 #' @param na.strings Default string value for NA when source is "csv"
+#' @param ... additional external data source specific named propertie(s).
--- End diff --

ah sorry, I've no idea why this happened - pretty sure I've already 
corrected that.





[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...

2016-08-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14118
  
With this change, do all empty (e.g. zero-sized string) values become null 
values once they are read back?
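
For illustration, a hedged SparkR sketch of the round trip being asked about (the path and values are made up):

```r
# Does a zero-length string survive as "" after a CSV write/read, or does it
# come back as NA (null) under the proposed change?
df <- createDataFrame(data.frame(name = c("a", ""), stringsAsFactors = FALSE))
write.df(df, path = "/tmp/csv-empty-check", source = "csv", mode = "overwrite")
collect(read.df("/tmp/csv-empty-check", source = "csv"))
# If "" is written out as the configured nullValue, the second row reads back as NA.
```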






[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75430152
  
--- Diff: R/pkg/R/functions.R ---
@@ -1848,7 +1850,7 @@ setMethod("upper",
 #' @note var since 1.6.0
 setMethod("var",
   signature(x = "Column"),
-  function(x) {
+  function(x, y, na.rm, use) {
--- End diff --

This is done only for the purpose of documenting `y, na.rm, use`.





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75430048
  
--- Diff: R/pkg/R/functions.R ---
@@ -3115,6 +3166,11 @@ setMethod("dense_rank",
 #'
 #' This is equivalent to the LAG function in SQL.
 #'
+#' @param x the column as a character string or a Column to compute on.
+#' @param offset the number of rows back from the current row from which 
to obtain a value.
+#'   If not specified, the default is 1.
+#' @param defaultValue default to use when the offset row does not exist.
+#' @param ... further arguments to be passed to or from other methods.
--- End diff --

Since we can only give one description, I guess "further arguments" might 
sound more reasonable?





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r7543
  
--- Diff: R/pkg/R/functions.R ---
@@ -3115,6 +3166,11 @@ setMethod("dense_rank",
 #'
 #' This is equivalent to the LAG function in SQL.
 #'
+#' @param x the column as a character string or a Column to compute on.
+#' @param offset the number of rows back from the current row from which 
to obtain a value.
+#'   If not specified, the default is 1.
+#' @param defaultValue default to use when the offset row does not exist.
+#' @param ... further arguments to be passed to or from other methods.
--- End diff --

This is actually the issue I raised at the end of #14558. The same doc 
also includes the generic definition, which also comes with `...`, and there it has 
the actual meaning of "further arguments to be passed". 





[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...

2016-08-18 Thread sun-rui
Github user sun-rui commented on the issue:

https://github.com/apache/spark/pull/14639
  
If SparkConf is needed in the future, instead of passing all Spark conf entries to 
R via environment variables, we can expose an API for accessing SparkConf in the R 
backend, similar to the one in PySpark: 
https://github.com/apache/spark/blob/master/python/pyspark/conf.py
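
A rough sketch of what such an accessor could look like from the R side; every name below is an assumption for illustration, not existing SparkR API:

```r
# Hypothetical accessor: read a SparkConf entry through the R backend instead
# of exporting every setting as an environment variable.
sparkR.conf.get <- function(key, defaultValue = "") {
  jsc  <- get(".sparkRjsc", envir = SparkR:::.sparkREnv)  # assumed internal JVM handle
  conf <- SparkR:::callJMethod(jsc, "getConf")            # JavaSparkContext.getConf()
  SparkR:::callJMethod(conf, "get", key, defaultValue)    # SparkConf.get(key, default)
}

sparkR.conf.get("spark.master", "local[*]")
```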





[GitHub] spark issue #14710: [SPARK-16533][CORE]

2016-08-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14710
  
Can you put a more descriptive title for the change?






[GitHub] spark issue #14708: [SPARK-17149][SQL] array.sql for testing array related f...

2016-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14708
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64034/
Test PASSed.





[GitHub] spark issue #14708: [SPARK-17149][SQL] array.sql for testing array related f...

2016-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14708
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...

2016-08-18 Thread keypointt
Github user keypointt commented on the issue:

https://github.com/apache/spark/pull/14447
  
@felixcheung sure, no problem





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75429772
  
--- Diff: R/pkg/R/mllib.R ---
@@ -620,11 +625,12 @@ setMethod("predict", signature(object = 
"KMeansModel"),
 #' predictions on new data, and \code{write.ml}/\code{read.ml} to 
save/load fitted models.
 #' Only categorical data is supported.
 #'
-#' @param data A \code{SparkDataFrame} of observations and labels for 
model fitting
-#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#' @param data a \code{SparkDataFrame} of observations and labels for 
model fitting.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
 #'   operators are supported, including '~', '.', ':', '+', 
and '-'.
-#' @param smoothing Smoothing parameter
-#' @return \code{spark.naiveBayes} returns a fitted naive Bayes model
+#' @param ... additional argument(s) passed to the method. Currently only 
\code{smoothing}.
--- End diff --

Thanks. I guess this might be caused by the merging...





[GitHub] spark issue #14708: [SPARK-17149][SQL] array.sql for testing array related f...

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14708
  
**[Test build #64034 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64034/consoleFull)**
 for PR 14708 at commit 
[`1e89cc3`](https://github.com/apache/spark/commit/1e89cc3c35a22b7d42fe8f9ed23f16b66e92fa20).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...

2016-08-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14705
  
@felixcheung Thanks for the kind explanation. BTW, it would also be great if the wiki 
had a sentence, for example, `"For R code, Apache Spark follows lint-r"`, just like 
Python has `"For Python code, Apache Spark follows PEP 8 with one exception: lines can 
be up to 100 characters in length, not 79."` - for correctness and as a reference for 
new contributors, if that makes sense :).





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75429664
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3187,6 +3221,7 @@ setMethod("histogram",
 #' @param x A SparkDataFrame
 #' @param url JDBC database url of the form `jdbc:subprotocol:subname`
 #' @param tableName The name of the table in the external database
+#' @param ... additional JDBC database connection properties.
--- End diff --

Done. Thanks!





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75429658
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3003,9 +3036,10 @@ setMethod("str",
 #' Returns a new SparkDataFrame with columns dropped.
 #' This is a no-op if schema doesn't contain column name(s).
 #'
-#' @param x A SparkDataFrame.
-#' @param cols A character vector of column names or a Column.
-#' @return A SparkDataFrame
+#' @param x a SparkDataFrame.
+#' @param ... further arguments to be passed to or from other methods.
--- End diff --

Done.





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75429536
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2464,8 +2489,10 @@ setMethod("unionAll",
 #' Union two or more SparkDataFrames. This is equivalent to `UNION ALL` in 
SQL.
 #' Note that this does not remove duplicate rows across the two 
SparkDataFrames.
 #'
-#' @param x A SparkDataFrame
-#' @param ... Additional SparkDataFrame
+#' @param x a SparkDataFrame.
+#' @param ... additional SparkDataFrame(s).
+#' @param deparse.level currently not used (put here to match the 
signature of
--- End diff --

Here is in accordance with the order of arguments in function declaration.





[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...

2016-08-18 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/14639
  
Thanks @sun-rui, `EXISTING_SPARKR_BACKEND_PORT` does indicate cluster mode 
indirectly for now. But it is not only deployMode that is unknown on the R side; master 
and the other Spark configurations are too. For now R only uses master & deployMode, 
but in the future it may use other configurations, so I think propagating SparkConf 
from the JVM to R is a better long-term solution. 





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75429434
  
--- Diff: R/pkg/R/generics.R ---
@@ -735,6 +752,8 @@ setGeneric("between", function(x, bounds) { 
standardGeneric("between") })
 setGeneric("cast", function(x, dataType) { standardGeneric("cast") })
 
 #' @rdname columnfunctions
+#' @param x a Column object.
+#' @param ... additional argument(s).
--- End diff --

It seems the functions are defined in the following way:

```r
column_functions2 <- c("like", "rlike", "getField", "getItem", "contains")

createColumnFunction2 <- function(name) {
  setMethod(name,
            signature(x = "Column"),
            function(x, data) {
              if (class(data) == "Column") {
                data <- data@jc
              }
              jc <- callJMethod(x@jc, name, data)
              column(jc)
            })
}
```

It seems the functions are not exported. Perhaps we should still leave it there?





[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...

2016-08-18 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14447
  
We are looking at establishing some guidelines in PR 14705. Let's hold on 
for another day or two.






[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14639
  
**[Test build #64043 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64043/consoleFull)**
 for PR 14639 at commit 
[`fef88cd`](https://github.com/apache/spark/commit/fef88cd5e62c65e7b5606f7baf00ecc4290ba12d).





[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...

2016-08-18 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14705
  
looking good - looks like we are very close.





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75429158
  
--- Diff: R/pkg/R/mllib.R ---
@@ -917,14 +922,14 @@ setMethod("spark.lda", signature(data = 
"SparkDataFrame"),
 # Returns a summary of the AFT survival regression model produced by 
spark.survreg,
 # similarly to R's summary().
 
-#' @param object A fitted AFT survival regression model
+#' @param object a fitted AFT survival regression model.
 #' @return \code{summary} returns a list containing the model's 
coefficients,
 #' intercept and log(scale)
 #' @rdname spark.survreg
 #' @export
 #' @note summary(AFTSurvivalRegressionModel) since 2.0.0
 setMethod("summary", signature(object = "AFTSurvivalRegressionModel"),
-  function(object, ...) {
+  function(object) {
--- End diff --

probably I was more wondering why CRAN check didn't flag this...





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75429017
  
--- Diff: R/pkg/R/mllib.R ---
@@ -504,14 +504,15 @@ setMethod("summary", signature(object = 
"IsotonicRegressionModel"),
 #' Users can call \code{summary} to print a summary of the fitted model, 
\code{predict} to make
 #' predictions on new data, and \code{write.ml}/\code{read.ml} to 
save/load fitted models.
 #'
-#' @param data SparkDataFrame for training
-#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
 #'operators are supported, including '~', '.', ':', '+', 
and '-'.
 #'Note that the response variable of formula is empty in 
spark.kmeans.
-#' @param k Number of centers
-#' @param maxIter Maximum iteration number
-#' @param initMode The initialization algorithm choosen to fit the model
-#' @return \code{spark.kmeans} returns a fitted k-means model
+#' @param ... additional argument(s) passed to the method.
--- End diff --

Yeah, didn't notice this. Done.





[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...

2016-08-18 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14705
  
@inheritParams would be the way to go.






[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...

2016-08-18 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14705
  
@HyukjinKwon - we don't have a coding style guide for R; we have some style 
checks with lint-r.
In addition, the documentation style you are looking at is a bit different from 
coding style - I'm planning to write a documentation style guide after this is 
merged. Perhaps a coding style guide would be good too, for things like, e.g., "what 
to do with a method without parameters".





[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...

2016-08-18 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/13796
  
@dbtsai Thanks for all of your meticulous review. Very much appreciated! 
Glad we can have MLOR in Spark ML now.





[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...

2016-08-18 Thread junyangq
Github user junyangq commented on the issue:

https://github.com/apache/spark/pull/14705
  
@shivaram I found that perhaps a neat way to document R's glm, if we don't want to 
remove it, is to use `@inheritParams stats::glm`. That will bring in all the 
parameters from `stats::glm` that are not listed in SparkR's glm. That also means we 
need a slight modification of the `data` description: something like "a 
SparkDataFrame or R's glm data for training."
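
For reference, a minimal sketch of how that could look in the roxygen block (the wrapper below is illustrative, patterned on SparkR's existing glm method, and is not the exact patch):

```r
#' Generalized linear models (R-compatible wrapper)
#'
#' @inheritParams stats::glm
#' @param data a SparkDataFrame or R's glm data for training.
#' @rdname glm
#' @export
setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDataFrame"),
          function(formula, family = gaussian, data, epsilon = 1e-06, maxit = 25) {
            spark.glm(data, formula, family, tol = epsilon, maxIter = maxit)
          })
```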





[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75428713
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala
 ---
@@ -0,0 +1,611 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => 
BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic (softmax) regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with 
HasElasticNetParam with HasMaxIter
+with HasFitIntercept with HasTol with HasStandardization with 
HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the 
probability of
+   * predicting each class. Array must have length equal to the number of 
classes, with values >= 0.
+   * The class with largest value p/t is predicted, where p is the 
original probability of that
+   * class and t is the class' threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+$(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic (softmax) regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+@Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+with MultinomialLogisticRegressionParams with DefaultParamsWritable 
with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an 
L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0 which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, 
value)
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   * Default is 

[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...

2016-08-18 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/13796
  
@sethah Thank you for this great weighted MLOR work in Spark 2.1. I merged 
this PR into master; let's discuss/work on the follow-ups in separate JIRAs. 
Thanks.





[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13796





[GitHub] spark issue #14279: [SPARK-16216][SQL] Read/write timestamps and dates in IS...

2016-08-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14279
  
BTW I think this is pretty important for the 2.0.1 release.






[GitHub] spark issue #14279: [SPARK-16216][SQL] Read/write timestamps and dates in IS...

2016-08-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14279
  
If we are introducing breaking changes to fix the bugs here, let's fix them 
for real. (It's definitely a problem if we can't specify dateFormat and 
timestampFormat separately.)
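
For illustration, a hypothetical SparkR call showing the independent control being asked for (option names as discussed in this PR, passed through `read.df`'s `...`; the path is made up):

```r
# Dates and timestamps parsed with separate patterns, so changing one format
# does not force a breaking change on the other.
df <- read.df("/tmp/events.csv", source = "csv", header = "true",
              dateFormat = "yyyy-MM-dd",
              timestampFormat = "yyyy-MM-dd'T'HH:mm:ss")
```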






[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75428163
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala
 ---
@@ -0,0 +1,611 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => 
BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic (softmax) regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with 
HasElasticNetParam with HasMaxIter
+with HasFitIntercept with HasTol with HasStandardization with 
HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the 
probability of
+   * predicting each class. Array must have length equal to the number of 
classes, with values >= 0.
+   * The class with largest value p/t is predicted, where p is the 
original probability of that
+   * class and t is the class' threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+$(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic (softmax) regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+@Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+with MultinomialLogisticRegressionParams with DefaultParamsWritable 
with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an 
L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0 which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, 
value)
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   * Default is 

[GitHub] spark issue #14710: [SPARK-16533][CORE]

2016-08-18 Thread petermaxlee
Github user petermaxlee commented on the issue:

https://github.com/apache/spark/pull/14710
  
cc @vanzin and @kayousterhout 





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75427819
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1719,12 +1732,13 @@ setMethod("[", signature(x = "SparkDataFrame"),
 #' Subset
 #'
 #' Return subsets of SparkDataFrame according to given conditions
-#' @param x A SparkDataFrame
-#' @param subset (Optional) A logical expression to filter on rows
-#' @param select expression for the single Column or a list of columns to 
select from the SparkDataFrame
+#' @param x a SparkDataFrame.
+#' @param i,subset (Optional) a logical expression to filter on rows.
+#' @param j,select expression for the single Column or a list of columns 
to select from the SparkDataFrame.
--- End diff --

Perhaps it would be better to rename `i` -> `subset` and `j` -> `select`?
I didn't find a reason for having `i` and `j`.





[GitHub] spark issue #14222: [SPARK-16391][SQL] KeyValueGroupedDataset.reduceGroups s...

2016-08-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14222
  
Closing this now since PR #14576 is merged.





[GitHub] spark pull request #14222: [SPARK-16391][SQL] KeyValueGroupedDataset.reduceG...

2016-08-18 Thread viirya
Github user viirya closed the pull request at:

https://github.com/apache/spark/pull/14222





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75427717
  
--- Diff: R/pkg/R/mllib.R ---
@@ -917,14 +922,14 @@ setMethod("spark.lda", signature(data = 
"SparkDataFrame"),
 # Returns a summary of the AFT survival regression model produced by 
spark.survreg,
 # similarly to R's summary().
 
-#' @param object A fitted AFT survival regression model
+#' @param object a fitted AFT survival regression model.
 #' @return \code{summary} returns a list containing the model's 
coefficients,
 #' intercept and log(scale)
 #' @rdname spark.survreg
 #' @export
 #' @note summary(AFTSurvivalRegressionModel) since 2.0.0
 setMethod("summary", signature(object = "AFTSurvivalRegressionModel"),
-  function(object, ...) {
+  function(object) {
--- End diff --

We have `...` for `summary`? That is used to match the `base::summary` 
signature. I am not completely sure about the exact reason, but the [doc for 
Methods](https://stat.ethz.ch/R-manual/R-devel/library/methods/html/Methods.html)
 says:

"By default, the signature of the generic consists of all the formal 
arguments except ..., in the order they appear in the function definition." 

Does that perhaps explain the behavior?





[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75427619
  
--- Diff: R/pkg/R/functions.R ---
@@ -1848,7 +1850,7 @@ setMethod("upper",
 #' @note var since 1.6.0
 setMethod("var",
   signature(x = "Column"),
-  function(x) {
+  function(x, y, na.rm, use) {
--- End diff --

please see the example for `sd`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75427590
  
--- Diff: R/pkg/R/functions.R ---
@@ -1335,7 +1336,7 @@ setMethod("rtrim",
 #' @note sd since 1.6.0
 setMethod("sd",
   signature(x = "Column"),
-  function(x) {
+  function(x, na.rm) {
--- End diff --

It seems to work:
```
> setGeneric("sd", function(x, na.rm = FALSE) { standardGeneric("sd") })
Creating a new generic function for ‘sd’ in the global environment
[1] "sd"
> setMethod("sd", signature(x = "character"), function(x) { print("blah") })
[1] "sd"
> sd(1)
[1] NA
> sd(1:2)
[1] 0.7071068
> sd("abc")
[1] "blah"
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13950: [SPARK-15487] [Web UI] Spark Master UI to reverse proxy ...

2016-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13950
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64031/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13950: [SPARK-15487] [Web UI] Spark Master UI to reverse proxy ...

2016-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13950
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13950: [SPARK-15487] [Web UI] Spark Master UI to reverse proxy ...

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13950
  
**[Test build #64031 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64031/consoleFull)**
 for PR 13950 at commit 
[`032ac0e`](https://github.com/apache/spark/commit/032ac0ecfb14fba2a0d2872b406993243d39ca8b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75427102
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1088,109 +1108,86 @@ private[spark] class BlockManager(
   }
 
   /**
-   * Replicate block to another node. Not that this is a blocking call 
that returns after
+   * Replicate block to another node. Note that this is a blocking call 
that returns after
* the block has been replicated.
+   *
+   * @param blockId
+   * @param data
+   * @param level
+   * @param classTag
*/
   private def replicate(
-  blockId: BlockId,
-  data: ChunkedByteBuffer,
-  level: StorageLevel,
-  classTag: ClassTag[_]): Unit = {
+blockId: BlockId,
+data: ChunkedByteBuffer,
+level: StorageLevel,
+classTag: ClassTag[_]): Unit = {
+
 val maxReplicationFailures = 
conf.getInt("spark.storage.maxReplicationFailures", 1)
-val numPeersToReplicateTo = level.replication - 1
-val peersForReplication = new ArrayBuffer[BlockManagerId]
-val peersReplicatedTo = new ArrayBuffer[BlockManagerId]
-val peersFailedToReplicateTo = new ArrayBuffer[BlockManagerId]
 val tLevel = StorageLevel(
   useDisk = level.useDisk,
   useMemory = level.useMemory,
   useOffHeap = level.useOffHeap,
   deserialized = level.deserialized,
   replication = 1)
+
+val numPeersToReplicateTo = level.replication - 1
+
 val startTime = System.currentTimeMillis
-val random = new Random(blockId.hashCode)
-
-var replicationFailed = false
-var failures = 0
-var done = false
-
-// Get cached list of peers
-peersForReplication ++= getPeers(forceFetch = false)
-
-// Get a random peer. Note that this selection of a peer is 
deterministic on the block id.
-// So assuming the list of peers does not change and no replication 
failures,
-// if there are multiple attempts in the same node to replicate the 
same block,
-// the same set of peers will be selected.
-def getRandomPeer(): Option[BlockManagerId] = {
-  // If replication had failed, then force update the cached list of 
peers and remove the peers
-  // that have been already used
-  if (replicationFailed) {
-peersForReplication.clear()
-peersForReplication ++= getPeers(forceFetch = true)
-peersForReplication --= peersReplicatedTo
-peersForReplication --= peersFailedToReplicateTo
-  }
-  if (!peersForReplication.isEmpty) {
-Some(peersForReplication(random.nextInt(peersForReplication.size)))
-  } else {
-None
-  }
-}
 
-// One by one choose a random peer and try uploading the block to it
-// If replication fails (e.g., target peer is down), force the list of 
cached peers
-// to be re-fetched from driver and then pick another random peer for 
replication. Also
-// temporarily black list the peer for which replication failed.
-//
-// This selection of a peer and replication is continued in a loop 
until one of the
-// following 3 conditions is fulfilled:
-// (i) specified number of peers have been replicated to
-// (ii) too many failures in replicating to peers
-// (iii) no peer left to replicate to
-//
-while (!done) {
-  getRandomPeer() match {
-case Some(peer) =>
-  try {
-val onePeerStartTime = System.currentTimeMillis
-logTrace(s"Trying to replicate $blockId of ${data.size} bytes 
to $peer")
-blockTransferService.uploadBlockSync(
-  peer.host,
-  peer.port,
-  peer.executorId,
-  blockId,
-  new NettyManagedBuffer(data.toNetty),
-  tLevel,
-  classTag)
-logTrace(s"Replicated $blockId of ${data.size} bytes to $peer 
in %s ms"
-  .format(System.currentTimeMillis - onePeerStartTime))
-peersReplicatedTo += peer
-peersForReplication -= peer
-replicationFailed = false
-if (peersReplicatedTo.size == numPeersToReplicateTo) {
-  done = true  // specified number of peers have been 
replicated to
-}
-  } catch {
-case e: Exception =>
-  logWarning(s"Failed to replicate $blockId to $peer, failure 
#$failures", e)
-  failures += 1
-  replicationFailed = true
-  peersFailedToReplicateTo += peer
-  if (failures > maxReplicationFailures) { // 

[GitHub] spark issue #14710: [SPARK-16533][CORE]

2016-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14710
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75427019
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1202,6 +1215,7 @@ setMethod("toRDD",
 #' Groups the SparkDataFrame using the specified columns, so we can run 
aggregation on them.
 #'
 #' @param x a SparkDataFrame
+#' @param ... variable(s) (character name(s) or Column(s)) to group on.
 #' @return a GroupedData
--- End diff --

I basically follow the convention of the docs of many R base functions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426940
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1088,109 +1108,86 @@ private[spark] class BlockManager(
   }
 
   /**
-   * Replicate block to another node. Not that this is a blocking call 
that returns after
+   * Replicate block to another node. Note that this is a blocking call 
that returns after
* the block has been replicated.
+   *
+   * @param blockId
+   * @param data
+   * @param level
+   * @param classTag
*/
   private def replicate(
-  blockId: BlockId,
-  data: ChunkedByteBuffer,
-  level: StorageLevel,
-  classTag: ClassTag[_]): Unit = {
+blockId: BlockId,
+data: ChunkedByteBuffer,
+level: StorageLevel,
+classTag: ClassTag[_]): Unit = {
+
 val maxReplicationFailures = 
conf.getInt("spark.storage.maxReplicationFailures", 1)
-val numPeersToReplicateTo = level.replication - 1
-val peersForReplication = new ArrayBuffer[BlockManagerId]
-val peersReplicatedTo = new ArrayBuffer[BlockManagerId]
-val peersFailedToReplicateTo = new ArrayBuffer[BlockManagerId]
 val tLevel = StorageLevel(
   useDisk = level.useDisk,
   useMemory = level.useMemory,
   useOffHeap = level.useOffHeap,
   deserialized = level.deserialized,
   replication = 1)
+
+val numPeersToReplicateTo = level.replication - 1
+
 val startTime = System.currentTimeMillis
-val random = new Random(blockId.hashCode)
-
-var replicationFailed = false
-var failures = 0
-var done = false
-
-// Get cached list of peers
-peersForReplication ++= getPeers(forceFetch = false)
-
-// Get a random peer. Note that this selection of a peer is 
deterministic on the block id.
-// So assuming the list of peers does not change and no replication 
failures,
-// if there are multiple attempts in the same node to replicate the 
same block,
-// the same set of peers will be selected.
-def getRandomPeer(): Option[BlockManagerId] = {
-  // If replication had failed, then force update the cached list of 
peers and remove the peers
-  // that have been already used
-  if (replicationFailed) {
-peersForReplication.clear()
-peersForReplication ++= getPeers(forceFetch = true)
-peersForReplication --= peersReplicatedTo
-peersForReplication --= peersFailedToReplicateTo
-  }
-  if (!peersForReplication.isEmpty) {
-Some(peersForReplication(random.nextInt(peersForReplication.size)))
-  } else {
-None
-  }
-}
 
-// One by one choose a random peer and try uploading the block to it
-// If replication fails (e.g., target peer is down), force the list of 
cached peers
-// to be re-fetched from driver and then pick another random peer for 
replication. Also
-// temporarily black list the peer for which replication failed.
-//
-// This selection of a peer and replication is continued in a loop 
until one of the
-// following 3 conditions is fulfilled:
-// (i) specified number of peers have been replicated to
-// (ii) too many failures in replicating to peers
-// (iii) no peer left to replicate to
-//
-while (!done) {
-  getRandomPeer() match {
-case Some(peer) =>
-  try {
-val onePeerStartTime = System.currentTimeMillis
-logTrace(s"Trying to replicate $blockId of ${data.size} bytes 
to $peer")
-blockTransferService.uploadBlockSync(
-  peer.host,
-  peer.port,
-  peer.executorId,
-  blockId,
-  new NettyManagedBuffer(data.toNetty),
-  tLevel,
-  classTag)
-logTrace(s"Replicated $blockId of ${data.size} bytes to $peer 
in %s ms"
-  .format(System.currentTimeMillis - onePeerStartTime))
-peersReplicatedTo += peer
-peersForReplication -= peer
-replicationFailed = false
-if (peersReplicatedTo.size == numPeersToReplicateTo) {
-  done = true  // specified number of peers have been 
replicated to
-}
-  } catch {
-case e: Exception =>
-  logWarning(s"Failed to replicate $blockId to $peer, failure 
#$failures", e)
-  failures += 1
-  replicationFailed = true
-  peersFailedToReplicateTo += peer
-  if (failures > maxReplicationFailures) { // 

[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426951
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1088,109 +1108,86 @@ private[spark] class BlockManager(
   }
 
   /**
-   * Replicate block to another node. Not that this is a blocking call 
that returns after
+   * Replicate block to another node. Note that this is a blocking call 
that returns after
* the block has been replicated.
+   *
+   * @param blockId
+   * @param data
+   * @param level
+   * @param classTag
*/
   private def replicate(
-  blockId: BlockId,
-  data: ChunkedByteBuffer,
-  level: StorageLevel,
-  classTag: ClassTag[_]): Unit = {
+blockId: BlockId,
+data: ChunkedByteBuffer,
+level: StorageLevel,
+classTag: ClassTag[_]): Unit = {
+
 val maxReplicationFailures = 
conf.getInt("spark.storage.maxReplicationFailures", 1)
-val numPeersToReplicateTo = level.replication - 1
-val peersForReplication = new ArrayBuffer[BlockManagerId]
-val peersReplicatedTo = new ArrayBuffer[BlockManagerId]
-val peersFailedToReplicateTo = new ArrayBuffer[BlockManagerId]
 val tLevel = StorageLevel(
   useDisk = level.useDisk,
   useMemory = level.useMemory,
   useOffHeap = level.useOffHeap,
   deserialized = level.deserialized,
   replication = 1)
+
+val numPeersToReplicateTo = level.replication - 1
+
 val startTime = System.currentTimeMillis
-val random = new Random(blockId.hashCode)
-
-var replicationFailed = false
-var failures = 0
-var done = false
-
-// Get cached list of peers
-peersForReplication ++= getPeers(forceFetch = false)
-
-// Get a random peer. Note that this selection of a peer is 
deterministic on the block id.
-// So assuming the list of peers does not change and no replication 
failures,
-// if there are multiple attempts in the same node to replicate the 
same block,
-// the same set of peers will be selected.
-def getRandomPeer(): Option[BlockManagerId] = {
-  // If replication had failed, then force update the cached list of 
peers and remove the peers
-  // that have been already used
-  if (replicationFailed) {
-peersForReplication.clear()
-peersForReplication ++= getPeers(forceFetch = true)
-peersForReplication --= peersReplicatedTo
-peersForReplication --= peersFailedToReplicateTo
-  }
-  if (!peersForReplication.isEmpty) {
-Some(peersForReplication(random.nextInt(peersForReplication.size)))
-  } else {
-None
-  }
-}
 
-// One by one choose a random peer and try uploading the block to it
-// If replication fails (e.g., target peer is down), force the list of 
cached peers
-// to be re-fetched from driver and then pick another random peer for 
replication. Also
-// temporarily black list the peer for which replication failed.
-//
-// This selection of a peer and replication is continued in a loop 
until one of the
-// following 3 conditions is fulfilled:
-// (i) specified number of peers have been replicated to
-// (ii) too many failures in replicating to peers
-// (iii) no peer left to replicate to
-//
-while (!done) {
-  getRandomPeer() match {
-case Some(peer) =>
-  try {
-val onePeerStartTime = System.currentTimeMillis
-logTrace(s"Trying to replicate $blockId of ${data.size} bytes 
to $peer")
-blockTransferService.uploadBlockSync(
-  peer.host,
-  peer.port,
-  peer.executorId,
-  blockId,
-  new NettyManagedBuffer(data.toNetty),
-  tLevel,
-  classTag)
-logTrace(s"Replicated $blockId of ${data.size} bytes to $peer 
in %s ms"
-  .format(System.currentTimeMillis - onePeerStartTime))
-peersReplicatedTo += peer
-peersForReplication -= peer
-replicationFailed = false
-if (peersReplicatedTo.size == numPeersToReplicateTo) {
-  done = true  // specified number of peers have been 
replicated to
-}
-  } catch {
-case e: Exception =>
-  logWarning(s"Failed to replicate $blockId to $peer, failure 
#$failures", e)
-  failures += 1
-  replicationFailed = true
-  peersFailedToReplicateTo += peer
-  if (failures > maxReplicationFailures) { // 

[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75426918
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1202,6 +1215,7 @@ setMethod("toRDD",
 #' Groups the SparkDataFrame using the specified columns, so we can run 
aggregation on them.
 #'
 #' @param x a SparkDataFrame
+#' @param ... variable(s) (character name(s) or Column(s)) to group on.
 #' @return a GroupedData
--- End diff --

Yeah, thanks for the catch. This makes perfect sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14710: [SPARK-16533][CORE]

2016-08-18 Thread angolon
GitHub user angolon opened a pull request:

https://github.com/apache/spark/pull/14710

[SPARK-16533][CORE]

## What changes were proposed in this pull request?
This pull request reverts the changes made as a part of #14605, which 
simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling
`ExecutorAllocationManager.schedule` to schedule executor requests. The
intent is that if an invocation is delayed beyond the default schedule
interval because of lock contention, we avoid back-to-back calls to
`schedule` that would release and then immediately reacquire those locks,
further exacerbating contention (see the sketch after this list).
* Replace a number of calls to `askWithRetry` with `ask` inside the message
handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us
to queue messages with the relevant endpoints, release whatever locks we might
be holding, and then block while awaiting the response. This change comes at
the cost of being able to retry should sending the message fail, as retrying
outside of the lock could easily cause race conditions if other conflicting
messages have been sent while awaiting a response. I believe this to be the
lesser of two evils: in many cases these RPC calls are to process-local
components, so failures are more likely to be deterministic, and timeouts
are more likely to be caused by lock contention.
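
A minimal, self-contained sketch (not Spark code, just the JDK scheduling
semantics) of why `scheduleWithFixedDelay` helps here: the delay is measured
from the end of the previous run, so a slow, lock-contended invocation is never
followed immediately by another one, unlike `scheduleAtFixedRate`, where an
overrunning task is followed immediately by the next scheduled run. The class
and task names below are mine, not from the patch.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object FixedDelaySketch {
  def main(args: Array[String]): Unit = {
    val executor = Executors.newSingleThreadScheduledExecutor()

    // Stand-in for ExecutorAllocationManager.schedule(); the sleep simulates
    // time spent blocked on contended locks.
    val task = new Runnable {
      def run(): Unit = {
        println(s"schedule() invoked at ${System.currentTimeMillis()}")
        Thread.sleep(250)
      }
    }

    // The next run starts 100ms after the previous run *finished*, so runs
    // stay spaced out even when each one overruns the interval.
    executor.scheduleWithFixedDelay(task, 0, 100, TimeUnit.MILLISECONDS)

    Thread.sleep(1500)
    executor.shutdown()
  }
}
```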

## How was this patch tested?
Existing tests, and manual tests under yarn-client mode.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/angolon/spark SPARK-16533

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14710.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14710


commit cef69bf470199c63b6638933756b1d057dc890d1
Author: Angus Gerry 
Date:   2016-08-19T01:52:58Z

Revert "[SPARK-17022][YARN] Handle potential deadlock in driver handling 
messages"

This reverts commit ea0bf91b4a2ca3ef472906e50e31fd6268b6f53e.

commit 4970b3b0bcd834bbe5d5473a3065f04a48b12643
Author: Angus Gerry 
Date:   2016-08-09T04:45:29Z

[SPARK-16533][CORE] Use scheduleWithFixedDelay when calling 
ExecutorAllocatorManager.schedule to ease contention on locks.

commit 920274a3ed0b8278d38d721587a24c9441fa5ff3
Author: Angus Gerry 
Date:   2016-08-04T06:27:56Z

[SPARK-16533][CORE] Replace many calls to askWithRetry to plain old ask.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426791
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1088,109 +1108,86 @@ private[spark] class BlockManager(
   }
 
   /**
-   * Replicate block to another node. Not that this is a blocking call 
that returns after
+   * Replicate block to another node. Note that this is a blocking call 
that returns after
* the block has been replicated.
+   *
+   * @param blockId
+   * @param data
+   * @param level
+   * @param classTag
*/
   private def replicate(
-  blockId: BlockId,
-  data: ChunkedByteBuffer,
-  level: StorageLevel,
-  classTag: ClassTag[_]): Unit = {
+blockId: BlockId,
+data: ChunkedByteBuffer,
+level: StorageLevel,
+classTag: ClassTag[_]): Unit = {
+
 val maxReplicationFailures = 
conf.getInt("spark.storage.maxReplicationFailures", 1)
-val numPeersToReplicateTo = level.replication - 1
-val peersForReplication = new ArrayBuffer[BlockManagerId]
-val peersReplicatedTo = new ArrayBuffer[BlockManagerId]
-val peersFailedToReplicateTo = new ArrayBuffer[BlockManagerId]
 val tLevel = StorageLevel(
   useDisk = level.useDisk,
   useMemory = level.useMemory,
   useOffHeap = level.useOffHeap,
   deserialized = level.deserialized,
   replication = 1)
+
+val numPeersToReplicateTo = level.replication - 1
+
 val startTime = System.currentTimeMillis
-val random = new Random(blockId.hashCode)
-
-var replicationFailed = false
-var failures = 0
-var done = false
-
-// Get cached list of peers
-peersForReplication ++= getPeers(forceFetch = false)
-
-// Get a random peer. Note that this selection of a peer is 
deterministic on the block id.
-// So assuming the list of peers does not change and no replication 
failures,
-// if there are multiple attempts in the same node to replicate the 
same block,
-// the same set of peers will be selected.
-def getRandomPeer(): Option[BlockManagerId] = {
-  // If replication had failed, then force update the cached list of 
peers and remove the peers
-  // that have been already used
-  if (replicationFailed) {
-peersForReplication.clear()
-peersForReplication ++= getPeers(forceFetch = true)
-peersForReplication --= peersReplicatedTo
-peersForReplication --= peersFailedToReplicateTo
-  }
-  if (!peersForReplication.isEmpty) {
-Some(peersForReplication(random.nextInt(peersForReplication.size)))
-  } else {
-None
-  }
-}
 
-// One by one choose a random peer and try uploading the block to it
-// If replication fails (e.g., target peer is down), force the list of 
cached peers
-// to be re-fetched from driver and then pick another random peer for 
replication. Also
-// temporarily black list the peer for which replication failed.
-//
-// This selection of a peer and replication is continued in a loop 
until one of the
-// following 3 conditions is fulfilled:
-// (i) specified number of peers have been replicated to
-// (ii) too many failures in replicating to peers
-// (iii) no peer left to replicate to
-//
-while (!done) {
-  getRandomPeer() match {
-case Some(peer) =>
-  try {
-val onePeerStartTime = System.currentTimeMillis
-logTrace(s"Trying to replicate $blockId of ${data.size} bytes 
to $peer")
-blockTransferService.uploadBlockSync(
-  peer.host,
-  peer.port,
-  peer.executorId,
-  blockId,
-  new NettyManagedBuffer(data.toNetty),
-  tLevel,
-  classTag)
-logTrace(s"Replicated $blockId of ${data.size} bytes to $peer 
in %s ms"
-  .format(System.currentTimeMillis - onePeerStartTime))
-peersReplicatedTo += peer
-peersForReplication -= peer
-replicationFailed = false
-if (peersReplicatedTo.size == numPeersToReplicateTo) {
-  done = true  // specified number of peers have been 
replicated to
-}
-  } catch {
-case e: Exception =>
-  logWarning(s"Failed to replicate $blockId to $peer, failure 
#$failures", e)
-  failures += 1
-  replicationFailed = true
-  peersFailedToReplicateTo += peer
-  if (failures > maxReplicationFailures) { // 

[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426781
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1088,109 +1108,86 @@ private[spark] class BlockManager(
   }
 
   /**
-   * Replicate block to another node. Not that this is a blocking call 
that returns after
+   * Replicate block to another node. Note that this is a blocking call 
that returns after
* the block has been replicated.
+   *
+   * @param blockId
+   * @param data
+   * @param level
+   * @param classTag
*/
   private def replicate(
-  blockId: BlockId,
-  data: ChunkedByteBuffer,
-  level: StorageLevel,
-  classTag: ClassTag[_]): Unit = {
+blockId: BlockId,
+data: ChunkedByteBuffer,
+level: StorageLevel,
+classTag: ClassTag[_]): Unit = {
+
 val maxReplicationFailures = 
conf.getInt("spark.storage.maxReplicationFailures", 1)
-val numPeersToReplicateTo = level.replication - 1
-val peersForReplication = new ArrayBuffer[BlockManagerId]
-val peersReplicatedTo = new ArrayBuffer[BlockManagerId]
-val peersFailedToReplicateTo = new ArrayBuffer[BlockManagerId]
 val tLevel = StorageLevel(
   useDisk = level.useDisk,
   useMemory = level.useMemory,
   useOffHeap = level.useOffHeap,
   deserialized = level.deserialized,
   replication = 1)
+
+val numPeersToReplicateTo = level.replication - 1
+
 val startTime = System.currentTimeMillis
-val random = new Random(blockId.hashCode)
-
-var replicationFailed = false
-var failures = 0
-var done = false
-
-// Get cached list of peers
-peersForReplication ++= getPeers(forceFetch = false)
-
-// Get a random peer. Note that this selection of a peer is 
deterministic on the block id.
-// So assuming the list of peers does not change and no replication 
failures,
-// if there are multiple attempts in the same node to replicate the 
same block,
-// the same set of peers will be selected.
-def getRandomPeer(): Option[BlockManagerId] = {
-  // If replication had failed, then force update the cached list of 
peers and remove the peers
-  // that have been already used
-  if (replicationFailed) {
-peersForReplication.clear()
-peersForReplication ++= getPeers(forceFetch = true)
-peersForReplication --= peersReplicatedTo
-peersForReplication --= peersFailedToReplicateTo
-  }
-  if (!peersForReplication.isEmpty) {
-Some(peersForReplication(random.nextInt(peersForReplication.size)))
-  } else {
-None
-  }
-}
 
-// One by one choose a random peer and try uploading the block to it
-// If replication fails (e.g., target peer is down), force the list of 
cached peers
-// to be re-fetched from driver and then pick another random peer for 
replication. Also
-// temporarily black list the peer for which replication failed.
-//
-// This selection of a peer and replication is continued in a loop 
until one of the
-// following 3 conditions is fulfilled:
-// (i) specified number of peers have been replicated to
-// (ii) too many failures in replicating to peers
-// (iii) no peer left to replicate to
-//
-while (!done) {
-  getRandomPeer() match {
-case Some(peer) =>
-  try {
-val onePeerStartTime = System.currentTimeMillis
-logTrace(s"Trying to replicate $blockId of ${data.size} bytes 
to $peer")
-blockTransferService.uploadBlockSync(
-  peer.host,
-  peer.port,
-  peer.executorId,
-  blockId,
-  new NettyManagedBuffer(data.toNetty),
-  tLevel,
-  classTag)
-logTrace(s"Replicated $blockId of ${data.size} bytes to $peer 
in %s ms"
-  .format(System.currentTimeMillis - onePeerStartTime))
-peersReplicatedTo += peer
-peersForReplication -= peer
-replicationFailed = false
-if (peersReplicatedTo.size == numPeersToReplicateTo) {
-  done = true  // specified number of peers have been 
replicated to
-}
-  } catch {
-case e: Exception =>
-  logWarning(s"Failed to replicate $blockId to $peer, failure 
#$failures", e)
-  failures += 1
-  replicationFailed = true
-  peersFailedToReplicateTo += peer
-  if (failures > maxReplicationFailures) { // 

[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426753
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1088,109 +1108,86 @@ private[spark] class BlockManager(
   }
 
   /**
-   * Replicate block to another node. Not that this is a blocking call 
that returns after
+   * Replicate block to another node. Note that this is a blocking call 
that returns after
* the block has been replicated.
+   *
+   * @param blockId
+   * @param data
+   * @param level
+   * @param classTag
*/
   private def replicate(
-  blockId: BlockId,
-  data: ChunkedByteBuffer,
-  level: StorageLevel,
-  classTag: ClassTag[_]): Unit = {
+blockId: BlockId,
+data: ChunkedByteBuffer,
+level: StorageLevel,
+classTag: ClassTag[_]): Unit = {
+
 val maxReplicationFailures = 
conf.getInt("spark.storage.maxReplicationFailures", 1)
-val numPeersToReplicateTo = level.replication - 1
-val peersForReplication = new ArrayBuffer[BlockManagerId]
-val peersReplicatedTo = new ArrayBuffer[BlockManagerId]
-val peersFailedToReplicateTo = new ArrayBuffer[BlockManagerId]
 val tLevel = StorageLevel(
   useDisk = level.useDisk,
   useMemory = level.useMemory,
   useOffHeap = level.useOffHeap,
   deserialized = level.deserialized,
   replication = 1)
+
+val numPeersToReplicateTo = level.replication - 1
+
 val startTime = System.currentTimeMillis
-val random = new Random(blockId.hashCode)
-
-var replicationFailed = false
-var failures = 0
-var done = false
-
-// Get cached list of peers
-peersForReplication ++= getPeers(forceFetch = false)
-
-// Get a random peer. Note that this selection of a peer is 
deterministic on the block id.
-// So assuming the list of peers does not change and no replication 
failures,
-// if there are multiple attempts in the same node to replicate the 
same block,
-// the same set of peers will be selected.
-def getRandomPeer(): Option[BlockManagerId] = {
-  // If replication had failed, then force update the cached list of 
peers and remove the peers
-  // that have been already used
-  if (replicationFailed) {
-peersForReplication.clear()
-peersForReplication ++= getPeers(forceFetch = true)
-peersForReplication --= peersReplicatedTo
-peersForReplication --= peersFailedToReplicateTo
-  }
-  if (!peersForReplication.isEmpty) {
-Some(peersForReplication(random.nextInt(peersForReplication.size)))
-  } else {
-None
-  }
-}
 
-// One by one choose a random peer and try uploading the block to it
-// If replication fails (e.g., target peer is down), force the list of 
cached peers
-// to be re-fetched from driver and then pick another random peer for 
replication. Also
-// temporarily black list the peer for which replication failed.
-//
-// This selection of a peer and replication is continued in a loop 
until one of the
-// following 3 conditions is fulfilled:
-// (i) specified number of peers have been replicated to
-// (ii) too many failures in replicating to peers
-// (iii) no peer left to replicate to
-//
-while (!done) {
-  getRandomPeer() match {
-case Some(peer) =>
-  try {
-val onePeerStartTime = System.currentTimeMillis
-logTrace(s"Trying to replicate $blockId of ${data.size} bytes 
to $peer")
-blockTransferService.uploadBlockSync(
-  peer.host,
-  peer.port,
-  peer.executorId,
-  blockId,
-  new NettyManagedBuffer(data.toNetty),
-  tLevel,
-  classTag)
-logTrace(s"Replicated $blockId of ${data.size} bytes to $peer 
in %s ms"
-  .format(System.currentTimeMillis - onePeerStartTime))
-peersReplicatedTo += peer
-peersForReplication -= peer
-replicationFailed = false
-if (peersReplicatedTo.size == numPeersToReplicateTo) {
-  done = true  // specified number of peers have been 
replicated to
-}
-  } catch {
-case e: Exception =>
-  logWarning(s"Failed to replicate $blockId to $peer, failure 
#$failures", e)
-  failures += 1
-  replicationFailed = true
-  peersFailedToReplicateTo += peer
-  if (failures > maxReplicationFailures) { // 

[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426714
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1088,109 +1108,86 @@ private[spark] class BlockManager(
   }
 
   /**
-   * Replicate block to another node. Not that this is a blocking call 
that returns after
+   * Replicate block to another node. Note that this is a blocking call 
that returns after
* the block has been replicated.
+   *
+   * @param blockId
+   * @param data
+   * @param level
+   * @param classTag
*/
   private def replicate(
-  blockId: BlockId,
-  data: ChunkedByteBuffer,
-  level: StorageLevel,
-  classTag: ClassTag[_]): Unit = {
+blockId: BlockId,
+data: ChunkedByteBuffer,
+level: StorageLevel,
+classTag: ClassTag[_]): Unit = {
+
 val maxReplicationFailures = 
conf.getInt("spark.storage.maxReplicationFailures", 1)
-val numPeersToReplicateTo = level.replication - 1
-val peersForReplication = new ArrayBuffer[BlockManagerId]
-val peersReplicatedTo = new ArrayBuffer[BlockManagerId]
-val peersFailedToReplicateTo = new ArrayBuffer[BlockManagerId]
 val tLevel = StorageLevel(
   useDisk = level.useDisk,
   useMemory = level.useMemory,
   useOffHeap = level.useOffHeap,
   deserialized = level.deserialized,
   replication = 1)
+
+val numPeersToReplicateTo = level.replication - 1
+
 val startTime = System.currentTimeMillis
-val random = new Random(blockId.hashCode)
-
-var replicationFailed = false
-var failures = 0
-var done = false
-
-// Get cached list of peers
-peersForReplication ++= getPeers(forceFetch = false)
-
-// Get a random peer. Note that this selection of a peer is 
deterministic on the block id.
-// So assuming the list of peers does not change and no replication 
failures,
-// if there are multiple attempts in the same node to replicate the 
same block,
-// the same set of peers will be selected.
-def getRandomPeer(): Option[BlockManagerId] = {
-  // If replication had failed, then force update the cached list of 
peers and remove the peers
-  // that have been already used
-  if (replicationFailed) {
-peersForReplication.clear()
-peersForReplication ++= getPeers(forceFetch = true)
-peersForReplication --= peersReplicatedTo
-peersForReplication --= peersFailedToReplicateTo
-  }
-  if (!peersForReplication.isEmpty) {
-Some(peersForReplication(random.nextInt(peersForReplication.size)))
-  } else {
-None
-  }
-}
 
-// One by one choose a random peer and try uploading the block to it
-// If replication fails (e.g., target peer is down), force the list of 
cached peers
-// to be re-fetched from driver and then pick another random peer for 
replication. Also
-// temporarily black list the peer for which replication failed.
-//
-// This selection of a peer and replication is continued in a loop 
until one of the
-// following 3 conditions is fulfilled:
-// (i) specified number of peers have been replicated to
-// (ii) too many failures in replicating to peers
-// (iii) no peer left to replicate to
-//
-while (!done) {
-  getRandomPeer() match {
-case Some(peer) =>
-  try {
-val onePeerStartTime = System.currentTimeMillis
-logTrace(s"Trying to replicate $blockId of ${data.size} bytes 
to $peer")
-blockTransferService.uploadBlockSync(
-  peer.host,
-  peer.port,
-  peer.executorId,
-  blockId,
-  new NettyManagedBuffer(data.toNetty),
-  tLevel,
-  classTag)
-logTrace(s"Replicated $blockId of ${data.size} bytes to $peer 
in %s ms"
-  .format(System.currentTimeMillis - onePeerStartTime))
-peersReplicatedTo += peer
-peersForReplication -= peer
-replicationFailed = false
-if (peersReplicatedTo.size == numPeersToReplicateTo) {
-  done = true  // specified number of peers have been 
replicated to
-}
-  } catch {
-case e: Exception =>
-  logWarning(s"Failed to replicate $blockId to $peer, failure 
#$failures", e)
-  failures += 1
-  replicationFailed = true
-  peersFailedToReplicateTo += peer
-  if (failures > maxReplicationFailures) { // 

[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14426
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14426
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64042/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14426
  
**[Test build #64042 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64042/consoleFull)**
 for PR 14426 at commit 
[`d722be2`](https://github.com/apache/spark/commit/d722be2c1660b86eb1cc23cfa1dad33095c839b7).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class Hint(name: String, parameters: Seq[String], child: 
LogicalPlan) extends UnaryNode `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426587
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockReplicationPrioritization.scala
 ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.storage
+
+import scala.util.Random
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.internal.Logging
+
+/**
+ * ::DeveloperApi::
+ * BlockReplicationPrioritization provides logic for prioritizing a 
sequence of peers for
+ * replicating blocks. BlockManager will replicate to each peer returned 
in order until the
+ * desired replication order is reached. If a replication fails, 
prioritize() will be called
+ * again to get a fresh prioritization.
+ */
+@DeveloperApi
+trait BlockReplicationPrioritization {
--- End diff --

can we just name this BlockReplicationPolicy?
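
For what it's worth, a minimal sketch of what an implementation of this
contract could look like; the parameter list is an assumption (the full
`prioritize()` signature is not shown in the hunk above), and the `Peer` type
is a stand-in for `BlockManagerId`, not Spark code.

```scala
import scala.util.Random

// Stand-in for BlockManagerId, just enough for the sketch.
case class Peer(executorId: String, host: String, port: Int)

class RandomPrioritizationSketch {
  // Drop ourselves and any peers already replicated to, then shuffle
  // deterministically on the block id so repeated attempts for the same
  // block see the same ordering.
  def prioritize(
      self: Peer,
      peers: Seq[Peer],
      peersReplicatedTo: Set[Peer],
      blockIdHash: Int): Seq[Peer] = {
    val candidates = peers.filterNot(p => p == self || peersReplicatedTo.contains(p))
    new Random(blockIdHash).shuffle(candidates.toList)
  }
}
```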




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426605
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -20,6 +20,7 @@ package org.apache.spark.storage
 import java.io._
 import java.nio.ByteBuffer
 
+import scala.annotation.tailrec
--- End diff --

is this used anywhere?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426552
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1088,109 +1108,86 @@ private[spark] class BlockManager(
   }
 
   /**
-   * Replicate block to another node. Not that this is a blocking call 
that returns after
+   * Replicate block to another node. Note that this is a blocking call 
that returns after
* the block has been replicated.
+   *
+   * @param blockId
+   * @param data
+   * @param level
+   * @param classTag
*/
   private def replicate(
-  blockId: BlockId,
-  data: ChunkedByteBuffer,
-  level: StorageLevel,
-  classTag: ClassTag[_]): Unit = {
+blockId: BlockId,
--- End diff --

reset the change here - use 4 space indent


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426567
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1088,109 +1108,86 @@ private[spark] class BlockManager(
   }
 
   /**
-   * Replicate block to another node. Not that this is a blocking call 
that returns after
+   * Replicate block to another node. Note that this is a blocking call 
that returns after
* the block has been replicated.
+   *
+   * @param blockId
--- End diff --

remove these params unless you really are going to document them.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426531
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala 
---
@@ -69,24 +72,37 @@ class BlockManagerId private (
 out.writeUTF(executorId_)
 out.writeUTF(host_)
 out.writeInt(port_)
+out.writeBoolean(topologyInfo_.isDefined)
+// we only write topologyInfo if we have it
+topologyInfo.foreach(out.writeUTF(_))
   }
 
   override def readExternal(in: ObjectInput): Unit = 
Utils.tryOrIOException {
 executorId_ = in.readUTF()
 host_ = in.readUTF()
 port_ = in.readInt()
+val isTopologyInfoAvailable = in.readBoolean()
+topologyInfo_ = if (isTopologyInfoAvailable) {
--- End diff --

it might be more clear to do
```
if (isTopologyInfoAvailable) {
  topologyInfo_ = Option(in.readUTF())
}  else {
  topologyInfo_ = None
}
```

or

```
topologyInfo_ = if (isTopologyInfoAvailable) Option(in.readUTF()) else None
```
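
For reference, a minimal round-trip sketch of the flag-then-value pattern being
discussed, using plain `DataOutputStream`/`DataInputStream` rather than Spark's
`Externalizable` plumbing; the `write`/`read` helpers and the object name are
mine, not from the patch.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object OptionalFieldRoundTrip {
  // Write a presence flag first, then the value only if it is defined.
  def write(out: DataOutputStream, topologyInfo: Option[String]): Unit = {
    out.writeBoolean(topologyInfo.isDefined)
    topologyInfo.foreach(out.writeUTF(_))
  }

  // Read the flag back and only then read the optional value.
  def read(in: DataInputStream): Option[String] =
    if (in.readBoolean()) Option(in.readUTF()) else None

  def main(args: Array[String]): Unit = {
    val buf = new ByteArrayOutputStream()
    val out = new DataOutputStream(buf)
    write(out, Some("/rack-42"))
    write(out, None)
    val in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray))
    println(read(in)) // Some(/rack-42)
    println(read(in)) // None
  }
}
```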


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426476
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala 
---
@@ -101,10 +117,18 @@ private[spark] object BlockManagerId {
* @param execId ID of the executor.
* @param host Host name of the block manager.
* @param port Port of the block manager.
+   * @param topologyInfo topology information for the blockmanager, if 
available
+   * This can be network topology information for use 
while choosing peers
+   * while replicating data blocks. More information 
available here:
+   * [[org.apache.spark.storage.TopologyMapper]]
* @return A new [[org.apache.spark.storage.BlockManagerId]].
*/
-  def apply(execId: String, host: String, port: Int): BlockManagerId =
-getCachedBlockManagerId(new BlockManagerId(execId, host, port))
+  def apply(
+execId: String,
--- End diff --

4 space indent here too


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426446
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala 
---
@@ -298,7 +310,17 @@ class BlockManagerMasterEndpoint(
 ).map(_.flatten.toSeq)
   }
 
-  private def register(id: BlockManagerId, maxMemSize: Long, 
slaveEndpoint: RpcEndpointRef) {
+  private def register(dummyId: BlockManagerId,
+maxMemSize: Long,
--- End diff --

Can you also add a method doc saying this returns the same id with topology 
information attached?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426435
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala 
---
@@ -298,7 +310,17 @@ class BlockManagerMasterEndpoint(
 ).map(_.flatten.toSeq)
   }
 
-  private def register(id: BlockManagerId, maxMemSize: Long, 
slaveEndpoint: RpcEndpointRef) {
+  private def register(dummyId: BlockManagerId,
+maxMemSize: Long,
--- End diff --

also instead of dummyId, I'd call it "idWithoutTopologyInfo"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14426
  
**[Test build #64042 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64042/consoleFull)**
 for PR 14426 at commit 
[`d722be2`](https://github.com/apache/spark/commit/d722be2c1660b86eb1cc23cfa1dad33095c839b7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426412
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala 
---
@@ -298,7 +310,17 @@ class BlockManagerMasterEndpoint(
 ).map(_.flatten.toSeq)
   }
 
-  private def register(id: BlockManagerId, maxMemSize: Long, 
slaveEndpoint: RpcEndpointRef) {
+  private def register(dummyId: BlockManagerId,
+maxMemSize: Long,
--- End diff --

4 space indent, and put each argument on its own line, e.g.
```
private def register(
dummyId: BlockManagerId,
maxMemSize: Long,
slaveEndpoint: RpcEndpointRef): BlockManagerId = {
  ...
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426383
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala ---
@@ -50,12 +50,20 @@ class BlockManagerMaster(
 logInfo("Removal of executor " + execId + " requested")
   }
 
-  /** Register the BlockManager's id with the driver. */
+  /**
+   * Register the BlockManager's id with the driver. The input 
BlockManagerId does not contain
+   * topology information. This information is obtained from the master 
and we respond with an
+   * updated BlockManagerId fleshed out with this information.
+   */
   def registerBlockManager(
-  blockManagerId: BlockManagerId, maxMemSize: Long, slaveEndpoint: 
RpcEndpointRef): Unit = {
+blockManagerId: BlockManagerId,
--- End diff --

indent 4 spaces


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426300
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -160,8 +163,25 @@ private[spark] class BlockManager(
 blockTransferService.init(this)
 shuffleClient.init(appId)
 
-blockManagerId = BlockManagerId(
-  executorId, blockTransferService.hostName, blockTransferService.port)
+blockReplicationPrioritizer = {
+  val priorityClass = conf.get(
+"spark.replication.topologyawareness.prioritizer",
+"org.apache.spark.storage.DefaultBlockReplicationPrioritization")
+  val clazz = Utils.classForName(priorityClass)
+  val ret = 
clazz.newInstance.asInstanceOf[BlockReplicationPrioritization]
+  logInfo(s"Using $priorityClass for prioritizing peers")
--- End diff --

```
Using $priorityClass for block replication policy
```
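
For reference, a rough sketch of how a user could plug in a custom policy through the config key shown in the diff (the class name below is made up for illustration):
```scala
import org.apache.spark.SparkConf

// Hypothetical example: point the config key from the diff at a custom implementation.
val conf = new SparkConf()
  .set("spark.replication.topologyawareness.prioritizer",
    "com.example.RackAwareBlockReplicationPrioritization")
```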


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...

2016-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14639
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426231
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockReplicationPrioritization.scala
 ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.storage
+
+import scala.util.Random
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.internal.Logging
+
+/**
+ * ::DeveloperApi::
+ * BlockReplicationPrioritization provides logic for prioritizing a 
sequence of peers for
+ * replicating blocks. BlockManager will replicate to each peer returned 
in order until the
+ * desired replication order is reached. If a replication fails, 
prioritize() will be called
+ * again to get a fresh prioritization.
+ */
+@DeveloperApi
+trait BlockReplicationPrioritization {
+
+  /**
+   * Method to prioritize a bunch of candidate peers of a block
+   *
+   * @param blockManagerId Id of the current BlockManager for self 
identification
+   * @param peers A list of peers of a BlockManager
+   * @param peersReplicatedTo Set of peers already replicated to
+   * @param blockId BlockId of the block being replicated. This can be 
used as a source of
+   *randomness if needed.
+   * @return A prioritized list of peers. Lower the index of a peer, 
higher its priority
+   */
+  def prioritize(
+blockManagerId: BlockManagerId,
+peers: Seq[BlockManagerId],
--- End diff --

Also, rather than a full prioritization, can we pass in the number of replicas wanted and just return that many peers?
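
For illustration only, one shape that alternative could take (the `numReplicas` parameter and the names below are hypothetical, not from this patch):
```scala
import scala.util.Random

import org.apache.spark.storage.{BlockId, BlockManagerId}

object ReplicaSelectionSketch {
  // Hypothetical variant: the caller says how many replicas it still needs,
  // and the policy returns just that many peers.
  def selectTargets(
      blockManagerId: BlockManagerId,
      peers: Seq[BlockManagerId],
      peersReplicatedTo: Set[BlockManagerId],
      blockId: BlockId,
      numReplicas: Int): Seq[BlockManagerId] = {
    // Seed with the block id so retries for the same block see a stable order.
    val random = new Random(blockId.hashCode)
    // Skip peers already replicated to, shuffle the rest, and return only what is needed.
    random.shuffle(peers.filterNot(peersReplicatedTo.contains)).take(numReplicas)
  }
}
```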



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14116
  
**[Test build #64041 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64041/consoleFull)**
 for PR 14116 at commit 
[`bd85aa5`](https://github.com/apache/spark/commit/bd85aa545e1fcb7ff10c981ef940291092cfef80).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...

2016-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14639
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64039/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14639
  
**[Test build #64039 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64039/consoleFull)**
 for PR 14639 at commit 
[`31ada09`](https://github.com/apache/spark/commit/31ada09b55ca34a4a6fa150037025afb831df69d).
 * This patch **fails R style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426199
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockReplicationPrioritization.scala
 ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.storage
+
+import scala.util.Random
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.internal.Logging
+
+/**
+ * ::DeveloperApi::
+ * BlockReplicationPrioritization provides logic for prioritizing a 
sequence of peers for
+ * replicating blocks. BlockManager will replicate to each peer returned 
in order until the
+ * desired replication order is reached. If a replication fails, 
prioritize() will be called
+ * again to get a fresh prioritization.
+ */
+@DeveloperApi
+trait BlockReplicationPrioritization {
+
+  /**
+   * Method to prioritize a bunch of candidate peers of a block
+   *
+   * @param blockManagerId Id of the current BlockManager for self 
identification
+   * @param peers A list of peers of a BlockManager
+   * @param peersReplicatedTo Set of peers already replicated to
+   * @param blockId BlockId of the block being replicated. This can be 
used as a source of
+   *randomness if needed.
+   * @return A prioritized list of peers. Lower the index of a peer, 
higher its priority
+   */
+  def prioritize(
+blockManagerId: BlockManagerId,
+peers: Seq[BlockManagerId],
--- End diff --

is passing in all the peers a performance concern?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426189
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockReplicationPrioritization.scala
 ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.storage
+
+import scala.util.Random
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.internal.Logging
+
+/**
+ * ::DeveloperApi::
+ * BlockReplicationPrioritization provides logic for prioritizing a 
sequence of peers for
+ * replicating blocks. BlockManager will replicate to each peer returned 
in order until the
+ * desired replication order is reached. If a replication fails, 
prioritize() will be called
+ * again to get a fresh prioritization.
+ */
+@DeveloperApi
+trait BlockReplicationPrioritization {
+
+  /**
+   * Method to prioritize a bunch of candidate peers of a block
+   *
+   * @param blockManagerId Id of the current BlockManager for self 
identification
+   * @param peers A list of peers of a BlockManager
+   * @param peersReplicatedTo Set of peers already replicated to
+   * @param blockId BlockId of the block being replicated. This can be 
used as a source of
+   *randomness if needed.
+   * @return A prioritized list of peers. Lower the index of a peer, 
higher its priority
+   */
+  def prioritize(
+blockManagerId: BlockManagerId,
+peers: Seq[BlockManagerId],
+peersReplicatedTo: Set[BlockManagerId],
+blockId: BlockId): Seq[BlockManagerId]
+}
+
+@DeveloperApi
+class DefaultBlockReplicationPrioritization
--- End diff --

instead of Default, I'd call this RandomBlockReplicationPrioritization to 
better reflect what it does.
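
For illustration, roughly what the renamed class could look like against the trait quoted above, assuming the default policy simply shuffles the candidate peers (this body is a sketch, not the patch's code):
```scala
import scala.util.Random

import org.apache.spark.storage.{BlockId, BlockManagerId}

// Sketch only: a randomly shuffling policy under the suggested name.
// BlockReplicationPrioritization is the trait introduced by this patch.
class RandomBlockReplicationPrioritization extends BlockReplicationPrioritization {
  override def prioritize(
      blockManagerId: BlockManagerId,
      peers: Seq[BlockManagerId],
      peersReplicatedTo: Set[BlockManagerId],
      blockId: BlockId): Seq[BlockManagerId] = {
    // Seed with the block id so a retry for the same block sees a stable ordering.
    val random = new Random(blockId.hashCode)
    random.shuffle(peers)
  }
}
```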



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426147
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockReplicationPrioritization.scala
 ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.storage
+
+import scala.util.Random
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.internal.Logging
+
+/**
+ * ::DeveloperApi::
+ * BlockReplicationPrioritization provides logic for prioritizing a 
sequence of peers for
+ * replicating blocks. BlockManager will replicate to each peer returned 
in order until the
+ * desired replication order is reached. If a replication fails, 
prioritize() will be called
+ * again to get a fresh prioritization.
+ */
+@DeveloperApi
+trait BlockReplicationPrioritization {
+
+  /**
+   * Method to prioritize a bunch of candidate peers of a block
+   *
+   * @param blockManagerId Id of the current BlockManager for self 
identification
+   * @param peers A list of peers of a BlockManager
+   * @param peersReplicatedTo Set of peers already replicated to
+   * @param blockId BlockId of the block being replicated. This can be 
used as a source of
+   *randomness if needed.
+   * @return A prioritized list of peers. Lower the index of a peer, 
higher its priority
+   */
+  def prioritize(
+blockManagerId: BlockManagerId,
+peers: Seq[BlockManagerId],
+peersReplicatedTo: Set[BlockManagerId],
+blockId: BlockId): Seq[BlockManagerId]
+}
+
+@DeveloperApi
+class DefaultBlockReplicationPrioritization
+  extends BlockReplicationPrioritization
+  with Logging {
+
+  /**
+   * Method to prioritize a bunch of candidate peers of a block. This is a 
basic implementation,
+   * that just makes sure we put blocks on different hosts, if possible
+   *
+   * @param blockManagerId Id of the current BlockManager for self 
identification
+   * @param peers A list of peers of a BlockManager
+   * @param peersReplicatedTo Set of peers already replicated to
+   * @param blockId BlockId of the block being replicated. This can be 
used as a source of
+   *randomness if needed.
+   * @return A prioritized list of peers. Lower the index of a peer, 
higher its priority
+   */
+  override def prioritize(
+blockManagerId: BlockManagerId,
--- End diff --

so the Spark style for indentation is to use 4 spaces for function 
arguments, i.e.
```scala
override def prioritize(
    blockManagerId: BlockManagerId,
    peers: Seq[BlockManagerId],
    peersReplicatedTo: Set[BlockManagerId],
    blockId: BlockId): Seq[BlockManagerId] = {
  val random = new Random(blockId.hashCode)
  ...
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13152: [SPARK-15353] [CORE] Making peer selection for bl...

2016-08-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13152#discussion_r75426070
  
--- Diff: core/src/main/scala/org/apache/spark/storage/TopologyMapper.scala 
---
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.storage
+
+import java.io.{File, FileInputStream}
+import java.util.Properties
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.internal.Logging
+import org.apache.spark.SparkConf
+import org.apache.spark.util.Utils
+
+/**
+ * ::DeveloperApi::
+ * TopologyMapper provides topology information for a given host
+ * @param conf SparkConf to get required properties, if needed
+ */
+@DeveloperApi
+abstract class TopologyMapper(conf: SparkConf) {
+  /**
+   * Gets the topology information given the host name
+   *
+   * @param hostname Hostname
+   * @return topology information for the given hostname. One can use a 
'topology delimiter'
+   * to make this topology information nested.
+   * For example : ‘/myrack/myhost’, where ‘/’ is the 
topology delimiter,
+   * ‘myrack’ is the topology identifier, and ‘myhost’ is 
the individual host.
+   * This function only returns the topology information without 
the hostname.
--- End diff --

can you document what an empty string means?
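
For example, the scaladoc could spell it out along these lines (the method name and the exact meaning of the empty string are assumptions here, not taken from the patch):
```
/**
 * Gets the topology information given the host name
 *
 * @param hostname Hostname
 * @return topology information for the given hostname, e.g. "/myrack".
 *         An empty string means no topology information is known for the
 *         host, so it is treated as part of the default (flat) topology.
 */
def getTopologyForHost(hostname: String): String
```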



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make...

2016-08-18 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14118#discussion_r75426062
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -370,7 +370,8 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* from values being read should be skipped.
* `ignoreTrailingWhiteSpace` (default `false`): defines whether or 
not trailing
* whitespaces from values being read should be skipped.
-   * `nullValue` (default empty string): sets the string 
representation of a null value.
+   * `nullValue` (default empty string): sets the string 
representation of a null value. Since
--- End diff --

Oh thanks! Indeed there are two occurrences (one in `readwriter.py`, one in 
`streaming.py`) that need fixing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14639
  
**[Test build #64039 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64039/consoleFull)**
 for PR 14639 at commit 
[`31ada09`](https://github.com/apache/spark/commit/31ada09b55ca34a4a6fa150037025afb831df69d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14680: [SPARK-17101][SQL] Provide consistent format identifiers...

2016-08-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14680
  
@jaceklaskowski It seems the test 
[here](https://github.com/apache/spark/blob/e50efd53f073890d789a8448f850cc219cca7708/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala#L715-L724)
 is related to this change. It seems the test will pass if we change 
`TextFileFormat` to `TEXT`.

BTW, how about changing them to `Parquet` and `Text`? This might just be 
personal taste, but I feel `shortName.toUpperCase` is not always the right 
string representation for each data source.

I mean, if my understanding is correct, the proper name is `Parquet` rather 
than `PARQUET`, at least. `ORC`, `JSON` and `CSV` look correct because they are 
abbreviations, but it seems questionable for `PARQUET` and `TEXT`.

If the purpose of this change is only to show plan information to humans via 
`explain(...)`, it might be better to use string representations that are 
human readable and correctly spelled.

This is just my personal opinion. I think we need @rxin's sign-off here.
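
For what it's worth, a tiny illustration of the difference (the `shortName` value is the registered short name of the built-in Parquet source; which casing to display is the open question):
```scala
val shortName = "parquet"
shortName.toUpperCase // "PARQUET" -- what the PR currently prints
shortName.capitalize  // "Parquet" -- closer to the format's conventional spelling
```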




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14118
  
**[Test build #64040 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64040/consoleFull)**
 for PR 14118 at commit 
[`74b4dd8`](https://github.com/apache/spark/commit/74b4dd8ff2f79faaf9df50c5a54e6298234137e7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13152: [SPARK-15353] [CORE] Making peer selection for block rep...

2016-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13152
  
**[Test build #3225 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3225/consoleFull)**
 for PR 13152 at commit 
[`9b8ce32`](https://github.com/apache/spark/commit/9b8ce3229d0cff64e77d55563cec3cc3cda29182).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75425834
  
--- Diff: R/pkg/R/functions.R ---
@@ -362,8 +357,8 @@ setMethod("cov", signature(x = "characterOrColumn"),
 
 #' @rdname cov
 #'
-#' @param col1 First column to compute cov_samp.
-#' @param col2 Second column to compute cov_samp.
+#' @param col1 the first Column object.
+#' @param col2 the second Column object.
--- End diff --

I'd say that applies to a couple of other cases in WindowSpec.R or 
column.R too, but I'm OK either way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75425765
  
--- Diff: R/pkg/R/functions.R ---
@@ -362,8 +357,8 @@ setMethod("cov", signature(x = "characterOrColumn"),
 
 #' @rdname cov
 #'
-#' @param col1 First column to compute cov_samp.
-#' @param col2 Second column to compute cov_samp.
+#' @param col1 the first Column object.
+#' @param col2 the second Column object.
--- End diff --

I'd just say "the first Column", "the second Column` (without object) as in 
other places


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75425753
  
--- Diff: R/pkg/R/functions.R ---
@@ -319,7 +316,7 @@ setMethod("column",
 #'
 #' Computes the Pearson Correlation Coefficient for two Columns.
 #'
-#' @param x Column to compute on.
+#' @param col2 a (second) Column object.
--- End diff --

I'd just say "a (second) Column` (without object) as in other places


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/dupli...

2016-08-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14705#discussion_r75425616
  
--- Diff: R/pkg/R/functions.R ---
@@ -1273,12 +1271,15 @@ setMethod("round",
 #' bround
 #'
 #' Returns the value of the column `e` rounded to `scale` decimal places 
using HALF_EVEN rounding
-#' mode if `scale` >= 0 or at integral part when `scale` < 0.
+#' mode if `scale` >= 0 or at integer part when `scale` < 0.
 #' Also known as Gaussian rounding or bankers' rounding that rounds to the 
nearest even number.
 #' bround(2.5, 0) = 2, bround(3.5, 0) = 4.
 #'
 #' @param x Column to compute on.
-#'
+#' @param scale round to \code{scale} digits to the right of the decimal 
point when \code{scale} > 0,
+#'the nearest even number when \code{scale} = 0, and `scale` 
digits to the left
--- End diff --

do you want `\code{scale}` in place of the backticks here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14709: [SPARK-17150][SQL] Support SQL generation for inline tab...

2016-08-18 Thread petermaxlee
Github user petermaxlee commented on the issue:

https://github.com/apache/spark/pull/14709
  
cc @cloud-fan and @hvanhovell 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


