[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15292#discussion_r82332285 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -17,47 +17,130 @@ package org.apache.spark.sql.execution.datasources.jdbc +import java.sql.{Connection, DriverManager} +import java.util.Properties + /** * Options for the JDBC data source. */ class JDBCOptions( @transient private val parameters: Map[String, String]) extends Serializable { + import JDBCOptions._ + + def this(url: String, table: String, parameters: Map[String, String]) = { +this(parameters ++ Map("url" -> url, "dbtable" -> table)) + } + + val asProperties: Properties = { --- End diff -- I think the function name needs an update. How about `asConnectionProperties`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
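For readers outside the Scala code base, the shape under discussion is an options class whose auxiliary constructor folds `url` and `dbtable` back into the parameter map, plus the accessor the reviewer suggests renaming to `asConnectionProperties`. A loose, hypothetical Python analogue of the diff (the method bodies here are invented for illustration, not Spark's actual implementation):

```python
class JDBCOptions:
    """Options for a JDBC data source, backed by a plain string-to-string map."""

    def __init__(self, parameters):
        self.parameters = dict(parameters)

    @classmethod
    def with_url_and_table(cls, url, table, parameters):
        # Mirrors the auxiliary constructor: fold url/dbtable into the map.
        return cls({**parameters, "url": url, "dbtable": table})

    def as_connection_properties(self):
        # The accessor the reviewer proposes renaming: pass everything
        # except the Spark-facing keys through to the JDBC driver.
        return {k: v for k, v in self.parameters.items()
                if k not in ("url", "dbtable")}

opts = JDBCOptions.with_url_and_table(
    "jdbc:postgresql://localhost/test", "people", {"fetchsize": "100"})
print(opts.as_connection_properties())  # {'fetchsize': '100'}
```

The rename the reviewer asks for makes the intent of that last method explicit: it is the subset of options handed to the driver as connection properties, not the whole option map.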
[GitHub] spark pull request #15375: [SPARK-17790] Support for parallelizing R data.fr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15375#discussion_r82331701 --- Diff: R/pkg/R/context.R --- @@ -123,19 +126,48 @@ parallelize <- function(sc, coll, numSlices = 1) { if (numSlices > length(coll)) numSlices <- length(coll) + sizeLimit <- as.numeric(sparkR.conf( + "spark.r.maxAllocationLimit", + toString(.Machine$integer.max / 2) # Default to a safe default: 200MB + )) + objectSize <- object.size(coll) + + # For large objects we make sure the size of each slice is also smaller than sizeLimit + numSlices <- max(numSlices, ceiling(objectSize / sizeLimit)) + sliceLen <- ceiling(length(coll) / numSlices) slices <- split(coll, rep(1: (numSlices + 1), each = sliceLen)[1:length(coll)]) # Serialize each slice: obtain a list of raws, or a list of lists (slices) of # 2-tuples of raws serializedSlices <- lapply(slices, serialize, connection = NULL) - jrdd <- callJStatic("org.apache.spark.api.r.RRDD", - "createRDDFromArray", sc, serializedSlices) + # The PRC backend cannot handle arguments larger than 2GB (INT_MAX) + # If serialized data is safely less than that threshold we send it over the PRC channel. + # Otherwise, we write it to a file and send the file name + if (objectSize < sizeLimit) { +jrdd <- callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, serializedSlices) + } else { +fileName <- writeToTempFile(serializedSlices) +jrdd <- callJStatic( + "org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices)) +file.remove(fileName) --- End diff -- if the JVM call throws an exception, I don't think this line will execute, perhaps wrap this in tryCatch?
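The cleanup concern raised above is a generic resource-leak pattern: if the JVM call throws, `file.remove(fileName)` never runs. R's fix is `tryCatch(..., finally = ...)`; a minimal Python sketch of the same idea (the `backend_call` argument is a made-up stand-in for the real `callJStatic` invocation):

```python
import os
import tempfile

def parallelize_via_file(serialized, backend_call):
    # Write the serialized slices to a temp file, pass the file name to the
    # backend, and delete the file whether or not the call succeeds: the
    # finally block is the Python analogue of R's tryCatch(..., finally = ...).
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(serialized)
    try:
        return backend_call(path), path
    finally:
        os.remove(path)

result, path = parallelize_via_file(b"slices", lambda p: "jrdd")
print(result, os.path.exists(path))  # jrdd False
```

The file is removed on both the success and the exception path, which is exactly what the bare `file.remove(fileName)` after the call does not guarantee.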
[GitHub] spark issue #15246: [MINOR][SQL] Use resource path for test_script.sh
Github user weiqingy commented on the issue: https://github.com/apache/spark/pull/15246 Hi @srowen, all tests passed this time. Could you please review this PR again? Thanks.
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15375 Odd, this is the error from appveyor:
```
ontext: Fail to set Spark caller context
java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.apache.spark.util.CallerContext.setCurrentContext(Utils.scala:2485)
    at org.apache.spark.scheduler.Task.run(Task.scala:96)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/10/07 03:45:58 INFO Executor: Finished task 1.
```
[GitHub] spark issue #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.dir is r...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15382 I think the working dir makes more sense than the home dir, but could this catch people by surprise, since we now expect write permission in the working dir?
[GitHub] spark pull request #15375: [SPARK-17790] Support for parallelizing R data.fr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15375#discussion_r82331194 --- Diff: R/pkg/R/context.R --- @@ -126,13 +126,13 @@ parallelize <- function(sc, coll, numSlices = 1) { if (numSlices > length(coll)) numSlices <- length(coll) - sizeLimit <- .Machine$integer.max - 10240 # Safe margin bellow maximum allocation limit + sizeLimit <- as.numeric( --- End diff -- shouldn't this be `as.integer(`?
[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15292#discussion_r82330973 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -17,47 +17,130 @@ package org.apache.spark.sql.execution.datasources.jdbc +import java.sql.{Connection, DriverManager} +import java.util.Properties + /** * Options for the JDBC data source. */ class JDBCOptions( @transient private val parameters: Map[String, String]) extends Serializable { + import JDBCOptions._ + + def this(url: String, table: String, parameters: Map[String, String]) = { +this(parameters ++ Map("url" -> url, "dbtable" -> table)) --- End diff -- Change them to `JDBC_URL` and `JDBC_TABLE_NAME`?
[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/13690#discussion_r82330771 --- Diff: R/pkg/inst/tests/testthat/test_mllib.R --- @@ -791,4 +791,59 @@ test_that("spark.kstest", { expect_match(capture.output(stats)[1], "Kolmogorov-Smirnov test summary:") }) +test_that("spark.decisionTree Regression", { + data <- suppressWarnings(createDataFrame(longley)) + model <- spark.decisionTree(data, Employed~., "regression", maxDepth = 5, maxBins = 16) --- End diff -- could be more readable as `Employed ~ .` (with spaces)
[GitHub] spark issue #15246: [MINOR][SQL] Use resource path for test_script.sh
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15246 Merged build finished. Test PASSed.
[GitHub] spark issue #15246: [MINOR][SQL] Use resource path for test_script.sh
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15246 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66482/ Test PASSed.
[GitHub] spark issue #15246: [MINOR][SQL] Use resource path for test_script.sh
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15246 **[Test build #66482 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66482/consoleFull)** for PR 15246 at commit [`1233aa2`](https://github.com/apache/spark/commit/1233aa25d751b94a610f6ac052411596cb0df10d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15361 @kxepal Sure, thanks for confirming!
[GitHub] spark issue #15351: [SPARK-17612][SQL][branch-2.0] Support `DESCRIBE table P...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15351 Merging to 2.0. @dongjoon-hyun can you close this?
[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/13690#discussion_r82330422 --- Diff: R/pkg/R/mllib.R --- @@ -117,7 +132,7 @@ NULL #' @export #' @seealso \link{spark.glm}, \link{glm}, #' @seealso \link{spark.als}, \link{spark.gaussianMixture}, \link{spark.isoreg}, \link{spark.kmeans}, -#' @seealso \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg} +#' @seealso \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg}, \link{spark.decisionTree} --- End diff -- same here
[GitHub] spark issue #15351: [SPARK-17612][SQL][branch-2.0] Support `DESCRIBE table P...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15351 @dongjoon-hyun it LGTM. It is just a rather big patch to backport, for something that is not a bug fix. But I'll merge it.
[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15361 @HyukjinKwon Oh, great news! It seems it was me who backported this patch to 2.0.0 incorrectly. I'm sorry for the false alarm then; oddly, I wasn't able to test it with master. I'll give it one more try today, but so far it looks like you solved the problem \o/ Thank you!
[GitHub] spark issue #13690: [SPARK-15767][R][ML] Decision Tree Regression wrapper in...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/13690 could you fix the test failure?
```
Duplicated \argument entries in documentation object 'spark.decisionTree': 'newData' '...' 'object' '...' 'x'
```
[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15388 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66481/ Test PASSed.
[GitHub] spark issue #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning Result...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15389 Merged build finished. Test FAILed.
[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15388 Merged build finished. Test PASSed.
[GitHub] spark issue #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning Result...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15389 **[Test build #66485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66485/consoleFull)** for PR 15389 at commit [`be8c509`](https://github.com/apache/spark/commit/be8c509a14506817cce500e845064a2ca7edcc23). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15292#discussion_r82329856 --- Diff: docs/sql-programming-guide.md --- @@ -1014,16 +1014,31 @@ bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9. {% endhighlight %} Tables from the remote database can be loaded as a DataFrame or Spark SQL Temporary table using -the Data Sources API. The following options are supported: +the Data Sources API. The following case-sensitive options are supported: Property NameMeaning url - The JDBC URL to connect to. + The JDBC URL to connect to. It might contain user and password information. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret --- End diff -- Sure, that sounds cleaner and more correct.
[GitHub] spark issue #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning Result...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15389 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66485/ Test FAILed.
[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15388 **[Test build #66481 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66481/consoleFull)** for PR 15388 at commit [`7e25355`](https://github.com/apache/spark/commit/7e2535554d5a0661490b74ff4422798d98063214). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning Result...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15389 **[Test build #66485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66485/consoleFull)** for PR 15389 at commit [`be8c509`](https://github.com/apache/spark/commit/be8c509a14506817cce500e845064a2ca7edcc23).
[GitHub] spark pull request #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/15389 [SPARK-17817][PySpark] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes

## What changes were proposed in this pull request?

Quoted from the JIRA description: Calling repartition on a PySpark RDD to increase the number of partitions results in highly skewed partition sizes, with most having 0 rows. The repartition method should evenly spread out the rows across the partitions, and this behavior is correctly seen on the Scala side. Please reference the following code for a reproducible example of this issue:

    num_partitions = 2
    a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
    l = a.repartition(num_partitions).glom().map(len).collect()  # get length of each partition
    min(l), max(l), sum(l)/len(l), len(l)  # skewed!

In Scala's `repartition` code, we distribute elements evenly across output partitions. However, the RDD coming from Python is serialized as a single binary blob, so the even distribution fails. We need to convert the Python RDD to Java objects before repartitioning.

## How was this patch tested?

Jenkins tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 pyspark-rdd-repartition

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15389.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15389

commit be8c509a14506817cce500e845064a2ca7edcc23
Author: Liang-Chi Hsieh
Date: 2016-10-07T04:59:37Z

    Fix pyspark.rdd repartition.
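The skew described in the PR can be reproduced without Spark: when whole serialized batches, rather than individual rows, are the unit being dealt out round-robin, most output partitions end up empty. A small illustrative sketch (plain Python, not Spark's actual partitioner code):

```python
def distribute(units, num_partitions):
    # Deal scheduling units out round-robin; partition sizes then differ
    # by at most one unit.
    parts = [[] for _ in range(num_partitions)]
    for i, u in enumerate(units):
        parts[i % num_partitions].append(u)
    return parts

rows = list(range(1000))

# Scala side: each row is its own unit, so the spread is even.
even = [len(p) for p in distribute(rows, 4)]

# PySpark before the fix: rows from one input partition travel as a
# single serialized batch, so whole batches are the distribution unit.
batches = [rows[:500], rows[500:]]  # 2 input partitions
skewed = [sum(len(b) for b in p) for p in distribute(batches, 4)]

print(even, skewed)  # [250, 250, 250, 250] [500, 500, 0, 0]
```

With only two opaque batches to deal out, two of the four output partitions necessarily receive nothing, which matches the "most having 0 rows" symptom in the JIRA.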
[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15292#discussion_r82329644 --- Diff: docs/sql-programming-guide.md --- @@ -1014,16 +1014,31 @@ bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9. {% endhighlight %} Tables from the remote database can be loaded as a DataFrame or Spark SQL Temporary table using -the Data Sources API. The following options are supported: +the Data Sources API. The following case-sensitive options are supported: Property NameMeaning url - The JDBC URL to connect to. + The JDBC URL to connect to. It might contain user and password information. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret + + +user + + The user to connect as. --- End diff -- Sorry, after rethinking it, I think we do not need `user` and `password` here.
[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15292#discussion_r82329571 --- Diff: docs/sql-programming-guide.md --- @@ -1014,16 +1014,31 @@ bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9. {% endhighlight %} Tables from the remote database can be loaded as a DataFrame or Spark SQL Temporary table using -the Data Sources API. The following options are supported: +the Data Sources API. The following case-sensitive options are supported: Property NameMeaning url - The JDBC URL to connect to. + The JDBC URL to connect to. It might contain user and password information. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret --- End diff -- How about this change? _The source-specific connection properties may be specified in the URL. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret_
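The wording gatorsmile proposes hinges on the fact that source-specific connection properties can ride along in the JDBC URL's query string. As a hypothetical illustration of what such a URL carries, using only standard-library parsing (this is not how Spark itself splits the URL; the example URL is the standard PostgreSQL form from the docs discussion):

```python
from urllib.parse import urlparse, parse_qs

# A JDBC URL carrying source-specific connection properties in its query string.
url = "jdbc:postgresql://localhost/test?user=fred&password=secret"

# Strip the "jdbc:" prefix so urlparse sees an ordinary URL.
parsed = urlparse(url[len("jdbc:"):])
props = {k: v[0] for k, v in parse_qs(parsed.query).items()}
print(props)  # {'user': 'fred', 'password': 'secret'}
```

This is why the later comment in the thread concludes the doc table does not need separate `user` and `password` rows: the driver already accepts them through the URL.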
[GitHub] spark issue #15218: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15218 Btw, taking a step back, I am not sure this will work as you expect it to. Other than a few TaskSets - those without locality information - the schedule is going to be highly biased towards the locality information supplied. This typically means PROCESS_LOCAL (almost always) and then NODE_LOCAL - which means exactly matching the executor or host (irrespective of the order we traverse the task list). The shuffle of offers we do serves a specific set of purposes - spreading load when there is no locality information (not very common imo), or spreading it across the cluster when the locality information is of lower quality - like from an InputFormat, or for shuffle when we are using heuristics which might not be optimal. But since I have not looked at this in a while, I will CC Kay. +CC @kayousterhout pls do take a look in case I am missing something.
[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15292#discussion_r82329341 --- Diff: docs/sql-programming-guide.md --- @@ -1014,16 +1014,31 @@ bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9. {% endhighlight %} Tables from the remote database can be loaded as a DataFrame or Spark SQL Temporary table using -the Data Sources API. The following options are supported: +the Data Sources API. The following case-sensitive options are supported: Property NameMeaning url - The JDBC URL to connect to. + The JDBC URL to connect to. It might contain user and password information. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret --- End diff -- Actually, this is not accurate. Let me think about it.
[GitHub] spark pull request #15364: [SPARK-17792][ML] L-BFGS solver for linear regres...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15364
[GitHub] spark issue #15364: [SPARK-17792][ML] L-BFGS solver for linear regression do...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/15364 Merged into master, thanks!
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Merged build finished. Test FAILed.
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66472/ Test FAILed.
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15375 **[Test build #66472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66472/consoleFull)** for PR 15375 at commit [`4aab6cf`](https://github.com/apache/spark/commit/4aab6cf4d6e2f05c1e893cbc6d05fcc1763ea0f4). * This patch **fails from timeout after a configured wait of `250m`**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15381: [SPARK-17707] [WEBUI] Web UI prevents spark-submi...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/15381#discussion_r82326116 --- Diff: sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpCLIService.java --- @@ -90,8 +95,21 @@ public void run() { Arrays.toString(sslContextFactory.getExcludeProtocols())); sslContextFactory.setKeyStorePath(keyStorePath); sslContextFactory.setKeyStorePassword(keyStorePassword); -connector = new ServerConnector(httpServer, sslContextFactory); +connectionFactories = AbstractConnectionFactory.getFactories( --- End diff -- This will expose both http and https, and it's a behavior change. Right?
[GitHub] spark issue #15354: [SPARK-17764][SQL] Add `to_json` supporting to convert n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15354 **[Test build #66484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66484/consoleFull)** for PR 15354 at commit [`5f9fa29`](https://github.com/apache/spark/commit/5f9fa29a44b9f33cd90633e470d3dff2516499a9).
[GitHub] spark issue #14531: [SPARK-17353] [SPARK-16943] [SPARK-16942] [SQL] Fix mult...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14531 @sitalkedia Yeah, I saw it. Thank you for the investigation. Normally, we do not want to add many configuration flags; it hurts usability. Let @rxin make a decision on whether we should add another flag or not.
[GitHub] spark issue #15367: [SPARK-17346][SQL][test-maven]Add Kafka source for Struc...
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/15367 No, if we backport this I would plan to continue to backport changes (that are safe) until the next release. Either way this should not affect what goes into master.
[GitHub] spark pull request #15385: [DO NOT MERGE]Try to reproduce DirectKafkaStreamS...
Github user zsxwing closed the pull request at: https://github.com/apache/spark/pull/15385
[GitHub] spark issue #15367: [SPARK-17346][SQL][test-maven]Add Kafka source for Struc...
Github user koeninger commented on the issue: https://github.com/apache/spark/pull/15367 Does backporting reduce the likelihood of change if user feedback indicates we got it wrong? My technical concerns were largely addressed; that's my big remaining organizational concern.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15307 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66483/ Test FAILed.
[GitHub] spark issue #15385: [DO NOT MERGE]Try to reproduce DirectKafkaStreamSuite fa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15385 Merged build finished. Test FAILed.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15307 Merged build finished. Test FAILed.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15307 **[Test build #66483 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66483/consoleFull)** for PR 15307 at commit [`8537783`](https://github.com/apache/spark/commit/8537783abc495156d3f356e378d260c9222f2c46). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15385: [DO NOT MERGE]Try to reproduce DirectKafkaStreamSuite fa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15385 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66470/ Test FAILed.
[GitHub] spark issue #15385: [DO NOT MERGE]Try to reproduce DirectKafkaStreamSuite fa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15385 **[Test build #66470 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66470/consoleFull)** for PR 15385 at commit [`0fc2da9`](https://github.com/apache/spark/commit/0fc2da9e7d35f645d8564d85389ff74f264d3d00). * This patch **fails from timeout after a configured wait of `250m`**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15366 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66480/ Test PASSed.
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15366 Merged build finished. Test PASSed.
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15366 **[Test build #66480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66480/consoleFull)** for PR 15366 at commit [`c1d2b2b`](https://github.com/apache/spark/commit/c1d2b2bd1e1a12791a180f1b753ca082c97df31c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15387: [SPARK-17782][STREAMING][KAFKA] eliminate race condition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15387 Merged build finished. Test PASSed.
[GitHub] spark issue #15387: [SPARK-17782][STREAMING][KAFKA] eliminate race condition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15387 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66479/ Test PASSed.
[GitHub] spark issue #15387: [SPARK-17782][STREAMING][KAFKA] eliminate race condition...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15387 **[Test build #66479 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66479/consoleFull)** for PR 15387 at commit [`aca55de`](https://github.com/apache/spark/commit/aca55de0624f5634acb04f91636dce79af875fab). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15387: [SPARK-17782][STREAMING][KAFKA] eliminate race condition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15387 Merged build finished. Test PASSed.
[GitHub] spark issue #15387: [SPARK-17782][STREAMING][KAFKA] eliminate race condition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15387 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66477/ Test PASSed.
[GitHub] spark issue #15387: [SPARK-17782][STREAMING][KAFKA] eliminate race condition...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15387 **[Test build #66477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66477/consoleFull)** for PR 15387 at commit [`1fc5863`](https://github.com/apache/spark/commit/1fc5863db88cac9dfd0be09318c4ca8779a51682). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15307 **[Test build #66483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66483/consoleFull)** for PR 15307 at commit [`8537783`](https://github.com/apache/spark/commit/8537783abc495156d3f356e378d260c9222f2c46).
[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets
Github user squito commented on the issue: https://github.com/apache/spark/pull/15249 @mridulm we had considered that approach earlier on as well -- I don't think it works because you can also have resources which are not totally broken, but are flaky for a long period of time. The simplest example is one bad disk out of many; some tasks may succeed though a bunch will fail. I've seen users hit this. But it could be even more nuanced, e.g. a bad sector, a flaky network connection, etc. In those cases, it is intentional that in this implementation, one success does *not* un-blacklist anything.
[GitHub] spark issue #15246: [MINOR][SQL] Use resource path for test_script.sh
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15246 **[Test build #66482 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66482/consoleFull)** for PR 15246 at commit [`1233aa2`](https://github.com/apache/spark/commit/1233aa25d751b94a610f6ac052411596cb0df10d).
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15307 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66478/ Test FAILed.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15307 Merged build finished. Test FAILed.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15307 **[Test build #66478 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66478/consoleFull)** for PR 15307 at commit [`10d1c24`](https://github.com/apache/spark/commit/10d1c243a71d464ada33db269a30ad0e4dff3ced). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15367: [SPARK-17346][SQL][test-maven]Add Kafka source for Struc...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/15367 @marmbrus @zsxwing I agree it's experimental and we should have more flexibility here with backports. I also very much agree that structured streaming in its current state on 2.0 isn't usable - but I'm not super sure that backporting fixes is the best way to address this. Honestly I spend most of my time focused on Python & ML (and I've only really been looking at structured streaming with those two hats on). I'm really cautious about the idea of a 2k+ line backport which hasn't even been released otherwise, but I don't have any specific objections to the changes - it's just making me nervous. The fact that what's being backported seems to still be under development is also concerning, since doing this backport now puts us in a position of backporting more (not yet merged into mainline) fixes. Of course - if the people with the most experience in this area all agree (and most of y'all [ @marmbrus @zsxwing @tdas but maybe missing @koeninger ] seem to already be on this PR, so I'll leave you to it) that this backport is reasonable, that is great - it would probably be good to follow up on the original backport mailing list thread and update the wiki as well.
[GitHub] spark issue #15332: [SPARK-10364][SQL] Support Parquet logical type TIMESTAM...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15332 LGTM. See if @davies @liancheng have other comments about this.
[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15388 **[Test build #66481 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66481/consoleFull)** for PR 15388 at commit [`7e25355`](https://github.com/apache/spark/commit/7e2535554d5a0661490b74ff4422798d98063214).
[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15388 cc @hvanhovell @cloud-fan
[GitHub] spark pull request #15388: [SPARK-17821][SQL] Support And and Or in Expressi...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/15388 [SPARK-17821][SQL] Support And and Or in Expression Canonicalize

## What changes were proposed in this pull request?
Currently the `Canonicalize` object doesn't support `And` and `Or`, so we can't compare the canonicalized forms of such predicates consistently. We should add the support.

## How was this patch tested?
Jenkins tests.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 canonicalize-and-or Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15388.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15388 commit 7e2535554d5a0661490b74ff4422798d98063214 Author: Liang-Chi Hsieh Date: 2016-10-07T02:54:34Z Support And and Or in Canonicalize.
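A minimal sketch of the idea the PR above describes, in Python rather than Spark's Scala (the function and the tuple encoding here are ours, not Spark's actual `Canonicalize`): for commutative operators like `And`/`Or`, recursively sorting operands into a deterministic order makes semantically equal predicates canonicalize to the same form.

```python
# Sketch only - not Spark's implementation. Expressions are nested tuples
# like ("and", left, right); leaves are strings. Sorting the operands of
# commutative operators yields an order-insensitive canonical form.

def canonicalize(expr):
    if isinstance(expr, tuple):
        op, *operands = expr
        operands = [canonicalize(o) for o in operands]
        if op in ("and", "or"):  # commutative: fix a deterministic operand order
            operands.sort(key=repr)
        return (op, *operands)
    return expr

# Predicates that differ only in operand order canonicalize identically:
assert canonicalize(("and", "x > 1", "y < 2")) == canonicalize(("and", "y < 2", "x > 1"))
assert canonicalize(("or", ("and", "a", "b"), "c")) == canonicalize(("or", "c", ("and", "b", "a")))
```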
[GitHub] spark issue #15370: [SPARK-17417][Core] Fix # of partitions for Reliable RDD...
Github user dhruve commented on the issue: https://github.com/apache/spark/pull/15370 If we assume a file name of the form "part-[0-9]+"
* Case 1: *Entire RDD* => Verification of the file name while reconstructing would be satisfied, as we read all the checkpointed part files.
* Case 2: *Specific Partition* => While trying to reconstruct a specific partition, this information would be insufficient to locate the actual part file. See `getPreferredLocations`. Should the filename be hdfs:////.../part-1 or part-01 or part-...1?

Also, with the NumberFormat impl, files continue to be named with 5 digits by default. Only when you exceed 5 digits does it start using 6 digits, 7 digits and so on. This takes care of the old format as well, and handles the case of exceeding the current limit.
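The NumberFormat naming behavior described above can be sketched as follows (a hedged illustration - the helper name `part_file_name` is ours, not a Spark API; `%05d` mirrors a `NumberFormat` configured with a minimum of 5 integer digits): names keep 5 digits up to partition 99999 and widen automatically beyond that.

```python
# Sketch of zero-padded checkpoint part-file naming. Helper name is
# illustrative, not Spark's.

def part_file_name(partition_id: int) -> str:
    # %05d pads to at least 5 digits and widens automatically past 99999,
    # like NumberFormat with setMinimumIntegerDigits(5).
    return "part-%05d" % partition_id

print(part_file_name(7))        # part-00007
print(part_file_name(123456))   # part-123456
```

Because every id below 100000 gets the same width, a specific partition's file can be located directly, and lexicographic order matches numeric order within that range.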
[GitHub] spark pull request #15354: [SPARK-17764][SQL] Add `to_json` supporting to co...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r82322230 --- Diff: python/pyspark/sql/functions.py --- @@ -1729,6 +1729,29 @@ def from_json(col, schema, options={}): return Column(jc) +@ignore_unicode_prefix +@since(2.1) +def to_json(col, options={}): +""" +Converts a column containing a [[StructType]] into a JSON string. Returns `null`, +in the case of an unsupported type. + +:param col: struct column +:param options: options to control converting. accepts the same options as the json datasource + +>>> from pyspark.sql import Row +>>> from pyspark.sql.types import * +>>> data = [(1, Row(name='Alice', age=2))] +>>> df = spark.createDataFrame(data, ("key", "value")) +>>> df.select(to_json(df.value).alias("json")).collect() +[Row(json=u'{"age":2,"name":"Alice"}')] +""" + +sc = SparkContext._active_spark_context +jc = sc._jvm.functions.to_json(_to_java_column(col), options) --- End diff -- actually nvm my original comment, the more I look at this file the less it seems the pattern is overly consistent and this same pattern is done elsewhere within the file.
[GitHub] spark pull request #15379: [SPARK-17805][PYSPARK] Fix in sqlContext.read.tex...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15379#discussion_r82321597 --- Diff: python/pyspark/sql/readwriter.py --- @@ -289,8 +289,8 @@ def text(self, paths): [Row(value=u'hello'), Row(value=u'this')] """ if isinstance(paths, basestring): -path = [paths] -return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path))) +paths = [paths] +return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths))) --- End diff -- So I agree keeping path here kind of makes sense. It's unfortunate we didn't catch the difference in the named parameter between these reader functions back during 2.0. At this point, changing the named parameter from paths to path needs some care in case people are using named params (if we did that, we would need to add a version-changed note and be careful). We could also have it (transitionally) accept kwargs and work with either for a version (while updating the pydoc, of course).
[GitHub] spark pull request #15246: [MINOR][SQL] Use resource path for test_script.sh
Github user weiqingy commented on a diff in the pull request: https://github.com/apache/spark/pull/15246#discussion_r82321891 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala --- @@ -66,13 +67,14 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { import spark.implicits._ test("script") { +val scriptFilePath = getPath("test_script.sh") if (testCommandAvailable("bash") && testCommandAvailable("echo | sed")) { val df = Seq(("x1", "y1", "z1"), ("x2", "y2", "z2")).toDF("c1", "c2", "c3") df.createOrReplaceTempView("script_table") val query1 = sql( -""" +s""" --- End diff -- Yes. Good catch. There are some odd corner cases for `s""" """`, but it should be OK here.
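For reference, the main corner cases of `s"""` versus plain triple quotes can be shown in a standalone sketch (unrelated to the test itself): a literal `$` must be doubled, and the `s` interpolator also processes escape sequences that plain triple-quoted strings leave alone.

```scala
val name = "test_script.sh"

// $expr is interpolated; a literal dollar sign must be written as $$:
val s1 = s"""script=$name price=$$5"""
println(s1)  // script=test_script.sh price=$5

// Plain triple quotes keep \t as two characters (backslash, t)...
val raw = """a\tb"""
// ...but the s interpolator processes escapes, producing a real tab:
val interpolated = s"""a\tb"""
println(raw.length)           // 4
println(interpolated.length)  // 3
```

Neither case bites here since the generated SQL contains no stray `$` or escape sequences, which is why the switch from `"""` to `s"""` is safe in this test.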
[GitHub] spark issue #15386: [SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/15386 Thanks for working on this - the pylint script found a style problem (PEP8 checks failed. ./python/pyspark/sql/tests.py:1709:54: E231 missing whitespace after ',') - if you want to test the style locally first you can use ./dev/lint-python
[GitHub] spark pull request #15218: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/15218#discussion_r82321008 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala --- @@ -0,0 +1,151 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable.ArrayBuffer +import scala.collection.mutable.PriorityQueue +import scala.util.Random + +import org.apache.spark.SparkConf + +case class OfferState(workOffer: WorkerOffer, var cores: Int) { + // Build a list of tasks to assign to each worker. + val tasks = new ArrayBuffer[TaskDescription](cores) +} + +abstract class TaskAssigner(conf: SparkConf) { + var offer: Seq[OfferState] = _ + val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1) + + // The final assigned offer returned to TaskScheduler. + def tasks(): Seq[ArrayBuffer[TaskDescription]] = offer.map(_.tasks) + + // construct the assigner by the workoffer. + def construct(workOffer: Seq[WorkerOffer]): Unit = { +offer = workOffer.map(o => OfferState(o, o.cores)) + } + + // Invoked in each round of Taskset assignment to initialize the internal structure. 
+ def init(): Unit + + // Indicating whether there is offer available to be used by one round of Taskset assignment. + def hasNext(): Boolean + + // Next available offer returned to one round of Taskset assignment. + def getNext(): OfferState + + // Called by the TaskScheduler to indicate whether the current offer is accepted + // In order to decide whether the current is valid for the next offering. + def taskAssigned(assigned: Boolean): Unit + + // Release internally maintained resources. Subclass is responsible to + // release its own private resources. + def reset: Unit = { +offer = null + } +} + +class RoundRobinAssigner(conf: SparkConf) extends TaskAssigner(conf) { + var i = 0 + override def construct(workOffer: Seq[WorkerOffer]): Unit = { +offer = Random.shuffle(workOffer.map(o => OfferState(o, o.cores))) + } + override def init(): Unit = { +i = 0 + } + override def hasNext: Boolean = { +i < offer.size + } + override def getNext(): OfferState = { +offer(i) + } + override def taskAssigned(assigned: Boolean): Unit = { +i += 1 + } + override def reset: Unit = { +super.reset +i = 0 + } +} + +class BalancedAssigner(conf: SparkConf) extends TaskAssigner(conf) { --- End diff -- @mridulm Thanks for the comments. But I am lost here. My understanding is that, ordering-wise, x is equal to y if x.cores == y.cores. This ordering is used by the priority queue to construct the data structure. Following is an example from the Ordering trait: PersonA will be equal to PersonB if they are the same age. Do I miss anything?

    import scala.util.Sorting

    case class Person(name: String, age: Int)
    val people = Array(Person("bob", 30), Person("ann", 32), Person("carl", 19))

    // sort by age
    object AgeOrdering extends Ordering[Person] {
      def compare(a: Person, b: Person) = a.age compare b.age
    }
    Sorting.quickSort(people)(AgeOrdering)
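That reading of the equality semantics can be confirmed with a standalone snippet, using the same toy `Person` class as the Ordering scaladoc (`age` playing the role of `cores`):

```scala
import scala.collection.mutable.PriorityQueue

case class Person(name: String, age: Int)

// Compare only by age: two people with the same age are ties under
// this ordering, regardless of name.
val byAge: Ordering[Person] = Ordering.by[Person, Int](_.age)
println(byAge.compare(Person("ann", 30), Person("bob", 30)))  // 0, i.e. "equal"

// A priority queue built on this ordering surfaces the max first and
// makes no promise about the relative order of same-age elements.
val pq = PriorityQueue(Person("bob", 30), Person("ann", 32), Person("carl", 19))(byAge)
println(pq.head.age)  // 32
```

So elements that compare equal under the ordering are interchangeable as far as the priority queue is concerned; the queue only guarantees which age comes out first, not which name.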
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Merged build finished. Test FAILed.
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66467/ Test FAILed.
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15375 **[Test build #66467 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66467/consoleFull)** for PR 15375 at commit [`8e065c1`](https://github.com/apache/spark/commit/8e065c100389bd5e89f02ffb43319bb2089a44c5). * This patch **fails from timeout after a configured wait of `250m`**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15365: [SPARK-17157][SPARKR]: Add multiclass logistic regressio...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15365 @felixcheung I fixed the CRAN errors. It is ready for review now. Thanks!
[GitHub] spark issue #11601: [SPARK-13568] [ML] Create feature transformer to impute ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11601 Merged build finished. Test PASSed.
[GitHub] spark issue #11601: [SPARK-13568] [ML] Create feature transformer to impute ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11601 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66476/ Test PASSed.
[GitHub] spark issue #11601: [SPARK-13568] [ML] Create feature transformer to impute ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11601 **[Test build #66476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66476/consoleFull)** for PR 11601 at commit [`91d4cee`](https://github.com/apache/spark/commit/91d4cee75a150ad2335dba0838c47cb4f0505ad8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15329: [SPARK-17763][SQL] JacksonParser silently parses null as...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15329 Hi @yhuai and @cloud-fan , I recall the code changes here were reviewed by you both. Would you mind reviewing this, please?
[GitHub] spark pull request #15246: [MINOR][SQL] Use resource path for test_script.sh
Github user weiqingy commented on a diff in the pull request: https://github.com/apache/spark/pull/15246#discussion_r82317624 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala --- @@ -17,6 +17,7 @@ package org.apache.spark.sql.hive.execution +import java.io.File --- End diff -- No. Will delete it.
[GitHub] spark pull request #15246: [MINOR][SQL] Use resource path for test_script.sh
Github user weiqingy commented on a diff in the pull request: https://github.com/apache/spark/pull/15246#discussion_r82317563 --- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala --- @@ -41,6 +43,15 @@ abstract class SparkFunSuite } } + // helper function + protected final def getFile(file: String): File = { --- End diff -- @srowen getTestResourceFile and getTestResourcePath look better. Thanks. The URL class doesn't have a method like getCanonicalFile; it only has getFile. Also, I tested Paths.get(... toURI).toFile. The only difference I noticed is that it keeps spaces as-is, but getFile(file).getCanonicalPath converts spaces to "%20". I suppose they are both OK.
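The space-vs-%20 difference is reproducible without any test resources at all; the file path below is hypothetical and never touched on disk:

```scala
import java.net.URL
import java.nio.file.Paths

// Resource URLs percent-encode spaces:
val url = new URL("file:///tmp/test%20dir/a.txt")

// URL.getFile hands back the still-encoded form...
println(url.getFile)  // /tmp/test%20dir/a.txt

// ...whereas going through a URI decodes it back to a real space:
println(Paths.get(url.toURI).toFile.getPath)  // /tmp/test dir/a.txt
```

This is why a `Paths.get(url.toURI).toFile`-style helper yields paths that work for resources whose directory names contain spaces, while the raw `getFile` string may not.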
[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15249 Thinking more, and based on what @squito mentioned, I was considering the following: since we are primarily dealing with executors or nodes which are 'bad', as opposed to recoverable failures due to resource contention, prevention of degenerate corner cases which the existing blacklist is for, etc.: Can we assume a successful task execution on a node implies a healthy node? What about at the executor level? The proposal is to keep the PR as is for the most part, but: - Clear nodeToExecsWithFailures when a task on a node succeeds. Same for nodeToBlacklistedTaskIndexes. - Not sure if we want to reset execToFailures for an executor (not clearing would imply we are handling the resource starvation case implicitly imo). - If possible, allow speculative tasks to be scheduled on blacklisted nodes/executors if countTowardsTaskFailures can be overridden to false in those cases (if not, ignore this, since it will add towards the number of failures per app). The rationale behind this is that successful tasks indicate past failures were not indicative of bad nodes/executors, but rather transient failures. And speculative tasks also sort of work as probe tasks to determine whether the node/executor has recovered and is healthy. I hope I am not missing anything - any thoughts @squito, @kayousterhout, @tgravescs ?
[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15361 Hi @kxepal , I just tested (copied and pasted) the code below:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Spark Hive Example").enableHiveSupport().getOrCreate()
import spark.implicits._

val sv = org.apache.spark.mllib.linalg.Vectors.sparse(7, Array(0, 42), Array(-127, 128))
val df = Seq(("thing", sv)).toDF("thing", "vector")
df.write.format("orc").save("/tmp/thing.orc")
```

and it seems fine with the current master branch. Do you mind if I verify this again when it is hopefully backported to branch-2.0?
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15366 **[Test build #66480 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66480/consoleFull)** for PR 15366 at commit [`c1d2b2b`](https://github.com/apache/spark/commit/c1d2b2bd1e1a12791a180f1b753ca082c97df31c).
[GitHub] spark issue #15387: [SPARK-17782][STREAMING][KAFKA] eliminate race condition...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15387 **[Test build #66479 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66479/consoleFull)** for PR 15387 at commit [`aca55de`](https://github.com/apache/spark/commit/aca55de0624f5634acb04f91636dce79af875fab).
[GitHub] spark issue #15355: [SPARK-17782][STREAMING] Disable Kafka 010 pattern based...
Github user koeninger commented on the issue: https://github.com/apache/spark/pull/15355 @zsxwing good eye, thanks. It's not that `auto.offset.reset` = earliest doesn't work; it's that there's a potential race condition where poll gets called twice slowly enough for the consumer position to be modified before topicpartitions are paused. https://github.com/apache/spark/pull/15387 should address that. It's something that whoever works on the duplicated equivalent code in the structured streaming module will have to address as well.
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/15366 Jenkins, retest this please
[GitHub] spark issue #15387: [SPARK-17782][STREAMING][KAFKA] eliminate race condition...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15387 **[Test build #66477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66477/consoleFull)** for PR 15387 at commit [`1fc5863`](https://github.com/apache/spark/commit/1fc5863db88cac9dfd0be09318c4ca8779a51682).
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15307 **[Test build #66478 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66478/consoleFull)** for PR 15307 at commit [`10d1c24`](https://github.com/apache/spark/commit/10d1c243a71d464ada33db269a30ad0e4dff3ced).
[GitHub] spark issue #15379: [SPARK-17805][PYSPARK] Fix in sqlContext.read.text when ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15379 +1 for this PR.
[GitHub] spark pull request #15387: [SPARK-17782][STREAMING][KAFKA] eliminate race co...
GitHub user koeninger opened a pull request: https://github.com/apache/spark/pull/15387 [SPARK-17782][STREAMING][KAFKA] eliminate race condition of poll twice ## What changes were proposed in this pull request? Kafka consumers can't subscribe or maintain a heartbeat without polling, but polling ordinarily consumes messages and adjusts position. We don't want this on the driver, so we poll with a timeout of 0 and pause all topicpartitions. Some consumer strategies that seek to particular positions have to poll first, but they weren't pausing immediately thereafter. Thus, there was a race condition where the second poll() in the DStream start method might actually adjust consumer position. Eliminated (or at least drastically reduced the chance of) the race condition by pausing in the relevant consumer strategies, and asserting on startup that no messages were consumed. ## How was this patch tested? I reliably reproduced the intermittent test failure by inserting a Thread.sleep directly before returning from SubscribePattern. The suggested fix eliminated the failure. You can merge this pull request into a Git repository by running: $ git pull https://github.com/koeninger/spark-1 SPARK-17782 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15387.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15387 commit 1fc5863db88cac9dfd0be09318c4ca8779a51682 Author: cody koeninger Date: 2016-10-07T01:08:01Z [SPARK-17782][STREAMING][KAFKA] eliminate race condition of poll being called twice and moving position
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15366 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66469/ Test FAILed.
[GitHub] spark pull request #15379: [SPARK-17805][PYSPARK] Fix in sqlContext.read.tex...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15379#discussion_r82315908 --- Diff: python/pyspark/sql/readwriter.py --- @@ -289,8 +289,8 @@ def text(self, paths): [Row(value=u'hello'), Row(value=u'this')] """ if isinstance(paths, basestring): -path = [paths] -return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path))) +paths = [paths] +return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths))) --- End diff -- This is super minor, but I think it'd be nicer to match the variable name to `path` if that makes sense. For parquet, it takes non-keyword arguments so `paths` seems right, but the others seem to take a single argument.
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15366 Merged build finished. Test FAILed.
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15366 **[Test build #66469 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66469/consoleFull)** for PR 15366 at commit [`c1d2b2b`](https://github.com/apache/spark/commit/c1d2b2bd1e1a12791a180f1b753ca082c97df31c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15307 Merged build finished. Test FAILed.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15307 **[Test build #66475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66475/consoleFull)** for PR 15307 at commit [`2918525`](https://github.com/apache/spark/commit/29185254d325834c40bd63a543317950b2794b30). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15307 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66475/ Test FAILed.