[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18320 Does it fail by running just gapply and nothing else? From what you have found in your investigation and the code you pointed to, I suspect this isn't limited to gapply; I think this PR only works around the problem. I am concerned that a user could also run into this issue. A naive approach might be to change `spark.sparkr.use.daemon` inside gapply when it is called, but I suspect that only shifts the problem around, and it might then fail with other methods that shuffle or call UDFs. If a long-running daemon process is the problem, either we find and fix the leak (close the pipe, socket, etc.) or we put a count on the number of executions and recycle the daemon process periodically before the leak becomes fatal. Thoughts? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
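The "put a count on the number of executions and recycle the daemon" idea above can be sketched as follows. This is a hypothetical illustration in Java, not SparkR's actual code: the real daemon is an R process managed over a socket, and `Daemon`, `factory`, and `maxExecutions` are names invented here.

```java
// Hypothetical sketch of periodically recycling a long-running daemon so that
// a slow resource leak (pipes, sockets) never accumulates enough to be fatal.
public class RecyclingDaemonPool {
  public interface Daemon { void shutdown(); }

  private final java.util.function.Supplier<Daemon> factory;
  private final int maxExecutions;
  private Daemon current;
  private int executions = 0;

  public RecyclingDaemonPool(java.util.function.Supplier<Daemon> factory, int maxExecutions) {
    this.factory = factory;
    this.maxExecutions = maxExecutions;
  }

  /** Returns a daemon, replacing it after maxExecutions uses. */
  public Daemon acquire() {
    if (current == null || executions >= maxExecutions) {
      if (current != null) {
        current.shutdown();  // killing the process releases anything it leaked
      }
      current = factory.get();
      executions = 0;
    }
    executions++;
    return current;
  }
}
```

The trade-off is paying a periodic restart cost instead of finding the leak itself, which is why the comment above frames recycling as the fallback option.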
[GitHub] spark issue #17451: [SPARK-19866][ML][PySpark] Add local version of Word2Vec...
Github user keypointt commented on the issue: https://github.com/apache/spark/pull/17451 no worries Holden, totally understood. Thank you for the input and I'll try it out 👍
[GitHub] spark issue #18231: [SPARK-20994] Remove redundant characters in OpenBlocks ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18231 **[Test build #78157 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78157/testReport)** for PR 18231 at commit [`5b0ce67`](https://github.com/apache/spark/commit/5b0ce674fb3070c6749f9caf8cbbbeabb702ce01).
[GitHub] spark pull request #18231: [SPARK-20994] Remove redundant characters in Open...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18231#discussion_r122367121 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java --- @@ -209,4 +190,51 @@ private ShuffleMetrics() { } } + private class ManagedBufferIterator implements Iterator { + +private int index = 0; +private final String appId; +private final String execId; +private final int shuffleId; +// An array containing mapId and reduceId pairs. +private final int[] mapIdAndReduceIds; + +ManagedBufferIterator(String appId, String execId, String[] blockIds) { + this.appId = appId; + this.execId = execId; + String[] blockId0Parts = blockIds[0].split("_"); + if (blockId0Parts.length < 4 || !blockId0Parts[0].equals("shuffle")) { --- End diff -- Sure.
[GitHub] spark pull request #18231: [SPARK-20994] Remove redundant characters in Open...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18231#discussion_r122366821 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java --- @@ -209,4 +190,51 @@ private ShuffleMetrics() { } } + private class ManagedBufferIterator implements Iterator { + +private int index = 0; +private final String appId; +private final String execId; +private final int shuffleId; +// An array containing mapId and reduceId pairs. +private final int[] mapIdAndReduceIds; + +ManagedBufferIterator(String appId, String execId, String[] blockIds) { + this.appId = appId; + this.execId = execId; + String[] blockId0Parts = blockIds[0].split("_"); + if (blockId0Parts.length < 4 || !blockId0Parts[0].equals("shuffle")) { --- End diff -- use `blockId0Parts.length != 4`?
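The review suggestion above — the strict `blockId0Parts.length != 4` check — can be illustrated standalone. This is a hedged sketch, not the PR's actual code: a shuffle block id has the form `shuffle_<shuffleId>_<mapId>_<reduceId>`, so a well-formed id splits into exactly four parts.

```java
// Hypothetical sketch: strict validation of a shuffle block id, using the
// reviewer's stricter "exactly 4 parts" check rather than "at least 4".
public class ShuffleBlockIdCheck {
  /** Parses "shuffle_<shuffleId>_<mapId>_<reduceId>" into {shuffleId, mapId, reduceId}. */
  static int[] parse(String blockId) {
    String[] parts = blockId.split("_");
    if (parts.length != 4) {
      throw new IllegalArgumentException("Unexpected block id format: " + blockId);
    }
    if (!parts[0].equals("shuffle")) {
      throw new IllegalArgumentException("Expected shuffle block id, got: " + blockId);
    }
    return new int[] {
      Integer.parseInt(parts[1]),  // shuffleId
      Integer.parseInt(parts[2]),  // mapId
      Integer.parseInt(parts[3])   // reduceId
    };
  }
}
```

With `!= 4`, an id such as `shuffle_1_2_3_extra` is rejected up front instead of silently ignoring the trailing part.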
[GitHub] spark issue #18284: [SPARK-21072][SQL] TreeNode.mapChildren should only appl...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/18284 thanks everyone for reviewing.
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17702 **[Test build #78156 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78156/testReport)** for PR 17702 at commit [`a3a3509`](https://github.com/apache/spark/commit/a3a3509ca72a57d9df97e6ce50c16c1b40acfbb9).
[GitHub] spark issue #18239: [SPARK-19462] fix bug in Exchange--pass in a tmp "newPar...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18239 @cloud-fan Thanks a lot for the reply. Yes, I'm also hesitant to backport to branch-1.6; but I think this bug is too obvious -- with `spark.sql.adaptive.enabled=true`, any rerun of a `ShuffleMapStage` will fail.
[GitHub] spark pull request #18321: [SPARK-12552][FOLLOWUP] Fix flaky test for "o.a.s...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18321#discussion_r122365295 --- Diff: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala --- @@ -214,7 +214,7 @@ class MasterSuite extends SparkFunSuite master.rpcEnv.setupEndpoint(Master.ENDPOINT_NAME, master) // Wait until Master recover from checkpoint data. eventually(timeout(5 seconds), interval(100 milliseconds)) { -master.idToApp.size should be(1) +master.workers.size should be(1) --- End diff -- yes, that's right.
[GitHub] spark issue #18268: [SPARK-21054] [SQL] Reset Command support reset specific...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/18268 Hive supports resetting multiple keys, e.g. `reset config1 config2`; should we also support that?
[GitHub] spark issue #18231: [SPARK-20994] Remove redundant characters in OpenBlocks ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18231 **[Test build #78155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78155/testReport)** for PR 18231 at commit [`2592ef4`](https://github.com/apache/spark/commit/2592ef40e16382e80072b4d51273120443aef3fa).
[GitHub] spark issue #18231: [SPARK-20994] Remove redundant characters in OpenBlocks ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18231 @cloud-fan Thanks a lot for taking the time to review this. I refined it accordingly :)
[GitHub] spark pull request #17702: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/17702#discussion_r122364493 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -389,6 +389,23 @@ case class DataSource( } /** + * Return all paths represented by the wildcard string. + */ + private def getGlobbedPaths(qualified: Path): Seq[Path] = { --- End diff -- You are right. I'll fix this and also limit the maximum parallelism in the next patch, reusing the config in `InMemoryFileIndex.bulkListLeafFiles`.
[GitHub] spark pull request #18319: [SPARK-21114] [TEST] [2.1] Fix test failure in Sp...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/18319
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18320 @felixcheung, BTW, is this okay as a standalone PR as-is?
[GitHub] spark pull request #18318: [SPARK-21112] [SQL] ALTER TABLE SET TBLPROPERTIES...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18318#discussion_r122363898 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala --- @@ -235,7 +235,7 @@ case class AlterTableSetPropertiesCommand( // direct property. val newTable = table.copy( properties = table.properties ++ properties, - comment = properties.get("comment")) + comment = properties.get("comment").orElse(table.comment)) --- End diff -- alter table src set tblproperties ('foo' = 'bar', 'comment' = 'table_comment'); alter table src unset tblproperties ('foo'); We will lose the comment in this case.
[GitHub] spark pull request #18162: [SPARK-20923] turn tracking of TaskMetrics._updat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18162#discussion_r122363701 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala --- @@ -528,7 +528,13 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging { new StageUIData }) val taskData = stageData.taskData.get(taskId) - val metrics = TaskMetrics.fromAccumulatorInfos(accumUpdates) + val accumsFiltered = if (conf.get(TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES)) { +accumUpdates + } else { +accumUpdates.filter(info => info.name.isDefined && info.update.isDefined && info.name != --- End diff -- To be more clear, I think we should just do an assert here to make sure there are no UPDATED_BLOCK_STATUSES accumulator updates, instead of doing a filter.
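The reviewer's point — fail loudly if an update that should be impossible arrives, rather than silently filtering it out — can be sketched like this. The class and method names here are hypothetical; the accumulator name matches Spark's internal-metrics naming convention but is used purely for illustration.

```java
import java.util.List;

// Hypothetical sketch: with block-status tracking disabled, updates for the
// updatedBlockStatuses accumulator should never arrive, so we assert their
// absence (throwing on violation) instead of filtering them out.
public class AccumCheck {
  static final String UPDATED_BLOCK_STATUSES = "internal.metrics.updatedBlockStatuses";

  static void assertNoBlockStatusUpdates(List<String> accumNames) {
    if (accumNames.stream().anyMatch(UPDATED_BLOCK_STATUSES::equals)) {
      throw new IllegalStateException(
          "Unexpected UPDATED_BLOCK_STATUSES accumulator update while tracking is disabled");
    }
  }
}
```

The design point is that a filter hides a contract violation, while an assertion surfaces the bug at its source.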
[GitHub] spark issue #18303: [SPARK-19824][Core] Update JsonProtocol to keep consiste...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18303 LGTM
[GitHub] spark issue #18285: [SPARK-20338][CORE]Spaces in spark.eventLog.dir are not ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18285 Merged build finished. Test PASSed.
[GitHub] spark issue #18285: [SPARK-20338][CORE]Spaces in spark.eventLog.dir are not ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18285 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78144/ Test PASSed.
[GitHub] spark issue #18285: [SPARK-20338][CORE]Spaces in spark.eventLog.dir are not ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18285 **[Test build #78144 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78144/testReport)** for PR 18285 at commit [`536a445`](https://github.com/apache/spark/commit/536a4456637cc3b1db0445c61f4192520d27a9ef). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18303: [SPARK-19824][Core] Update JsonProtocol to keep consiste...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18303 **[Test build #78154 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78154/testReport)** for PR 18303 at commit [`244bbae`](https://github.com/apache/spark/commit/244bbae71c2aa0b9f173ad7ac16ad0440eaab99c).
[GitHub] spark issue #18303: [SPARK-19824][Core] Update JsonProtocol to keep consiste...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18303 retest this please
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18320 I suspect this is an issue in R itself. I will raise this issue in the R community soon and share what I find.
[GitHub] spark issue #18239: [SPARK-19462] fix bug in Exchange--pass in a tmp "newPar...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18239 I can hardly remember the Spark 1.6 code and I'm not sure when the next release of the 1.6 branch will be. BTW this bug can be worked around by turning off `spark.sql.adaptive.enabled`; do we really want to spend time on it?
[GitHub] spark issue #17758: [SPARK-20460][SQL] Make it more consistent to handle col...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/17758 @wzhfy Applied. Could you check again?
[GitHub] spark pull request #18025: [SPARK-20889][SparkR] Grouped documentation for A...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/18025#discussion_r122362550 --- Diff: R/pkg/R/generics.R --- @@ -919,10 +920,9 @@ setGeneric("array_contains", function(x, value) { standardGeneric("array_contain #' @export setGeneric("ascii", function(x) { standardGeneric("ascii") }) -#' @param x Column to compute on or a GroupedData object. --- End diff -- yes, that's one of the code-gen methods that doesn't actually have documentation (which is a problem) but somehow inherits one from base::, so the CRAN check doesn't complain about it
[GitHub] spark issue #18231: [SPARK-20994] Remove redundant characters in OpenBlocks ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18231 LGTM except for some minor comments
[GitHub] spark pull request #18231: [SPARK-20994] Remove redundant characters in Open...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18231#discussion_r122362155 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java --- @@ -209,4 +190,51 @@ private ShuffleMetrics() { } } + private class ManagedBufferIterator implements Iterator { + +private int index = 0; +private final String appId; +private final String execId; +private final int shuffleId; +// An array containing mapId and reduceId pairs. +private final int[] mapIdAndReduceIds; + +ManagedBufferIterator(String appId, String execId, String[] blockIds) { + this.appId = appId; + this.execId = execId; + String[] blockId0Parts = blockIds[0].split("_"); + if (blockId0Parts.length < 4) { +throw new IllegalArgumentException("Unexpected block id format: " + blockIds[0]); + } + if (!blockId0Parts[0].equals("shuffle")) { +throw new IllegalArgumentException("Expected shuffle block id, got: " + blockIds[0]); + } + this.shuffleId = Integer.parseInt(blockId0Parts[1]); + mapIdAndReduceIds = new int[2 * blockIds.length]; + for (int i = 0; i < blockIds.length; i++) { +String[] blockIdParts = blockIds[i].split("_"); +if (Integer.parseInt(blockIdParts[1]) != shuffleId) { + throw new IllegalArgumentException("Expected shuffleId=" + shuffleId + +", got:" + blockIds[i]); +} +mapIdAndReduceIds[2 * i] = Integer.parseInt(blockIdParts[2]); +mapIdAndReduceIds[2 * i + 1] = Integer.parseInt(blockIdParts[3]); + } +} + +@Override +public boolean hasNext() { + return index < mapIdAndReduceIds.length / 2; --- End diff -- nit: we can keep a `pos`, and increase it by 2 in `next`, so here we can just write `pos < mapIdAndReduceIds.length` to save a division.
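The `pos` nit above can be shown in isolation. This is a hedged standalone sketch, not the PR's `ManagedBufferIterator`: it keeps only the flat mapId/reduceId pair array and the position bookkeeping, and the class name is invented for illustration.

```java
import java.util.Iterator;

// Hypothetical sketch of the reviewer's suggestion: advance a position into
// the flat {mapId, reduceId, mapId, reduceId, ...} array by 2 per next(),
// so hasNext() compares against the array length without a division.
public class PairIdIterator implements Iterator<int[]> {
  private final int[] mapIdAndReduceIds;
  private int pos = 0;

  public PairIdIterator(int[] mapIdAndReduceIds) {
    this.mapIdAndReduceIds = mapIdAndReduceIds;
  }

  @Override
  public boolean hasNext() {
    return pos < mapIdAndReduceIds.length;  // no "/ 2" needed
  }

  @Override
  public int[] next() {
    int[] pair = { mapIdAndReduceIds[pos], mapIdAndReduceIds[pos + 1] };
    pos += 2;  // one (mapId, reduceId) pair consumed
    return pair;
  }
}
```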
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18025 **[Test build #78153 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78153/testReport)** for PR 18025 at commit [`19d063c`](https://github.com/apache/spark/commit/19d063c6995fa6bd780830a941f6b1f7c45c1bac). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18025 Merged build finished. Test FAILed.
[GitHub] spark pull request #18231: [SPARK-20994] Remove redundant characters in Open...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18231#discussion_r122361985 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java --- @@ -209,4 +190,51 @@ private ShuffleMetrics() { } } + private class ManagedBufferIterator implements Iterator { + +private int index = 0; +private final String appId; +private final String execId; +private final int shuffleId; +// An array containing mapId and reduceId pairs. +private final int[] mapIdAndReduceIds; + +ManagedBufferIterator(String appId, String execId, String[] blockIds) { + this.appId = appId; + this.execId = execId; + String[] blockId0Parts = blockIds[0].split("_"); + if (blockId0Parts.length < 4) { +throw new IllegalArgumentException("Unexpected block id format: " + blockIds[0]); + } + if (!blockId0Parts[0].equals("shuffle")) { +throw new IllegalArgumentException("Expected shuffle block id, got: " + blockIds[0]); + } + this.shuffleId = Integer.parseInt(blockId0Parts[1]); + mapIdAndReduceIds = new int[2 * blockIds.length]; + for (int i = 0; i < blockIds.length; i++) { +String[] blockIdParts = blockIds[i].split("_"); --- End diff -- shall we check `blockIdParts[0].equals("shuffle")`?
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18025 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78153/ Test FAILed.
[GitHub] spark pull request #18231: [SPARK-20994] Remove redundant characters in Open...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18231#discussion_r122361955 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java --- @@ -209,4 +190,51 @@ private ShuffleMetrics() { } } + private class ManagedBufferIterator implements Iterator { + +private int index = 0; +private final String appId; +private final String execId; +private final int shuffleId; +// An array containing mapId and reduceId pairs. +private final int[] mapIdAndReduceIds; + +ManagedBufferIterator(String appId, String execId, String[] blockIds) { + this.appId = appId; + this.execId = execId; + String[] blockId0Parts = blockIds[0].split("_"); + if (blockId0Parts.length < 4) { --- End diff -- shall we be more strict and use `blockId0Parts.length != 4`?
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18025 Merged build finished. Test FAILed.
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18320 Yes, I guess it would pass if we reduce `spark.sql.shuffle.partitions` (though I didn't look carefully or test this either). Just to make sure (and to share what I investigated), from my read of the code:

With `spark.sparkr.use.daemon` enabled, for each task execution: 1. JVM starts the R daemon (if not already started). 2. JVM sends a port --> R daemon forks with that port --> R worker. This path appears to be exercised on all OSes except Windows.

With `spark.sparkr.use.daemon` disabled, for each task execution: 1. JVM forks processes directly from Java (expensive) --> R worker. This path appears to be exercised only on Windows.

This PR proposes to switch to the latter case (which was the behavior before the daemon was introduced) by avoiding the (already running from other executions) R daemon. I am fine with giving a shot at reducing the number of partitions if you prefer that.
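The two execution paths described here (long-lived daemon forking short-lived workers vs. spawning a process directly per task) can be sketched in miniature with plain Python, `os.fork` standing in for R's `mcfork`. This is POSIX-only and entirely illustrative — names and the exit-status result channel are stand-ins, not Spark's actual code:

```python
import os

def run_udf(task_id):
    # Hypothetical stand-in for the R worker body (the real daemon runs
    # SparkR UDF code after forking); compute something deterministic.
    return (task_id * 2) % 256

def daemon_loop(task_ids):
    # A long-lived "daemon" that forks one short-lived worker per task,
    # the way daemon.R forks an R worker for each port the JVM sends.
    # The child must exit immediately and the parent must reap it:
    # otherwise children and descriptors leak across many executions,
    # which is the kind of accumulation this thread is worried about.
    results = []
    for task_id in task_ids:
        pid = os.fork()
        if pid == 0:
            os._exit(run_udf(task_id))  # child: do the work, then vanish
        _, status = os.waitpid(pid, 0)  # parent: reap to avoid zombies
        results.append(os.WEXITSTATUS(status))
    return results
```

The alternative path (daemon disabled) corresponds to launching a fresh interpreter per task instead of forking from a warm parent — simpler, but much more expensive per task.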
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18025 **[Test build #78152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78152/testReport)** for PR 18025 at commit [`0a7f5fc`](https://github.com/apache/spark/commit/0a7f5fcac2e0295d92b82d8909c4f1b11c82f016). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18025 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78152/ Test FAILed.
[GitHub] spark issue #18319: [SPARK-21114] [TEST] [2.1] Fix test failure in Spark 2.1...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18319 thanks, merging to 2.1
[GitHub] spark pull request #18268: [SPARK-21054] [SQL] Reset Command support reset s...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18268#discussion_r122361086 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala --- @@ -301,4 +301,10 @@ class SparkSqlParserSuite extends PlanTest { "SELECT a || b || c FROM t", Project(UnresolvedAlias(concat) :: Nil, UnresolvedRelation(TableIdentifier("t" } + + test("reset") { +assertEqual("reset", ResetCommand(None)) +assertEqual("reset spark.test.property", ResetCommand(Some("spark.test.property"))) +assertEqual("reset #$a!", ResetCommand(Some("#$a!"))) --- End diff -- can we check Hive's behavior? I think special chars are not allowed in config names, and the parser should throw an exception for this case.
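The review point — that `reset #$a!` should be a parse error rather than a `ResetCommand(Some("#$a!"))` — amounts to validating the key against a config-name grammar. A hedged Python sketch under an assumed key pattern (the regex here is illustrative, mirroring key shapes like `spark.sql.shuffle.partitions`; the real parser is ANTLR-based Scala):

```python
import re

# Assumed grammar for a config key: dot-separated identifier segments.
CONF_KEY = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z0-9_-]+)*$")

def parse_reset(statement):
    """Parse 'RESET' or 'RESET <key>', rejecting illegal key characters."""
    parts = statement.strip().split(None, 1)
    if parts[0].lower() != "reset":
        raise ValueError("not a RESET statement")
    if len(parts) == 1:
        return None  # bare RESET: clear all properties
    key = parts[1]
    if not CONF_KEY.match(key):
        raise ValueError(f"illegal config key: {key!r}")
    return key
```

Under this grammar, the third test case in the diff would throw instead of round-tripping the garbage key.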
[GitHub] spark issue #18319: [SPARK-21114] [TEST] [2.1] Fix test failure in Spark 2.1...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18319 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78143/ Test PASSed.
[GitHub] spark issue #18319: [SPARK-21114] [TEST] [2.1] Fix test failure in Spark 2.1...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18319 Merged build finished. Test PASSed.
[GitHub] spark issue #18319: [SPARK-21114] [TEST] [2.1] Fix test failure in Spark 2.1...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18319 **[Test build #78143 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78143/testReport)** for PR 18319 at commit [`367e8e5`](https://github.com/apache/spark/commit/367e8e526e1f9b631765626b43767dcc16a037e6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #18284: [SPARK-21072][SQL] TreeNode.mapChildren should on...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18284
[GitHub] spark issue #18284: [SPARK-21072][SQL] TreeNode.mapChildren should only appl...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18284 thanks, merging to master/2.2/2.1!
[GitHub] spark issue #18321: [SPARK-12552][FOLLOWUP] Fix flaky test for "o.a.s.deploy...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18321 LGTM
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18320 that's very interesting. that code has been around for 2 years - to be honest I'm not 100% sure about what it is doing. perhaps this could also be fixed with a lower number of partitions?
[GitHub] spark issue #18318: [SPARK-21112] [SQL] ALTER TABLE SET TBLPROPERTIES should...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18318 Only the master branch has such an issue. Thanks!
[GitHub] spark pull request #18318: [SPARK-21112] [SQL] ALTER TABLE SET TBLPROPERTIES...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18318#discussion_r122359957 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala --- @@ -235,7 +235,7 @@ case class AlterTableSetPropertiesCommand( // direct property. val newTable = table.copy( properties = table.properties ++ properties, - comment = properties.get("comment")) + comment = properties.get("comment").orElse(table.comment)) --- End diff -- Could you show an example?
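The one-line fix quoted in this diff, `properties.get("comment").orElse(table.comment)`, makes ALTER TABLE SET TBLPROPERTIES preserve an existing table comment unless the statement explicitly sets a new one. A hedged Python sketch of the same merge semantics (dicts standing in for the Scala catalog types; names illustrative):

```python
def alter_table_set_properties(table, properties):
    """Merge new properties into a table, keeping the old comment
    unless the new properties override it (Option.orElse semantics)."""
    merged = {**table["properties"], **properties}
    # dict.get with a default mirrors Scala's Option#orElse here:
    # only a genuinely present "comment" key replaces the old comment.
    comment = properties.get("comment", table.get("comment"))
    return {**table, "properties": merged, "comment": comment}
```

Before the fix, the equivalent of `comment = properties.get("comment")` would wipe the comment on any property change that didn't mention `comment` at all — which is presumably the example being asked for.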
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user leifwalsh commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r122359928 --- Diff: python/pyspark/sql/dataframe.py --- @@ -1648,8 +1650,30 @@ def toPandas(self): 02 Alice 15Bob """ -import pandas as pd -return pd.DataFrame.from_records(self.collect(), columns=self.columns) +if self.sql_ctx.getConf("spark.sql.execution.arrow.enable", "false").lower() == "true": +try: +import pyarrow +tables = self._collectAsArrow() +table = pyarrow.concat_tables(tables) --- End diff -- If tables is an empty list (e.g. if you load a dataset, filter the whole thing, and produce zero rows), `pyarrow.concat_tables` raises an exception rather than producing an empty table. This should probably be fixed in arrow (cc @wesm) but we should be defensive here. Probably should try to produce a `DataFrame` with the right schema but no rows if possible.
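The defensive shape the reviewer asks for — short-circuit an empty table list to an empty DataFrame that still carries the schema — might look like the sketch below. This is a hedged illustration, not the PR's actual code: `arrow_tables_to_pandas` is a hypothetical helper, and the table objects are assumed only to expose a `to_pandas()` method.

```python
import pandas as pd

def arrow_tables_to_pandas(tables, columns):
    # Defensive guard: concatenating an empty list of Arrow tables
    # raises (the behavior flagged in the review), so short-circuit
    # to a zero-row frame that still has the right columns.
    if not tables:
        return pd.DataFrame(columns=columns)
    return pd.concat((t.to_pandas() for t in tables), ignore_index=True)
```

This keeps `df.filter(...).toPandas()` well-defined even when every row is filtered out.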
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18025 **[Test build #78149 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78149/testReport)** for PR 18025 at commit [`014b9f3`](https://github.com/apache/spark/commit/014b9f3069a6e2075cb8be307c5d74081dabe15a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18025 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78149/ Test PASSed.
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18025 Merged build finished. Test PASSed.
[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15821 yea I think it's fine to keep `ArrowPayload`
[GitHub] spark issue #18283: [TEST][SPARKR][CORE] Fix broken SparkSubmitSuite
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18283 @shaneknapp right - this script (install-dev.sh) has been assuming it can find `jar` without checking for JAVA_HOME, so I was saying it could be improved that way; but yea this script hasn't been changed for years also.. let me know if it recurs - I could just fix it by checking JAVA_HOME.
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r122359743 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala --- @@ -0,0 +1,423 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.execution.arrow + +import java.io.ByteArrayOutputStream +import java.nio.channels.Channels + +import scala.collection.JavaConverters._ + +import io.netty.buffer.ArrowBuf +import org.apache.arrow.memory.{BufferAllocator, RootAllocator} +import org.apache.arrow.vector._ +import org.apache.arrow.vector.BaseValueVector.BaseMutator +import org.apache.arrow.vector.file._ +import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch} +import org.apache.arrow.vector.types.FloatingPointPrecision +import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema} +import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types._ +import org.apache.spark.util.Utils + + +/** + * Store Arrow data in a form that can be serialized by Spark. 
+ */ +private[sql] class ArrowPayload(payload: Array[Byte]) extends Serializable { + + /** + * Create an ArrowPayload from an ArrowRecordBatch and Spark schema. + */ + def this(batch: ArrowRecordBatch, schema: StructType, allocator: BufferAllocator) = { +this(ArrowConverters.batchToByteArray(batch, schema, allocator)) --- End diff -- sounds good
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r122359727 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala --- @@ -0,0 +1,1218 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.spark.sql.execution.arrow + +import java.io.File +import java.nio.charset.StandardCharsets +import java.sql.{Date, Timestamp} +import java.text.SimpleDateFormat +import java.util.Locale + +import com.google.common.io.Files +import org.apache.arrow.memory.RootAllocator +import org.apache.arrow.vector.{VectorLoader, VectorSchemaRoot} +import org.apache.arrow.vector.file.json.JsonFileReader +import org.apache.arrow.vector.util.Validator +import org.scalatest.BeforeAndAfterAll + +import org.apache.spark.SparkException +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.test.SharedSQLContext +import org.apache.spark.sql.types.{BinaryType, StructField, StructType} +import org.apache.spark.util.Utils + + +class ArrowConvertersSuite extends SharedSQLContext with BeforeAndAfterAll { + import testImplicits._ + + private var tempDataPath: String = _ + + override def beforeAll(): Unit = { +super.beforeAll() +tempDataPath = Utils.createTempDir(namePrefix = "arrow").getAbsolutePath + } + + test("collect to arrow record batch") { +val indexData = (1 to 6).toDF("i") +val arrowPayloads = indexData.toArrowPayload.collect() +assert(arrowPayloads.nonEmpty) +assert(arrowPayloads.length == indexData.rdd.getNumPartitions) +val allocator = new RootAllocator(Long.MaxValue) +val arrowRecordBatches = arrowPayloads.map(_.loadBatch(allocator)) +val rowCount = arrowRecordBatches.map(_.getLength).sum +assert(rowCount === indexData.count()) +arrowRecordBatches.foreach(batch => assert(batch.getNodes.size() > 0)) +arrowRecordBatches.foreach(_.close()) +allocator.close() + } + + test("short conversion") { +val json = + s""" + |{ + | "schema" : { + |"fields" : [ { + | "name" : "a_s", + | "type" : { + |"name" : "int", + |"isSigned" : true, + |"bitWidth" : 16 + | }, + | "nullable" : false, + | "children" : [ ], + | "typeLayout" : { + |"vectors" : [ { + | "type" : "VALIDITY", + | "typeBitWidth" : 1 + |}, { + | "type" : "DATA", + | "typeBitWidth" : 16 + |} 
] + | } + |}, { + | "name" : "b_s", + | "type" : { + |"name" : "int", + |"isSigned" : true, + |"bitWidth" : 16 + | }, + | "nullable" : true, + | "children" : [ ], + | "typeLayout" : { + |"vectors" : [ { + | "type" : "VALIDITY", + | "typeBitWidth" : 1 + |}, { + | "type" : "DATA", + | "typeBitWidth" : 16 + |} ] + | } + |} ] + | }, + | "batches" : [ { + |"count" : 6, + |"columns" : [ { + | "name" : "a_s", + | "count" : 6, + | "VALIDITY" : [ 1, 1, 1, 1, 1, 1 ], + | "DATA" : [ 1, -1, 2, -2, 32767, -32768 ] + |}, { + | "name" : "b_s", + | "count" : 6, + | "VALIDITY" : [ 1, 0, 0, 1, 0, 1 ], + | "DATA" : [ 1, 0, 0, -2, 0, -32768 ] + |} ] +
[GitHub] spark issue #18249: [SPARK-19937] Collect metrics for remote bytes read to d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18249 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78140/ Test PASSed.
[GitHub] spark issue #18249: [SPARK-19937] Collect metrics for remote bytes read to d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18249 Merged build finished. Test PASSed.
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r122359492 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala --- @@ -0,0 +1,423 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.execution.arrow + +import java.io.ByteArrayOutputStream +import java.nio.channels.Channels + +import scala.collection.JavaConverters._ + +import io.netty.buffer.ArrowBuf +import org.apache.arrow.memory.{BufferAllocator, RootAllocator} +import org.apache.arrow.vector._ +import org.apache.arrow.vector.BaseValueVector.BaseMutator +import org.apache.arrow.vector.file._ +import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch} +import org.apache.arrow.vector.types.FloatingPointPrecision +import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema} +import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types._ +import org.apache.spark.util.Utils + + +/** + * Store Arrow data in a form that can be serialized by Spark. 
+ */ +private[sql] class ArrowPayload(payload: Array[Byte]) extends Serializable { + + /** + * Create an ArrowPayload from an ArrowRecordBatch and Spark schema. + */ + def this(batch: ArrowRecordBatch, schema: StructType, allocator: BufferAllocator) = { +this(ArrowConverters.batchToByteArray(batch, schema, allocator)) + } + + /** + * Convert the ArrowPayload to an ArrowRecordBatch. + */ + def loadBatch(allocator: BufferAllocator): ArrowRecordBatch = { +ArrowConverters.byteArrayToBatch(payload, allocator) + } + + /** + * Get the ArrowPayload as an Array[Byte]. + */ + def toByteArray: Array[Byte] = payload +} + +private[sql] object ArrowConverters { + + /** + * Map a Spark DataType to ArrowType. + */ + private[arrow] def sparkTypeToArrowType(dataType: DataType): ArrowType = { +dataType match { + case BooleanType => ArrowType.Bool.INSTANCE + case ShortType => new ArrowType.Int(8 * ShortType.defaultSize, true) + case IntegerType => new ArrowType.Int(8 * IntegerType.defaultSize, true) + case LongType => new ArrowType.Int(8 * LongType.defaultSize, true) + case FloatType => new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE) + case DoubleType => new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE) + case ByteType => new ArrowType.Int(8, true) + case StringType => ArrowType.Utf8.INSTANCE + case BinaryType => ArrowType.Binary.INSTANCE + case _ => throw new UnsupportedOperationException(s"Unsupported data type: $dataType") +} + } + + /** + * Convert a Spark Dataset schema to Arrow schema. + */ + private[arrow] def schemaToArrowSchema(schema: StructType): Schema = { +val arrowFields = schema.fields.map { f => + new Field(f.name, f.nullable, sparkTypeToArrowType(f.dataType), List.empty[Field].asJava) +} +new Schema(arrowFields.toList.asJava) + } + + /** + * Maps Iterator from InternalRow to ArrowPayload. Limit ArrowRecordBatch size in ArrowPayload + * by setting maxRecordsPerBatch or use 0 to fully consume rowIter. 
+ */ + private[sql] def toPayloadIterator( + rowIter: Iterator[InternalRow], + schema: StructType, + maxRecordsPerBatch: Int): Iterator[ArrowPayload] = { +new Iterator[ArrowPayload] { + private val _allocator = new RootAllocator(Long.MaxValue) + private var _nextPayload = if (rowIter.nonEmpty) convert() else null + + override def hasNext: Boolean = _nextPayload != null + + override def next(): ArrowPayload = { +val obj = _nextPayload +if (hasNext) { +
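The `sparkTypeToArrowType` mapping quoted in this diff derives integral Arrow bit widths as `8 * defaultSize` and marks them all signed. A hedged Python rendering of that table (tuples stand in for the Arrow `ArrowType` objects; the default sizes are the standard Spark SQL type widths in bytes):

```python
# Spark SQL defaultSize in bytes for the integral types in the diff.
SPARK_DEFAULT_SIZE = {"ShortType": 2, "IntegerType": 4, "LongType": 8}

def spark_type_to_arrow(data_type):
    """Mirror of the diff's Scala match: Spark type name -> Arrow type."""
    if data_type == "BooleanType":
        return ("bool",)
    if data_type in SPARK_DEFAULT_SIZE:
        # integral types: 8 * defaultSize bits, signed
        return ("int", 8 * SPARK_DEFAULT_SIZE[data_type], True)
    if data_type == "ByteType":
        return ("int", 8, True)
    if data_type == "FloatType":
        return ("floating_point", "SINGLE")
    if data_type == "DoubleType":
        return ("floating_point", "DOUBLE")
    if data_type == "StringType":
        return ("utf8",)
    if data_type == "BinaryType":
        return ("binary",)
    raise TypeError(f"Unsupported data type: {data_type}")
```

As in the Scala original, anything outside this closed set (maps, arrays, structs at this stage of the PR) fails fast rather than converting silently.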
[GitHub] spark issue #18249: [SPARK-19937] Collect metrics for remote bytes read to d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18249 **[Test build #78140 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78140/testReport)** for PR 18249 at commit [`9768860`](https://github.com/apache/spark/commit/9768860046f69530926215ef1ec5162213a20616). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18025 **[Test build #78153 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78153/testReport)** for PR 18025 at commit [`19d063c`](https://github.com/apache/spark/commit/19d063c6995fa6bd780830a941f6b1f7c45c1bac).
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18320 For normal use cases, I carefully suspect it might be fine, because I executed 200 * ~10 tasks quickly on a single machine, but I don't know whether it happens frequently when jobs run slowly across a distributed cluster. At least, this was not reproduced when the number of fork executions was small. Practically, it might be fine, but it needs more investigation if this issue is important enough to prioritize.
[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18075#discussion_r122359345 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -233,10 +222,124 @@ class CodegenContext { // The collection of sub-expression result resetting methods that need to be called on each row. val subexprFunctions = mutable.ArrayBuffer.empty[String] - def declareAddedFunctions(): String = { -addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n") + /** + * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for + * whichever class is generated as the outermost class and which will contain any nested + * sub-classes. All other classes and instance names in this list will represent private, nested + * sub-classes. + */ + private val classes: mutable.ListBuffer[(String, String)] = +mutable.ListBuffer[(String, String)]("OuterClass" -> null) + + // A map holding the current size in bytes of each class to be generated. + private val classSize: mutable.Map[String, Int] = +mutable.Map[String, Int]("OuterClass" -> 0) + + // Nested maps holding function names and their code belonging to each class. + private val classFunctions: mutable.Map[String, mutable.Map[String, String]] = +mutable.Map("OuterClass" -> mutable.Map.empty[String, String]) + + // Returns the size of the most recently added class. + private def currClassSize(): Int = classSize(classes.head._1) + + // Returns the class name and instance name for the most recently added class. + private def currClass(): (String, String) = classes.head + + // Adds a new class. Requires the class' name, and its instance name. 
+ private def addClass(className: String, classInstance: String): Unit = { +classes.prepend(className -> classInstance) +classSize += className -> 0 +classFunctions += className -> mutable.Map.empty[String, String] } + /** + * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the + * function will be inlined into a new private, nested class, and a class-qualified name for the + * function will be returned. Otherwise, the function will be inlined into the `OuterClass` and the + * simple `funcName` will be returned. + * + * @param funcName the class-unqualified name of the function + * @param funcCode the body of the function + * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This --- End diff -- yup, whole stage codegen is really tricky...
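The bookkeeping under review above — track the accumulated code size of the class currently being generated, and spill new functions into a fresh private nested class once it grows too large — can be modeled outside Spark. The following is an illustrative Python sketch of that scheme, not Spark's actual Scala implementation; the threshold and all names are made up for the example.

```python
# Illustrative model of the function-splitting scheme discussed above:
# functions are appended to the current class until its accumulated code
# size crosses a threshold, then a new private nested class is opened and
# the function name is returned qualified by that class's instance.
# The threshold and names are hypothetical, not Spark's real values.

MAX_CLASS_SIZE = 1600  # illustrative split threshold, in bytes of source


class FunctionRegistry:
    def __init__(self):
        # (class_name, instance_name); the outer class has no instance name
        self.classes = [("OuterClass", None)]
        self.class_size = {"OuterClass": 0}
        self.nested_count = 0

    def add_function(self, func_name, func_code):
        curr_name, _curr_instance = self.classes[0]
        if self.class_size[curr_name] + len(func_code) > MAX_CLASS_SIZE:
            # Current class is full: open a new private nested class.
            self.nested_count += 1
            new_class = f"NestedClass{self.nested_count}"
            new_instance = f"nestedClassInstance{self.nested_count}"
            self.classes.insert(0, (new_class, new_instance))
            self.class_size[new_class] = len(func_code)
            # Callers must reference the function through the instance.
            return f"{new_instance}.{func_name}"
        self.class_size[curr_name] += len(func_code)
        return func_name  # inlined into the outer class, simple name suffices
```

In this model, small functions keep their simple names, while a function added after the current class fills up comes back instance-qualified (e.g. `nestedClassInstance1.f2`) — which is why a flag like `inlineToOuterClass` is needed for code that must stay addressable by its simple name.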
[GitHub] spark pull request #18321: [SPARK-12552][FOLLOWUP] Fix flaky test for "o.a.s...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/18321#discussion_r122359325 --- Diff: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala --- @@ -214,7 +214,7 @@ class MasterSuite extends SparkFunSuite master.rpcEnv.setupEndpoint(Master.ENDPOINT_NAME, master) // Wait until Master recover from checkpoint data. eventually(timeout(5 seconds), interval(100 milliseconds)) { -master.idToApp.size should be(1) +master.workers.size should be(1) --- End diff -- I think the reason is that workers are recovered later than applications.
[GitHub] spark pull request #18321: [SPARK-12552][FOLLOWUP] Fix flaky test for "o.a.s...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18321#discussion_r122359276 --- Diff: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala --- @@ -214,7 +214,7 @@ class MasterSuite extends SparkFunSuite master.rpcEnv.setupEndpoint(Master.ENDPOINT_NAME, master) // Wait until Master recover from checkpoint data. eventually(timeout(5 seconds), interval(100 milliseconds)) { -master.idToApp.size should be(1) +master.workers.size should be(1) --- End diff -- can you explain more about why it may fail?
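The `eventually(timeout(...), interval(...))` pattern used in the test above — polling an asynchronously updated condition instead of asserting it immediately — is what makes tests over RPC event processing stable. A minimal Python analogue of ScalaTest's `eventually`, for illustration only (the names and defaults here are made up, not Spark's test utilities):

```python
import time


def eventually(condition, timeout=5.0, interval=0.1):
    """Poll `condition` until it returns True, failing after `timeout` seconds.

    Mirrors the eventually(timeout(5 seconds), interval(100 milliseconds))
    idiom from the ScalaTest snippet above; names here are illustrative.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise AssertionError("condition not met within %.1fs" % timeout)
        time.sleep(interval)
```

The thread's point applies here too: since workers are reportedly recovered after applications, polling on the state that is recovered last means the earlier recovery steps must already have completed once the assertion passes.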
[GitHub] spark pull request #18301: [SPARK-21052][SQL] Add hash map metrics to join
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18301#discussion_r122359244 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala --- @@ -74,6 +80,19 @@ object SQLMetrics { private val TIMING_METRIC = "timing" private val AVERAGE_METRIC = "average" + private val baseForAvgMetric: Int = 10 --- End diff -- Yeah, that's why I record the ceil of the average number at the beginning. cc @rxin What do you think?
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025 @felixcheung Your comments are all addressed now. Please let me know if there is anything else needed.
[GitHub] spark issue #17758: [SPARK-20460][SQL] Make it more consistent to handle col...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17758 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78142/ Test PASSed.
[GitHub] spark issue #17758: [SPARK-20460][SQL] Make it more consistent to handle col...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17758 Merged build finished. Test PASSed.
[GitHub] spark pull request #18301: [SPARK-21052][SQL] Add hash map metrics to join
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18301#discussion_r122359108 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala --- @@ -74,6 +80,19 @@ object SQLMetrics { private val TIMING_METRIC = "timing" private val AVERAGE_METRIC = "average" + private val baseForAvgMetric: Int = 10 --- End diff -- I'm not quite sure this hack is worth it. For small numbers of probes, we don't care about the values, I think. For large numbers of probes, having one more digit in the fraction part is not very useful.
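The `baseForAvgMetric = 10` trick being debated stores an average as a scaled integer so that one fractional digit survives in a long-valued metric. A minimal sketch of that fixed-point encoding (function names here are illustrative, not Spark's API):

```python
import math

BASE_FOR_AVG_METRIC = 10  # scale factor: preserves one digit after the decimal point


def encode_avg(total, count):
    """Store ceil(average * 10) in an integer-valued metric slot."""
    return math.ceil(total / count * BASE_FOR_AVG_METRIC)


def decode_avg(stored):
    """Recover the (rounded-up) average for display."""
    return stored / BASE_FOR_AVG_METRIC
```

So an average of 1.23 probes per lookup is stored as 13 and displayed as 1.3. The objection above is that this extra digit only matters in a narrow band: tiny averages are uninteresting, and for large averages the fraction adds little.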
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18320 Yes, there is still the issue and this only fixes (avoids) the test failure. I believe running the code should reproduce the issue on both macOS and CentOS. What I don't get is that, when the number of fork executions is small, this is not reproduced (sometimes the increase in pipes was not even observed). The number of pipes decreases under certain conditions (it did not look related to time but to some events). The issue is exposed and found with `gapply` now because it invokes many forks via `daemon.R`, but I guess it might still exist for all other APIs that execute R native functions with this daemon. I gave a shot at resolving the root cause within `daemon.R` with several tries but I could not make it work. The root cause is: with one terminal executing `watch -n 0.01 "lsof -c R | wc -l"` and another terminal running: ```r for(i in 0:200) { p <- parallel:::mcfork() if (inherits(p, "masterProcess")) { tools::pskill(Sys.getpid(), tools::SIGUSR1) parallel:::mcexit(0L) } } ``` the number of open pipes just keeps increasing. I double-checked via `netstat` and `ps` that the processes and sockets are closed. We need to resolve this one.
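One mitigation floated in this thread, since fixing the leak inside `daemon.R` did not pan out, is to cap how many tasks a daemon serves and recycle it before leaked pipe descriptors accumulate fatally. A hypothetical sketch of that execution counter, in Python rather than the actual R daemon code, with a made-up threshold:

```python
MAX_FORKS_BEFORE_RECYCLE = 100  # hypothetical threshold; would need tuning


class ForkingDaemon:
    """Toy stand-in for daemon.R's fork loop: counts executions so the
    daemon can be recycled before leaked pipe descriptors become fatal.
    This models the idea discussed in the thread, not Spark's code."""

    def __init__(self):
        self.forks = 0

    def serve(self, task):
        # In the real daemon each task would run in a forked child process;
        # here we just run it inline and count.
        self.forks += 1
        return task()

    @property
    def should_recycle(self):
        # The parent (worker JVM side) would respawn a fresh daemon when True.
        return self.forks >= MAX_FORKS_BEFORE_RECYCLE
```

The trade-off is that recycling papers over the leak rather than fixing it, but it bounds the number of descriptors any single daemon can leak, which is exactly the "put a count on the number of executions" suggestion.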
[GitHub] spark issue #17758: [SPARK-20460][SQL] Make it more consistent to handle col...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17758 **[Test build #78142 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78142/testReport)** for PR 17758 at commit [`44f3d35`](https://github.com/apache/spark/commit/44f3d35fc947b845b60a527723a0c5aabf991145). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18025 **[Test build #78152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78152/testReport)** for PR 18025 at commit [`0a7f5fc`](https://github.com/apache/spark/commit/0a7f5fcac2e0295d92b82d8909c4f1b11c82f016).
[GitHub] spark issue #18092: [SPARK-20640][CORE]Make rpc timeout and retry for shuffl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18092 **[Test build #78151 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78151/testReport)** for PR 18092 at commit [`d01134e`](https://github.com/apache/spark/commit/d01134ef92401a5275c7388c8e6d65c82785acfa).
[GitHub] spark pull request #18025: [SPARK-20889][SparkR] Grouped documentation for A...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/18025#discussion_r122358689 --- Diff: R/pkg/R/generics.R --- @@ -919,10 +920,9 @@ setGeneric("array_contains", function(x, value) { standardGeneric("array_contain #' @export setGeneric("ascii", function(x) { standardGeneric("ascii") }) -#' @param x Column to compute on or a GroupedData object. --- End diff -- In this case, we will have to document `avg` on its own, like `count`, `first` and `last`. I cannot document the `x` param here since it will show up in the doc for the column class. Interestingly, there is not even a doc of the `avg` method from the `GroupedData` class.
[GitHub] spark issue #18308: [SPARK-21099][Spark Core] INFO Log Message Using Incorre...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/18308 > I wonder if whether executor is completely gone or whether executor is still there but has no cached RDD, if both scenarios return false. Yes, that's the case, we cannot differentiate these two scenarios. But I think it is fine, since it is just a log issue and it is hard for us to differentiate them in the current code.
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18320 thx - I think more importantly, does the issue manifest when someone manually calls gapply in a similar way on RHEL/CentOS? We could work around the test failure, but if a user can run into this in normal use then we need to address this within gapply
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18320 **[Test build #78148 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78148/testReport)** for PR 18320 at commit [`52c8abf`](https://github.com/apache/spark/commit/52c8abf9551e126f75ef0aa0a042f1ebd13e8d47). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18320 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78148/ Test PASSed.
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18320 Merged build finished. Test PASSed.
[GitHub] spark issue #18321: [SPARK-12552][FOLLOWUP] Fix flaky test for "o.a.s.deploy...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18321 **[Test build #78150 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78150/testReport)** for PR 18321 at commit [`55c5d12`](https://github.com/apache/spark/commit/55c5d12023dec1cbf2e8aa6b4507c49c3df5b322).
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18320 Merged build finished. Test PASSed.
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18320 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78147/ Test PASSed.
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18320 **[Test build #78147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78147/testReport)** for PR 18320 at commit [`505d75f`](https://github.com/apache/spark/commit/505d75f0e9a90481f96d0f1fefd4f9baaa38ee7d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #18321: [SPARK-12552][FOLLOWUP] Fix flaky test for "o.a.s...
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/18321 [SPARK-12552][FOLLOWUP] Fix flaky test for "o.a.s.deploy.master.MasterSuite.master correctly recover the application" ## What changes were proposed in this pull request? Due to the RPC's asynchronous event processing, the test "correctly recover the application" could potentially fail. The issue can be seen here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78126/testReport/org.apache.spark.deploy.master/MasterSuite/master_correctly_recover_the_application/. This PR fixes the flaky test. ## How was this patch tested? Existing UT. CC @cloud-fan @jiangxb1987 , please help to review, thanks! You can merge this pull request into a Git repository by running: $ git pull https://github.com/jerryshao/apache-spark SPARK-12552-followup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18321.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18321 commit 55c5d12023dec1cbf2e8aa6b4507c49c3df5b322 Author: jerryshao Date: 2017-06-16T03:10:48Z Fix flaky test Change-Id: I20f1a68b682cbfda05be319b365495c80fb4cda4
[GitHub] spark issue #18025: [SPARK-20889][SparkR] Grouped documentation for AGGREGAT...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18025 **[Test build #78149 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78149/testReport)** for PR 18025 at commit [`014b9f3`](https://github.com/apache/spark/commit/014b9f3069a6e2075cb8be307c5d74081dabe15a).
[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18075#discussion_r122356960 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -233,10 +222,118 @@ class CodegenContext { // The collection of sub-expression result resetting methods that need to be called on each row. val subexprFunctions = mutable.ArrayBuffer.empty[String] - def declareAddedFunctions(): String = { -addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n") + val outerClassName = "OuterClass" + + /** + * Holds the class and instance names to be generated, where `OuterClass` is a placeholder + * standing for whichever class is generated as the outermost class and which will contain any + * nested sub-classes. All other classes and instance names in this list will represent private, + * nested sub-classes. + */ + private val classes: mutable.ListBuffer[(String, String)] = +mutable.ListBuffer[(String, String)](outerClassName -> null) + + // A map holding the current size in bytes of each class to be generated. + private val classSize: mutable.Map[String, Int] = +mutable.Map[String, Int](outerClassName -> 0) + + // Nested maps holding function names and their code belonging to each class. + private val classFunctions: mutable.Map[String, mutable.Map[String, String]] = +mutable.Map(outerClassName -> mutable.Map.empty[String, String]) + + // Returns the size of the most recently added class. + private def currClassSize(): Int = classSize(classes.head._1) + + // Returns the class name and instance name for the most recently added class. + private def currClass(): (String, String) = classes.head + + // Adds a new class. Requires the class' name, and its instance name. 
+ private def addClass(className: String, classInstance: String): Unit = { +classes.prepend(className -> classInstance) +classSize += className -> 0 +classFunctions += className -> mutable.Map.empty[String, String] + } + + /** + * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the + * function will be inlined into a new private, nested class, and a class-qualified name for the --- End diff -- nit: class instance-qualified name
[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18075#discussion_r122356982 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -233,10 +222,118 @@ class CodegenContext { // The collection of sub-expression result resetting methods that need to be called on each row. val subexprFunctions = mutable.ArrayBuffer.empty[String] - def declareAddedFunctions(): String = { -addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n") + val outerClassName = "OuterClass" + + /** + * Holds the class and instance names to be generated, where `OuterClass` is a placeholder + * standing for whichever class is generated as the outermost class and which will contain any + * nested sub-classes. All other classes and instance names in this list will represent private, + * nested sub-classes. + */ + private val classes: mutable.ListBuffer[(String, String)] = +mutable.ListBuffer[(String, String)](outerClassName -> null) + + // A map holding the current size in bytes of each class to be generated. + private val classSize: mutable.Map[String, Int] = +mutable.Map[String, Int](outerClassName -> 0) + + // Nested maps holding function names and their code belonging to each class. + private val classFunctions: mutable.Map[String, mutable.Map[String, String]] = +mutable.Map(outerClassName -> mutable.Map.empty[String, String]) + + // Returns the size of the most recently added class. + private def currClassSize(): Int = classSize(classes.head._1) + + // Returns the class name and instance name for the most recently added class. + private def currClass(): (String, String) = classes.head + + // Adds a new class. Requires the class' name, and its instance name. 
+ private def addClass(className: String, classInstance: String): Unit = { +classes.prepend(className -> classInstance) +classSize += className -> 0 +classFunctions += className -> mutable.Map.empty[String, String] + } + + /** + * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the + * function will be inlined into a new private, nested class, and a class-qualified name for the + * function will be returned. Otherwise, the function will be inlined into the `OuterClass` and the + * simple `funcName` will be returned. + * + * @param funcName the class-unqualified name of the function + * @param funcCode the body of the function + * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This + * can be necessary when a function is declared outside of the context + * it is eventually referenced and a returned qualified function name + * cannot otherwise be accessed. + * @return the name of the function, qualified by class if it will be inlined to a private, --- End diff -- ditto.
[GitHub] spark pull request #18025: [SPARK-20889][SparkR] Grouped documentation for A...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/18025#discussion_r122356625 --- Diff: R/pkg/R/generics.R --- @@ -1403,20 +1416,25 @@ setGeneric("unix_timestamp", function(x, format) { standardGeneric("unix_timesta #' @export setGeneric("upper", function(x) { standardGeneric("upper") }) -#' @rdname var +#' @rdname column_aggregate_functions +#' @param y,na.rm,use currently not used. --- End diff -- Good point. Moved to `column_aggregate_functions`.
[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18075#discussion_r122356214 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -233,10 +222,124 @@ class CodegenContext { // The collection of sub-expression result resetting methods that need to be called on each row. val subexprFunctions = mutable.ArrayBuffer.empty[String] - def declareAddedFunctions(): String = { -addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n") + /** + * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for + * whichever class is generated as the outermost class and which will contain any nested + * sub-classes. All other classes and instance names in this list will represent private, nested + * sub-classes. + */ + private val classes: mutable.ListBuffer[(String, String)] = +mutable.ListBuffer[(String, String)]("OuterClass" -> null) + + // A map holding the current size in bytes of each class to be generated. + private val classSize: mutable.Map[String, Int] = +mutable.Map[String, Int]("OuterClass" -> 0) + + // Nested maps holding function names and their code belonging to each class. + private val classFunctions: mutable.Map[String, mutable.Map[String, String]] = +mutable.Map("OuterClass" -> mutable.Map.empty[String, String]) + + // Returns the size of the most recently added class. + private def currClassSize(): Int = classSize(classes.head._1) + + // Returns the class name and instance name for the most recently added class. + private def currClass(): (String, String) = classes.head + + // Adds a new class. Requires the class' name, and its instance name. + private def addClass(className: String, classInstance: String): Unit = { +classes.prepend(className -> classInstance) +classSize += className -> 0 +classFunctions += className -> mutable.Map.empty[String, String] } + /** + * Adds a function to the generated class. 
If the code for the `OuterClass` grows too large, the + * function will be inlined into a new private, nested class, and a class-qualified name for the + * function will be returned. Otherwise, the function will be inlined into the `OuterClass` and the + * simple `funcName` will be returned. + * + * @param funcName the class-unqualified name of the function + * @param funcCode the body of the function + * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This --- End diff -- It seems to me that, as the `stopEarly` in `Limit` is going to override the `stopEarly` in `BufferedRowIterator`, we can only put it in the outer class.
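The point about `stopEarly` can be illustrated with a minimal sketch. Here `BufferedRowIteratorLike` and `GeneratedIterator` are illustrative stand-ins for `BufferedRowIterator` and the generated class, not Spark code:

```scala
// Why `inlineToOuterClass` is needed: a method that must override one
// inherited by the generated outer class cannot live in a nested helper
// class, because the nested class is not part of the outer class's
// inheritance chain.
abstract class BufferedRowIteratorLike {
  def stopEarly(): Boolean = false
}

class GeneratedIterator extends BufferedRowIteratorLike {
  // Must be declared here for the override to take effect.
  override def stopEarly(): Boolean = true

  // Functions split out into a nested class cannot override inherited
  // members of GeneratedIterator; they can only be called through an
  // instance of the nested class.
  class NestedHelper {
    def someSplitFunction(): Int = 1
  }
}
```

A `stopEarly` declared only inside `NestedHelper` would be an unrelated method, and the inherited default would still be used, which is why such functions must be forced into the outer class.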
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18320 **[Test build #78148 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78148/testReport)** for PR 18320 at commit [`52c8abf`](https://github.com/apache/spark/commit/52c8abf9551e126f75ef0aa0a042f1ebd13e8d47).
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18320 Yes, I believe you are correct and the daemon is already running, but it avoids using the problematic daemon - https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L363-L392 - to my knowledge.
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18320 Hmm, I'm not sure - I'm pretty sure the session / Spark context is already initialized when this test is run; does changing the setting here affect the daemon process that is already running?
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18320 **[Test build #78147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78147/testReport)** for PR 18320 at commit [`505d75f`](https://github.com/apache/spark/commit/505d75f0e9a90481f96d0f1fefd4f9baaa38ee7d).
[GitHub] spark pull request #17702: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17702#discussion_r122354359 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -389,6 +389,23 @@ case class DataSource( } /** + * Return all paths represented by the wildcard string. + */ + private def getGlobbedPaths(qualified: Path): Seq[Path] = { --- End diff -- At the least, we should follow `InMemoryFileIndex.bulkListLeafFiles`, which `Picks the listing strategy adaptively depending on the number of paths to list`.
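The adaptive strategy the reviewer points to could look roughly like the following sketch. The function name, the threshold, and the use of local `Future`s in place of a distributed Spark job are all illustrative assumptions, not Spark's actual implementation:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Sketch of adaptive glob resolution: list serially when there are few
// paths, concurrently when there are many. In Spark the concurrent branch
// would be a distributed job, as in InMemoryFileIndex.bulkListLeafFiles.
def getGlobbedPathsAdaptively(
    paths: Seq[String],
    parallelismThreshold: Int = 32)(
    resolveGlob: String => Seq[String]): Seq[String] = {
  if (paths.length <= parallelismThreshold) {
    // Few paths: resolving on the driver is cheaper than launching a job.
    paths.flatMap(resolveGlob)
  } else {
    // Many paths: fan the glob calls out and collect the results.
    val futures = paths.map(p => Future(resolveGlob(p)))
    Await.result(Future.sequence(futures), Duration.Inf).flatten
  }
}
```

The threshold keeps the common case (one or two input paths) free of job-launch overhead, while still parallelizing the expensive many-path case.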
[GitHub] spark issue #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/ga...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18320 cc @felixcheung, @shivaram and @MLnick.
[GitHub] spark pull request #18320: [SPARK-21093][R] Avoid mcfork in R's daemon in ga...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/18320 [SPARK-21093][R] Avoid mcfork in R's daemon in gapply/gapplyCollect tests ## What changes were proposed in this pull request? `mcfork` in R appears to open a pipe ahead of time, but the existing logic does not properly close it when it is executed hot. This leads to further forking failing due to the limit on the number of open files. This hot execution path is hit particularly by `gapply`/`gapplyCollect`. For unknown reasons, this happens more easily on CentOS and could be reproduced on Mac too. All the details are described in https://issues.apache.org/jira/browse/SPARK-21093 This PR proposes simply not reusing that daemon, instead forking each process from the JVM; those processes all appear to terminate correctly. ## How was this patch tested? I ran the code below on both CentOS and Mac. ```r df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d")) collect(gapply(df, "a", function(key, x) { x }, schema(df))) collect(gapply(df, "a", function(key, x) { x }, schema(df))) ... # 30 times ``` Also, it now passes the R tests on CentOS as below: ``` SparkSQL functions: Spark package found in SPARK_HOME: .../spark .. .. .. .. .. ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-21093 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18320.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18320 commit 505d75f0e9a90481f96d0f1fefd4f9baaa38ee7d Author: hyukjinkwon Date: 2017-06-16T02:37:53Z Avoid mcfork in R's daemon in gapply tests
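The JVM-side decision the fix leans on can be sketched as follows. All names here are illustrative stand-ins; the real logic lives in `RRunner.scala` around the lines linked earlier in the thread:

```scala
// Hedged sketch of the daemon-vs-fresh-worker choice. With the daemon, one
// long-lived R process `mcfork`s workers, so leaked pipe file descriptors
// accumulate across hot invocations; with a fresh worker per task, the OS
// reclaims all descriptors when the process exits.
case class RWorkerConf(useDaemon: Boolean)

def launchRWorker(conf: RWorkerConf,
                  connectToDaemon: () => Int,
                  forkFreshWorker: () => Int): Int = {
  if (conf.useDaemon) connectToDaemon() // reuse the long-running daemon
  else forkFreshWorker()                // fresh R process per task
}
```

Under this framing, the PR's workaround is taking the second branch for the hot `gapply`/`gapplyCollect` path, trading fork overhead for freedom from the descriptor leak.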
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17702 Merged build finished. Test PASSed.