[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49261722

Eh, the binary checker is really failing me. Is there a way to disable the binary checker for inner functions? @pwendell
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15042631

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
       val map = new JHashMap[K, V]
-      iter.foreach { case (k, v) =>
-        val old = map.get(k)
-        map.put(k, if (old == null) v else func(old, v))
+      iter.foreach { pair =>
+        val old = map.get(pair._1)
--- End diff --

This I'm sure has been done a million times, but one more for good luck: here is a `map { case (x, y) => x + y }` versus `foreach { xy => xy._1 + xy._2 }`: http://www.diffchecker.com/vtv6cptx

No new virtual function calls, but one new branch (trivially branch-predictable) and one new throw (which will never be invoked), and several stores/loads. (Perhaps I'm misreading the bytecode, but isn't `astore_2` followed by `aload_2` a no-op?)
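For readers who want to reproduce the comparison: a minimal, self-contained pair of closures in the same two shapes discussed above (toy code, not Spark's). Compiling it and diffing the output of `javap -c` shows the extra destructuring branch and the never-taken `MatchError` throw in the pattern-matching form.

```scala
object ClosureShapes {
  def main(args: Array[String]): Unit = {
    val pairs = Seq((1, 2), (3, 4))
    // Pattern-matching form: the closure destructures the Tuple2 through a
    // match that carries a (never-taken) MatchError branch.
    pairs.foreach { case (x, y) => println(x + y) }
    // Accessor form: plain _1/_2 field reads, no destructuring branch.
    pairs.foreach { xy => println(xy._1 + xy._2) }
  }
}
```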
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15042857

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
       val map = new JHashMap[K, V]
-      iter.foreach { case (k, v) =>
-        val old = map.get(k)
-        map.put(k, if (old == null) v else func(old, v))
+      iter.foreach { pair =>
+        val old = map.get(pair._1)
--- End diff --

I think we tested this in the past and saw a difference, perhaps because it can now throw. But I don't remember the details. Weird that it's also doing those stores.
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15042897

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
       val map = new JHashMap[K, V]
-      iter.foreach { case (k, v) =>
-        val old = map.get(k)
-        map.put(k, if (old == null) v else func(old, v))
+      iter.foreach { pair =>
+        val old = map.get(pair._1)
--- End diff --

There are all kinds of things that are really hard to profile at runtime. For example, the JVM might decide not to JIT the code because it is longer. It might decide not to do it because of exceptions. Branch prediction might stop working because there are too many branches. Etc. Again, I'm only in favor of this change because it is small and it might have an impact. If it were a big, ugly change, I would be against it without profiling.
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15042905

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
       throw new SparkException("Default partitioner cannot partition array keys.")
     }
     val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
-    cg.mapValues { case Seq(vs, w1s, w2s, w3s) =>
-      (vs.asInstanceOf[Seq[V]],
-        w1s.asInstanceOf[Seq[W1]],
-        w2s.asInstanceOf[Seq[W2]],
-        w3s.asInstanceOf[Seq[W3]])
-    }
+    cg.mapValues { seq => seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], Seq[W3])] }
--- End diff --

And yeah, as Reynold pointed out, this won't actually cast. A Seq is not a Tuple4; you need to make a new one. In this case it might be okay to leave the code as is, because the non-pattern-matching way will be hairier. Basically you'd have to do `seq => (seq(0).asInstanceOf[Seq[V]], seq(1).asInstanceOf[Seq[W1]], etc)`.
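A tiny self-contained illustration of Matei's point (the element types here are illustrative only, not the Spark code): casting the `Seq` itself to a tuple type fails at runtime, while indexing and casting each element works.

```scala
object SeqIsNotATuple {
  def main(args: Array[String]): Unit = {
    val seq: Seq[Any] = Seq(Seq(1), Seq("a"), Seq(2.0), Seq(true))
    // seq.asInstanceOf[(Seq[Int], Seq[String], Seq[Double], Seq[Boolean])]
    // would throw ClassCastException: a Seq is not a Tuple4.
    // The non-pattern-matching rewrite has to rebuild the tuple:
    val tuple = (seq(0).asInstanceOf[Seq[Int]],
                 seq(1).asInstanceOf[Seq[String]],
                 seq(2).asInstanceOf[Seq[Double]],
                 seq(3).asInstanceOf[Seq[Boolean]])
    println(tuple)
  }
}
```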
[GitHub] spark pull request: [branch-0.9] Fix github links in docs
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/1456

[branch-0.9] Fix github links in docs

We moved example code in v1.0. The links are no longer valid if still pointing to `tree/master`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark links-0.9

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1456.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1456

commit b7b926092bd13f4f6adf9ab9871c3e086309ed23
Author: Xiangrui Meng m...@databricks.com
Date: 2014-07-17T06:33:23Z

    change tree/master to tree/branch-0.9 in docs
[GitHub] spark pull request: [SPARK-2299] Consolidate various stageIdTo* ha...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1262#issuecomment-49263949

I pushed a new version. I'd first merge this and then have a separate PR to index the hash table by stageId + attempt. Now it includes @kayousterhout's change. Please take another look.
[GitHub] spark pull request: Required AM memory is amMem, not args.amMem...
GitHub user maji2014 opened a pull request: https://github.com/apache/spark/pull/1457

Required AM memory is amMem, not args.amMemory

The error "ERROR yarn.Client: Required AM memory (1024) is above the max threshold (1048) of this cluster" appears if this code is not changed. Obviously 1024 is less than 1048, so the message is reporting the wrong value: it should report `amMem` (the actual requirement), not `args.amMemory`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maji2014/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1457.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1457

commit 2cc1af6e69fa1dc1a76383e44d9b5389d4e133af
Author: maji2014 ma...@asiainfo-linkage.com
Date: 2014-06-08T11:19:01Z

    Update run-example

    Old code can only be run under spark_home using bin/run-example. The error "./run-example: line 55: ./bin/spark-submit: No such file or directory" appears when running from any other directory. So change this.

commit 9975f13235c4e4f325df8419878b0977db533e5f
Author: derek ma ma...@asiainfo-linkage.com
Date: 2014-07-17T06:35:30Z

    Required AM memory is amMem, not args.amMemory

    The error "ERROR yarn.Client: Required AM memory (1024) is above the max threshold (1048) of this cluster" appears if this code is not changed. Obviously, 1024 is less than 1048. So change this.
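A hedged sketch of the check being described. The variable names follow the PR title; the surrounding YARN `Client` code, and the assumption that `amMem` is the requested memory plus a memory overhead, are not quoted from the patch.

```scala
def verifyAmMemory(requestedMem: Int, memoryOverhead: Int, maxMem: Int): Unit = {
  // The real requirement includes the overhead; reporting requestedMem here
  // is what produced the confusing "(1024) is above the max threshold (1048)".
  val amMem = requestedMem + memoryOverhead
  if (amMem > maxMem) {
    throw new IllegalArgumentException(
      s"Required AM memory ($amMem) is above the max threshold ($maxMem) of this cluster")
  }
}
```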
[GitHub] spark pull request: [branch-0.9] Fix github links in docs
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1456#issuecomment-49263937

This PR only contains changes in docs, and the links were verified using linkchecker.
[GitHub] spark pull request: [branch-0.9] Fix github links in docs
Github user mengxr closed the pull request at: https://github.com/apache/spark/pull/1456
[GitHub] spark pull request: [branch-0.9] Fix github links in docs
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1456#issuecomment-49264105

Merged into branch-0.9.
[GitHub] spark pull request: Required AM memory is amMem, not args.amMem...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1457#issuecomment-49264180

Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-2299] Consolidate various stageIdTo* ha...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1262#issuecomment-49264275

QA tests have started for PR 1262. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16768/consoleFull
[GitHub] spark pull request: Required AM memory is amMem, not args.amMem...
Github user maji2014 commented on the pull request: https://github.com/apache/spark/pull/1457#issuecomment-49264699

Please focus on the second issue, as the first issue is an old patch from June.
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user lirui-intel commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49264857

If a TaskSet only contains no-pref tasks, there won't be any delay, because the only valid level is ANY, so everything gets scheduled right away. If a TaskSet contains process-local + no-pref tasks (I suppose it should be process-local + node-local + no-pref, actually, since a process-local task is also node-local?), there'll be a delay on the no-pref tasks, even when all the process-local tasks have been launched. This is because valid levels are not re-computed as tasks finish. Not sure if there'd be too much overhead to do so... (they should also be re-computed when an executor is lost)
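A toy model of the recomputation being suggested (an assumption-laden sketch, not Spark's `TaskSetManager`): valid locality levels derive from which pending-task buckets are non-empty, so they go stale unless re-derived as tasks finish or executors are lost.

```scala
object Locality extends Enumeration {
  val ProcessLocal, NodeLocal, Any = Value
}

// Re-running this as tasks complete (or executors disappear) would drop
// levels whose buckets have drained, letting no-pref tasks run immediately.
def validLevels(pendingPerExecutor: Map[String, Int],
                pendingPerHost: Map[String, Int]): Seq[Locality.Value] = {
  val levels = scala.collection.mutable.ArrayBuffer.empty[Locality.Value]
  if (pendingPerExecutor.values.exists(_ > 0)) levels += Locality.ProcessLocal
  if (pendingPerHost.values.exists(_ > 0)) levels += Locality.NodeLocal
  levels += Locality.Any // no-pref tasks are always eligible at ANY
  levels.toSeq
}
```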
[GitHub] spark pull request: [branch-0.9] bump versions for v0.9.2 release ...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/1458

[branch-0.9] bump versions for v0.9.2 release candidate

Manually update some version numbers.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark v0.9.2-rc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1458.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1458

commit bc870352f52ad4016a3f21f9f21cc37047536b53
Author: Xiangrui Meng m...@databricks.com
Date: 2014-07-17T05:36:48Z

    bump version numbers to 0.9.2

commit 162af66ad32d98e6a7cfc785af65f85930b63d6e
Author: Xiangrui Meng m...@databricks.com
Date: 2014-07-17T05:37:08Z

    Merge remote-tracking branch 'apache/branch-0.9' into v0.9.2-rc

commit ea2b20576d829fe66c689a08fdcca9851fcc868d
Author: Xiangrui Meng m...@databricks.com
Date: 2014-07-17T05:42:30Z

    update version in SparkBuild

commit 7d0fb769cdb4e0b17f4ed22689d609a3a460c4e8
Author: Xiangrui Meng m...@databricks.com
Date: 2014-07-17T06:33:23Z

    change tree/master to tree/branch-0.9 in docs

commit 2c38419bc02e4f4a830f04da2509f1d73ec7aa83
Author: Xiangrui Meng m...@databricks.com
Date: 2014-07-17T06:41:08Z

    Merge remote-tracking branch 'apache/branch-0.9' into v0.9.2-rc
[GitHub] spark pull request: [branch-0.9] bump versions for v0.9.2 release ...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1458#issuecomment-49265114

Merged into branch-0.9.
[GitHub] spark pull request: [branch-0.9] bump versions for v0.9.2 release ...
Github user mengxr closed the pull request at: https://github.com/apache/spark/pull/1458
[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49265512

QA results for PR 1439:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class HadoopTableReader(@transient attributes: Seq[Attribute],

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16766/consoleFull
[GitHub] spark pull request: update CHANGES.txt
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/1459

update CHANGES.txt

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark v0.9.2-rc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1459.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1459

commit 6fa5a656281dd5df5cf7f72660efd89fe7b8ec8d
Author: Xiangrui Meng m...@databricks.com
Date: 2014-07-17T06:59:57Z

    update CHANGES.txt
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1450#discussion_r15043414

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
       throw new SparkException("reduceByKeyLocally() does not support array keys")
     }
-    def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
+    val reducePartition = (iter: Iterator[(K, V)]) => {
--- End diff --

This is fixed.
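Why convert a `def` to a `val` here? A hedged sketch of the mechanism suggested by the PR title (toy classes, not Spark's): a method belongs to its enclosing object, so eta-expanding it into a function value can produce a closure that retains that whole object, while an anonymous-function `val` captures only what its body uses.

```scala
class Enclosing(huge: Array[Byte]) extends Serializable {
  def incMethod(x: Int): Int = x + 1
  val incVal: Int => Int = x => x + 1

  // (incMethod _) eta-expands to a closure that calls this.incMethod, so
  // serializing it can drag along `this`, including `huge`. incVal's body
  // references nothing from the instance, so it can be serialized alone.
  def closures: (Int => Int, Int => Int) = (incMethod _, incVal)
}
```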
[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1459#issuecomment-49266193

QA tests have started for PR 1459. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16769/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49266828

QA tests have started for PR 1450. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16770/consoleFull
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15043849

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
       throw new SparkException("Default partitioner cannot partition array keys.")
     }
     val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
-    cg.mapValues { case Seq(vs, w1s, w2s, w3s) =>
-      (vs.asInstanceOf[Seq[V]],
-        w1s.asInstanceOf[Seq[W1]],
-        w2s.asInstanceOf[Seq[W2]],
-        w3s.asInstanceOf[Seq[W3]])
-    }
+    cg.mapValues { seq => seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], Seq[W3])] }
--- End diff --

Ah, right. I'll leave these as they are for now.
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15043860

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
       val map = new JHashMap[K, V]
-      iter.foreach { case (k, v) =>
-        val old = map.get(k)
-        map.put(k, if (old == null) v else func(old, v))
+      iter.foreach { pair =>
+        val old = map.get(pair._1)
--- End diff --

Also, on removing the calls to `_1`: these values should be in cache, so accesses will be really fast. Will hold off on these for now, but happy to make the change if y'all want.
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15044028

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     val index = p.getPartition(key)
     def process(it: Iterator[(K, V)]): Seq[V] = {
       val buf = new ArrayBuffer[V]
-      for ((k, v) <- it if k == key) {
--- End diff --

Wait, actually, my understanding of this loop is that it's iterating over every record within the partition. Am I missing something?
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15044062

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     val index = p.getPartition(key)
     def process(it: Iterator[(K, V)]): Seq[V] = {
       val buf = new ArrayBuffer[V]
-      for ((k, v) <- it if k == key) {
--- End diff --

Actually yes, my bad :)
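For reference, the standard desugaring of the loop under discussion (plain Scala semantics, nothing Spark-specific): the `for` comprehension walks every record in the partition and filters by key, via pattern-matching closures.

```scala
val it = Iterator(("a", 1), ("b", 2), ("a", 3))
val key = "a"
val buf = scala.collection.mutable.ArrayBuffer.empty[Int]
for ((k, v) <- it if k == key) {
  buf += v
}
// The compiler expands the loop to roughly:
//   it.withFilter { case (k, v) => k == key }
//     .foreach { case (k, v) => buf += v }
// i.e. every element is visited, and each closure destructures the pair.
println(buf) // ArrayBuffer(1, 3)
```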
[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1459#issuecomment-49267990

QA results for PR 1459:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16769/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49269144

I created a JIRA to deal with this and did some initial exploration, but I think I'll need to wait for Prashant to actually do it: https://issues.apache.org/jira/browse/SPARK-2549
[GitHub] spark pull request: [SPARK-2412] CoalescedRDD throws exception wit...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1337#issuecomment-49270261

Okay, I merged this.
[GitHub] spark pull request: [SPARK-2412] CoalescedRDD throws exception wit...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1337
[GitHub] spark pull request: SPARK-2526: Simplify options in make-distribut...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1445
[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1399#discussion_r15045195

--- Diff: sbin/start-thriftserver.sh ---
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Figure out where Spark is installed
+FWDIR="$(cd `dirname $0`/..; pwd)"
+
+CLASS="org.apache.spark.sql.hive.thriftserver.HiveThriftServer2"
+$FWDIR/bin/spark-class $CLASS $@
--- End diff --

Check out `spark-shell` for an example of a user-facing script that triages options to `spark-submit`.
[GitHub] spark pull request: [SPARK-2423] Clean up SparkSubmit for readabil...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1349#issuecomment-49271133

Thanks Andrew, looks good!
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/1460

[SPARK-2538] [PySpark] Hash based disk spilling aggregation

During aggregation in the Python worker, if the memory usage is above spark.executor.memory, it will do disk-spilling aggregation: the aggregation is split into multiple stages, and in each stage the aggregated data is partitioned by hash and dumped to disk. After all the data has been aggregated, the stages are merged together, partition by partition.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark spill

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1460.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1460

commit f933713ed628779309fab0da76045f8750d6b350
Author: Davies Liu davies@gmail.com
Date: 2014-07-17T08:03:32Z

    Hash based disk spilling aggregation
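A sketch of the algorithm described above, transposed to Scala for consistency with the rest of this digest (the PR itself is Python, and every name below is invented): aggregate in memory until a size limit, hash-partition the map into spill files, repeat, then merge the spill files one partition at a time so only one partition's map is ever resident.

```scala
import java.io._
import scala.collection.mutable

def externalAggregate(
    input: Iterator[(String, Int)],
    combine: (Int, Int) => Int,
    maxInMemory: Int,
    numPartitions: Int = 4): Iterator[(String, Int)] = {
  val current = mutable.HashMap.empty[String, Int]
  val spills = mutable.ArrayBuffer.empty[Array[File]]

  def partition(k: String): Int = (k.hashCode & Int.MaxValue) % numPartitions

  // Hash-partition the in-memory map into one spill file per partition.
  def spill(): Unit = {
    val files = Array.fill(numPartitions)(File.createTempFile("agg", ".spill"))
    val outs = files.map(f => new ObjectOutputStream(new FileOutputStream(f)))
    for ((k, v) <- current) outs(partition(k)).writeObject((k, v))
    outs.foreach(_.close())
    spills += files
    current.clear()
  }

  for ((k, v) <- input) {
    current(k) = current.get(k).map(combine(_, v)).getOrElse(v)
    if (current.size > maxInMemory) spill()
  }
  if (spills.isEmpty) return current.iterator
  spill() // flush the tail so every record lives in exactly one partition

  // Merge the spills, partition by partition.
  (0 until numPartitions).iterator.flatMap { p =>
    val merged = mutable.HashMap.empty[String, Int]
    for (files <- spills) {
      val in = new ObjectInputStream(new FileInputStream(files(p)))
      try {
        while (true) {
          val (k, v) = in.readObject().asInstanceOf[(String, Int)]
          merged(k) = merged.get(k).map(combine(_, v)).getOrElse(v)
        }
      } catch { case _: EOFException => () } finally in.close()
    }
    merged.iterator
  }
}
```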
[GitHub] spark pull request: [SPARK-2423] Clean up SparkSubmit for readabil...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1349
[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1459#issuecomment-49271772

Merged into branch-0.9.
[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt
Github user mengxr closed the pull request at: https://github.com/apache/spark/pull/1459
[GitHub] spark pull request: [SPARK-2299] Consolidate various stageIdTo* ha...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1262#issuecomment-49271814

QA results for PR 1262:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16768/consoleFull
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49271938

QA tests have started for PR 1460. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16772/consoleFull
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1361#issuecomment-49272156

Jenkins, add to whitelist.
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1361#issuecomment-49272169

Jenkins, test this please.
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1447#issuecomment-49272425

QA tests have started for PR 1447. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16773/consoleFull
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1361#issuecomment-49272423

QA tests have started for PR 1361. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16774/consoleFull
[GitHub] spark pull request: SPARK-2553. CoGroupedRDD unnecessarily allocat...
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/1461

SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency per key

My humble opinion is that avoiding allocations in this performance-critical section is worth the extra code.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/spark sandy-spark-2553

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1461.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1461

commit 7eaf7f2a18ddea7ee47aacb5c5559c278a924899
Author: Sandy Ryza sa...@cloudera.com
Date: 2014-07-17T08:19:48Z

    SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency per key
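A generic, hedged illustration of the class of fix the title describes (toy code; the actual CoGroupedRDD patch differs): replacing a per-record tuple or wrapper allocation in the hot path with a small mutable holder allocated once per key.

```scala
import scala.collection.mutable

// One growable buffer per dependency, allocated once per key...
final class CoGroupBufs(numDeps: Int) {
  val bufs: Array[mutable.ArrayBuffer[Any]] =
    Array.fill(numDeps)(mutable.ArrayBuffer.empty[Any])
}

val grouped = mutable.HashMap.empty[String, CoGroupBufs]

// ...so the per-record hot path below appends without building a fresh
// Tuple2 or other wrapper for every incoming record.
def insert(key: String, depNum: Int, value: Any): Unit = {
  val entry = grouped.getOrElseUpdate(key, new CoGroupBufs(2))
  entry.bufs(depNum) += value
}
```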
[GitHub] spark pull request: [SPARK-2555] Support configuration spark.sched...
Github user li-zhihui commented on the pull request: https://github.com/apache/spark/pull/1462#issuecomment-49284642

@tgravescs
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49286439

QA results for PR 1460:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class Merger(object):
  class AutoSerializer(FramedSerializer):

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16772/consoleFull
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1361#issuecomment-49287245

QA results for PR 1361:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16774/consoleFull
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1447#issuecomment-49287721

QA results for PR 1447:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16773/consoleFull
[GitHub] spark pull request: [SPARK-2125] Add sort flag and move sort into ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1210#issuecomment-49293649

QA results for PR 1210:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16776/consoleFull
[GitHub] spark pull request: SPARK-2497 Exclude companion classes, with the...
GitHub user ScrapCodes opened a pull request: https://github.com/apache/spark/pull/1463

SPARK-2497 Exclude companion classes, with their corresponding objects.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ScrapCodes/spark-1 SPARK-2497/mima-exclude-all

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1463.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1463

commit 11eff8f7fe2a5dcd3bd6b0ed38c51c9f914e96c3
Author: Prashant Sharma prashan...@imaginea.com
Date: 2014-07-17T12:48:49Z

    SPARK-2497 Exclude companion classes, with their corresponding objects.
[GitHub] spark pull request: SPARK-2497 Exclude companion classes, with the...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1463#issuecomment-49302217

QA tests have started for PR 1463. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16777/consoleFull
[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/1385#discussion_r15055059

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -128,25 +123,13 @@ class HadoopRDD[K, V](
   // Returns a JobConf that will be used on slaves to obtain input splits for Hadoop reads.
   protected def getJobConf(): JobConf = {
-    val conf: Configuration = broadcastedConf.value.value
-    if (conf.isInstanceOf[JobConf]) {
-      // A user-broadcasted JobConf was provided to the HadoopRDD, so always use it.
-      conf.asInstanceOf[JobConf]
-    } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
+    val conf: JobConf = broadcastedConf.value.value
+    if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
--- End diff --

Yes, I agree with you. `broadcastedConf` is cached by the BlockManager in `Broadcast`.
[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1341#issuecomment-49303258

QA tests have started for PR 1341. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16778/consoleFull
[GitHub] spark pull request: [WIP]When the executor is thrown OutOfMemoryEr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1387#issuecomment-49303783

QA tests have started for PR 1387. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16779/consoleFull
[GitHub] spark pull request: SPARK-2497 Exclude companion classes, with the...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1463#issuecomment-49314286

QA results for PR 1463:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16777/consoleFull
[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1341#issuecomment-49316014

QA results for PR 1341:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16778/consoleFull
[GitHub] spark pull request: The driver perform garbage collection, when th...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1387#issuecomment-49316425

QA results for PR 1387:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16779/consoleFull
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49318202

QA tests have started for PR 1425. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16780/consoleFull
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49321741

QA tests have started for PR 1425. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16781/consoleFull
[GitHub] spark pull request: [Spark 2557] fix LOCAL_N_REGEX in createTaskSc...
GitHub user advancedxy opened a pull request: https://github.com/apache/spark/pull/1464

[Spark 2557] fix LOCAL_N_REGEX in createTaskScheduler and make local-n and local-n-failures consistent

[SPARK-2557](https://issues.apache.org/jira/browse/SPARK-2557)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/advancedxy/spark SPARK-2557

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1464.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1464

commit 3bbc668a9994a76683926902131ab06a90d96a85
Author: Ye Xianjin advance...@gmail.com
Date: 2014-07-17T15:12:00Z

    fix LOCAL_N_REGEX regular expression and make local_n_failures accept * as all cores on the computer

commit d844d67120edcc31409fabf59328eb8a651bbad3
Author: Ye Xianjin advance...@gmail.com
Date: 2014-07-17T15:21:16Z

    add local-*-n-failures, bad-local-n, bad-local-n-failures test case
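A sketch of the master-URL parsing involved, consistent with the PR description (the exact patterns in the patch may differ): the fix lets the `*` form, meaning all cores, match both the plain `local[n]` variant and the `local[n, maxFailures]` variant.

```scala
// local[N] or local[*]
val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
// local[N, maxFailures] or local[*, maxFailures]
val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r

def parse(master: String): String = master match {
  case "local" => "1 thread"
  case LOCAL_N_REGEX(threads) =>
    if (threads == "*") "all cores" else s"$threads threads"
  case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
    s"$threads threads, $maxFailures task failures allowed"
  case _ => "not a local master URL"
}

// parse("local[*]") == "all cores"; parse("local[4, 2]") hits the failures
// variant; parse("local[x]") falls through to the default case.
```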
[GitHub] spark pull request: [Spark 2557] fix LOCAL_N_REGEX in createTaskSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1464#issuecomment-49323035

Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1439#discussion_r15067812

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala ---
@@ -67,95 +61,12 @@ case class HiveTableScan(
   }

   @transient
-  private[this] val hadoopReader = new HadoopTableReader(relation.tableDesc, context)
-
-  /**
-   * The hive object inspector for this table, which can be used to extract values from the
-   * serialized row representation.
-   */
-  @transient
-  private[this] lazy val objectInspector =
-    relation.tableDesc.getDeserializer.getObjectInspector.asInstanceOf[StructObjectInspector]
-
-  /**
-   * Functions that extract the requested attributes from the hive output. Partitioned values are
-   * casted from string to its declared data type.
-   */
-  @transient
-  protected lazy val attributeFunctions: Seq[(Any, Array[String]) => Any] = {
-    attributes.map { a =>
-      val ordinal = relation.partitionKeys.indexOf(a)
-      if (ordinal >= 0) {
-        val dataType = relation.partitionKeys(ordinal).dataType
-        (_: Any, partitionKeys: Array[String]) => {
-          castFromString(partitionKeys(ordinal), dataType)
-        }
-      } else {
-        val ref = objectInspector.getAllStructFieldRefs
-          .find(_.getFieldName == a.name)
-          .getOrElse(sys.error(s"Can't find attribute $a"))
-        val fieldObjectInspector = ref.getFieldObjectInspector
-
-        val unwrapHiveData = fieldObjectInspector match {
-          case _: HiveVarcharObjectInspector =>
-            (value: Any) => value.asInstanceOf[HiveVarchar].getValue
-          case _: HiveDecimalObjectInspector =>
-            (value: Any) => BigDecimal(value.asInstanceOf[HiveDecimal].bigDecimalValue())
-          case _ =>
-            identity[Any] _
-        }
-
-        (row: Any, _: Array[String]) => {
-          val data = objectInspector.getStructFieldData(row, ref)
-          val hiveData = unwrapData(data, fieldObjectInspector)
-          if (hiveData != null) unwrapHiveData(hiveData) else null
-        }
-      }
-    }
-  }
+  private[this] val hadoopReader = new HadoopTableReader(attributes, relation, context)

   private[this] def castFromString(value: String, dataType: DataType) = {
     Cast(Literal(value), dataType).eval(null)
   }

-  private def addColumnMetadataToConf(hiveConf: HiveConf) {
--- End diff --

I would keep it. It is important to set the needed columns in the conf, so RCFile and ORC can know which columns should be skipped. Also, it seems `hiveConf.set(serdeConstants.LIST_COLUMN_TYPES, columnTypeNames)` and `hiveConf.set(serdeConstants.LIST_COLUMNS, columnInternalNames)` will be used to push down filters.
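A hedged sketch of the conf entries named in the comment above (the method shape and parameters are assumptions; only the two `serdeConstants` keys and the method name come from the diff): columnar formats such as RCFile and ORC consult these properties to decide which columns to materialize, so dropping this setup would silently disable column pruning.

```scala
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.serde.serdeConstants

def addColumnMetadataToConf(hiveConf: HiveConf,
                            columnNames: Seq[String],
                            columnTypes: Seq[String]): Unit = {
  // Columnar readers check these keys to skip columns the scan doesn't need.
  hiveConf.set(serdeConstants.LIST_COLUMNS, columnNames.mkString(","))
  hiveConf.set(serdeConstants.LIST_COLUMN_TYPES, columnTypes.mkString(","))
}
```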
[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1439#discussion_r15068484

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -156,33 +158,43 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
     }

     // Create local references so that the outer object isn't serialized.
-    val tableDesc = _tableDesc
+    val tableDesc = relation.tableDesc
     val broadcastedHiveConf = _broadcastedHiveConf
     val localDeserializer = partDeserializer
+    val mutableRow = new GenericMutableRow(attributes.length)
+
+    // split the attributes (output schema) into 2 categories:
+    // (partition keys, ordinal), (normal attributes, ordinal), the ordinal mean the
+    // position of the in the output Row.
--- End diff --

position of the attribute?
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49334156

QA tests have started for PR 1460. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16782/consoleFull
[GitHub] spark pull request: SPARK-2083 Add support for spark.local.maxFail...
GitHub user kbzod opened a pull request: https://github.com/apache/spark/pull/1465

SPARK-2083 Add support for spark.local.maxFailures configuration property

The logic in `SparkContext` for creating a new task scheduler now looks for a spark.local.maxFailures property to specify the number of task failures in a local job that will cause the job to fail. Its default is the prior fixed value of 1 (no retries). The patch includes documentation updates and new unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kbzod/spark SPARK-2083

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1465.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1465

commit a8b369dad154b7773ec8b535cf399b428ec3952c
Author: Bill Havanki bhava...@cloudera.com
Date: 2014-07-17T16:01:22Z

    SPARK-2083 Add support for spark.local.maxFailures configuration property
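A hedged sketch of the lookup the description implies (the property name and default come from the PR text; the `SparkConf` usage is assumed rather than quoted from the patch):

```scala
import org.apache.spark.SparkConf

// Default of 1 preserves the old behavior: a local job fails on the first
// task failure unless spark.local.maxFailures raises the limit.
def localMaxTaskFailures(conf: SparkConf): Int =
  conf.getInt("spark.local.maxFailures", 1)
```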
[GitHub] spark pull request: SPARK-2083 Add support for spark.local.maxFail...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1465#issuecomment-49335741 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2411] Add a history-not-found page to s...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1336#issuecomment-49336964 I have updated the screenshots again. Anything else? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1478.2 Fix incorrect NioServerSocketChan...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/1466

SPARK-1478.2 Fix incorrect NioServerSocketChannelFactory constructor call

The line break inadvertently meant this was interpreted as a call to the no-arg constructor, which doesn't even exist in older versions of Netty. (Also fixed a val name typo.)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-1478.2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1466.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1466

commit 59c3501262d1358f5dc5580c19f2d30858cfcabc
Author: Sean Owen sro...@gmail.com
Date: 2014-07-17T17:23:48Z

Line break caused Scala to interpret NioServerSocketChannelFactory constructor as the no-arg version, which is not even present in some versions of Netty

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2340] Resolve event logging and History...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1280#issuecomment-49337830

QA tests have started for PR 1280. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16784/consoleFull

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1478.2 Fix incorrect NioServerSocketChan...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1466#issuecomment-49337826

QA tests have started for PR 1466. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16783/consoleFull

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49338675

I think we are not clear on the boundary between a `TableReader` and a physical `TableScan` operator (e.g. `HiveTableScan`). It seems we just want `TableReader` to create `RDD`s (general-purpose work) and, inside a `TableScan` operator, to create Catalyst `Row`s (table-specific work). However, when we look at `HadoopTableReader`, it is actually a `HiveTableReader`: for every Hive partition, we create a `HadoopRDD` (requiring Hive-specific code) and deserialize Hive rows. I am not sure `TableReader` is a good abstraction. I think it makes sense to remove the `TableReader` trait and add an abstract `TableScan` class (inheriting `LeafNode`); all existing table scan operators would inherit this abstract `TableScan` class. If we think it is the right approach, I can do it in another PR. @marmbrus, @liancheng, @rxin, @chenghao-intel thoughts?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15072735

--- Diff: python/pyspark/rdd.py ---
@@ -168,6 +170,123 @@ def _replaceRoot(self, value):
             self._sink(1)
 
 
+class Merger(object):
+
+    """
+    External merger will dump the aggregated data into disks when memory usage is above
+    the limit, then merge them together.
+
+    >>> combiner = lambda x, y:x+y
+    >>> merger = Merger(combiner, 10)
+    >>> N = 1
+    >>> merger.merge(zip(xrange(N), xrange(N)) * 10)
+    >>> merger.spills
+    100
+    >>> sum(1 for k,v in merger.iteritems())
+    1
+    """
+
+    PARTITIONS = 64
+    BATCH = 1000
+
+    def __init__(self, combiner, memory_limit=256, path="/tmp/pyspark", serializer=None):
--- End diff --

This should actually rotate among the storage directories in `spark.local.dir`. Check out how the DiskStore works in Java.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
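To make the suggestion above concrete, here is a minimal sketch of rotating spill files across the configured local directories, in the spirit of Java's DiskStore. The `SPARK_LOCAL_DIRS` environment variable and the directory layout are assumptions for illustration, not the implementation that was eventually merged:

```python
import os

def get_local_dirs(sub):
    # Assumption: the worker exposes spark.local.dir to Python as
    # SPARK_LOCAL_DIRS, possibly a comma-separated list of directories.
    path = os.environ.get("SPARK_LOCAL_DIRS", "/tmp")
    return [os.path.join(d.strip(), "python", str(os.getpid()), sub)
            for d in path.split(",")]

def get_spill_dir(dirs, spill_count):
    # Rotate among the local dirs so spill I/O is spread over all disks,
    # the way DiskStore distributes blocks across spark.local.dir entries.
    d = dirs[spill_count % len(dirs)]
    if not os.path.exists(d):
        os.makedirs(d)
    return d
```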
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15072768

--- Diff: python/pyspark/rdd.py ---
@@ -168,6 +170,123 @@ def _replaceRoot(self, value):
             self._sink(1)
 
 
+class Merger(object):
+
+    """
+    External merger will dump the aggregated data into disks when memory usage is above
+    the limit, then merge them together.
+
+    >>> combiner = lambda x, y:x+y
+    >>> merger = Merger(combiner, 10)
+    >>> N = 1
+    >>> merger.merge(zip(xrange(N), xrange(N)) * 10)
+    >>> merger.spills
+    100
+    >>> sum(1 for k,v in merger.iteritems())
+    1
+    """
+
+    PARTITIONS = 64
+    BATCH = 1000
+
+    def __init__(self, combiner, memory_limit=256, path="/tmp/pyspark", serializer=None):
+        self.combiner = combiner
+        self.path = os.path.join(path, str(os.getpid()))
+        self.memory_limit = memory_limit
+        self.serializer = serializer or BatchedSerializer(AutoSerializer(), 1024)
+        self.item_limit = None
+        self.data = {}
+        self.pdata = []
+        self.spills = 0
+
+    def used_memory(self):
+        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
+        if platform.system() == 'Linux':
+            rss >>= 10
+        elif platform.system() == 'Darwin':
+            rss >>= 20
--- End diff --

We also need to make it work on Windows, or at least fail gracefully. Do you know how to get this number on Windows? You can spin up a Windows machine on EC2 to try it.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
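On the Windows question: the `resource` module is Unix-only, so one hedged approach is to try it first and fall back to the third-party `psutil` package, or to 0 (which effectively disables spilling) when neither is available. A sketch of the idea, not what the PR ended up doing:

```python
import platform

def used_memory():
    """Return resident set size in MB, or 0 if it cannot be determined."""
    try:
        import resource  # raises ImportError on Windows
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        if platform.system() == 'Linux':
            return rss >> 10   # ru_maxrss is in KB on Linux
        elif platform.system() == 'Darwin':
            return rss >> 20   # ru_maxrss is in bytes on OS X
        return rss
    except ImportError:
        pass
    try:
        import psutil  # optional third-party dependency
        return psutil.Process().memory_info().rss >> 20  # bytes -> MB
    except ImportError:
        return 0  # unknown platform: never report memory pressure
```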
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15072860

--- Diff: python/pyspark/rdd.py ---
@@ -168,6 +170,123 @@ def _replaceRoot(self, value):
             self._sink(1)
 
 
+class Merger(object):
+
+    """
+    External merger will dump the aggregated data into disks when memory usage is above
+    the limit, then merge them together.
+
+    >>> combiner = lambda x, y:x+y
+    >>> merger = Merger(combiner, 10)
+    >>> N = 1
+    >>> merger.merge(zip(xrange(N), xrange(N)) * 10)
+    >>> merger.spills
+    100
+    >>> sum(1 for k,v in merger.iteritems())
+    1
+    """
+
+    PARTITIONS = 64
+    BATCH = 1000
+
+    def __init__(self, combiner, memory_limit=256, path="/tmp/pyspark", serializer=None):
+        self.combiner = combiner
+        self.path = os.path.join(path, str(os.getpid()))
+        self.memory_limit = memory_limit
+        self.serializer = serializer or BatchedSerializer(AutoSerializer(), 1024)
--- End diff --

Probably need to make the batch size an argument and pass in SparkContext._batchSize.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15072948

--- Diff: python/pyspark/rdd.py ---
@@ -1247,15 +1366,12 @@ def combineLocally(iterator):
             return combiners.iteritems()
         locally_combined = self.mapPartitions(combineLocally)
         shuffled = locally_combined.partitionBy(numPartitions)
-
+
+        executorMemory = self.ctx._jsc.sc().executorMemory()
         def _mergeCombiners(iterator):
-            combiners = {}
-            for (k, v) in iterator:
-                if k not in combiners:
-                    combiners[k] = v
-                else:
-                    combiners[k] = mergeCombiners(combiners[k], v)
-            return combiners.iteritems()
+            merger = Merger(mergeCombiners, executorMemory * 0.7)
--- End diff --

We probably want to use the same memory as the Java append-only map, which is spark.shuffle.memoryFraction * executorMemory. Check how that gets created in ExternalAppendOnlyMap.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
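A sketch of how the Python side might mirror the JVM setting pointed to above; `spark.shuffle.memoryFraction` is the property ExternalAppendOnlyMap consults, but reading it through `ctx._conf` and the 0.3 default are assumptions to verify against the Scala code:

```python
def shuffle_memory_limit(ctx):
    # Match the JVM's external append-only map budget:
    # spark.shuffle.memoryFraction * executor memory (MB).
    fraction = float(ctx._conf.get("spark.shuffle.memoryFraction", "0.3"))
    executor_memory = ctx._jsc.sc().executorMemory()  # assumed to be in MB
    return executor_memory * fraction
```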
[GitHub] spark pull request: [SPARK-2411] Add a history-not-found page to s...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1336#discussion_r15072986

--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -667,29 +664,47 @@ private[spark] class Master(
    */
   def rebuildSparkUI(app: ApplicationInfo): Boolean = {
     val appName = app.desc.name
-    val eventLogDir = app.desc.eventLogDir.getOrElse { return false }
+    val eventLogDir = app.desc.eventLogDir.getOrElse {
+      // Event logging is not enabled for this application
+      app.desc.appUiUrl = "/history/not-found"
+      return false
+    }
     val fileSystem = Utils.getHadoopFileSystem(eventLogDir)
     val eventLogInfo = EventLoggingListener.parseLoggingInfo(eventLogDir, fileSystem)
     val eventLogPaths = eventLogInfo.logPaths
     val compressionCodec = eventLogInfo.compressionCodec
-    if (!eventLogPaths.isEmpty) {
-      try {
-        val replayBus = new ReplayListenerBus(eventLogPaths, fileSystem, compressionCodec)
-        val ui = new SparkUI(
-          new SparkConf, replayBus, appName + " (completed)", "/history/" + app.id)
-        replayBus.replay()
-        app.desc.appUiUrl = ui.basePath
-        appIdToUI(app.id) = ui
-        webUi.attachSparkUI(ui)
-        return true
-      } catch {
-        case e: Exception =>
-          logError("Exception in replaying log for application %s (%s)".format(appName, app.id), e)
-      }
-    } else {
-      logWarning("Application %s (%s) has no valid logs: %s".format(appName, app.id, eventLogDir))
+
+    if (eventLogPaths.isEmpty) {
--- End diff --

What happens here if `eventLogDir` doesn't exist?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15073011

--- Diff: python/pyspark/serializers.py ---
@@ -297,6 +297,33 @@ class MarshalSerializer(FramedSerializer):
     loads = marshal.loads
 
 
+class AutoSerializer(FramedSerializer):
+
+    """
+    Choose marshal or cPickle as serialization protocol autumatically
+    """
+
+    def __init__(self):
+        FramedSerializer.__init__(self)
+        self._type = None
+
+    def dumps(self, obj):
+        try:
+            if self._type is not None:
+                raise TypeError("fallback")
+            return 'M' + marshal.dumps(obj)
+        except Exception:
+            self._type = 'P'
+            return 'P' + cPickle.dumps(obj, -1)
--- End diff --

If the objects are not marshal-able but are pickle-able, is there a big performance cost to throwing an exception on each write? Would be good to test this, because if not, we can make this serializer our default where we now use Pickle. Even if there is a cost, maybe we can do something where if 10% of the objects written fail to marshal, we switch to always using pickle.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
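The cost question above is easy to measure directly. A rough Python 2 micro-benchmark along these lines would show how expensive the raise-and-catch path is relative to going straight to cPickle; the test class and iteration count are illustrative only:

```python
import marshal
import cPickle
import timeit

class Unmarshalable(object):
    # marshal rejects user-defined class instances, forcing the fallback
    pass

obj = Unmarshalable()

def dumps_with_fallback(o):
    # Mimics AutoSerializer.dumps: try marshal, fall back to cPickle
    try:
        return 'M' + marshal.dumps(o)
    except Exception:
        return 'P' + cPickle.dumps(o, -1)

t_fallback = timeit.timeit(lambda: dumps_with_fallback(obj), number=100000)
t_pickle = timeit.timeit(lambda: cPickle.dumps(obj, -1), number=100000)
print "fallback: %.3fs  plain cPickle: %.3fs" % (t_fallback, t_pickle)
```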
[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49341477

@chenghao-intel explained the root cause in https://issues.apache.org/jira/browse/SPARK-2523. Basically, we should use partition-specific `ObjectInspector`s to extract fields instead of using the `ObjectInspector` set in the `TableDesc`. If partitions of a table use different SerDes, their `ObjectInspector`s will be different. @chenghao-intel, can you add unit tests? Is there any Hive query test that can be included?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15073144

--- Diff: python/pyspark/serializers.py ---
@@ -297,6 +297,33 @@ class MarshalSerializer(FramedSerializer):
     loads = marshal.loads
 
 
+class AutoSerializer(FramedSerializer):
+
+    """
+    Choose marshal or cPickle as serialization protocol autumatically
+    """
+
+    def __init__(self):
+        FramedSerializer.__init__(self)
+        self._type = None
+
+    def dumps(self, obj):
+        try:
+            if self._type is not None:
+                raise TypeError("fallback")
+            return 'M' + marshal.dumps(obj)
+        except Exception:
+            self._type = 'P'
+            return 'P' + cPickle.dumps(obj, -1)
+
+    def loads(self, stream):
+        _type = stream[0]
+        if _type == 'M':
+            return marshal.loads(stream[1:])
+        elif _type == 'P':
+            return cPickle.loads(stream[1:])
--- End diff --

Seems to be more efficient to read a byte with stream.read(1) than to index into the stream, but I'm not sure.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
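For reference, the two access patterns being compared look like this; whether `loads` receives an in-memory string or a file-like object depends on how FramedSerializer frames the data, so both variants are shown as a sketch:

```python
from cStringIO import StringIO

data = 'M' + 'payload'

# Variant 1: slicing a string that is already fully in memory
tag, body = data[0], data[1:]

# Variant 2: reading one byte from a file-like stream, then the rest
stream = StringIO(data)
tag2 = stream.read(1)
body2 = stream.read()

assert (tag, body) == (tag2, body2)
```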
[GitHub] spark pull request: SPARK-2497 Exclude companion classes, with the...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1463#issuecomment-49341909

@ScrapCodes could you explain a bit more how this fixes SPARK-2497? If I look at the original false positive, the issue reported was not with a companion class. It was actually in a method of the object itself. Also, the method was annotated, not the object.

```
Source: https://github.com/apache/spark/pull/886/files#diff-0f907b47af6261abe00eb31097a9493bR41
Error:
[error]  * method calculate(Double,Double)Double in object org.apache.spark.mllib.tree.impurity.Gini's type has changed; was (Double,Double)Double, is now: (Array[Double],Double)Double
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.mllib.tree.impurity.Gini.calculate")
```

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15073412

--- Diff: python/pyspark/rdd.py ---
@@ -168,6 +170,123 @@ def _replaceRoot(self, value):
             self._sink(1)
 
 
+class Merger(object):
+
+    """
+    External merger will dump the aggregated data into disks when memory usage is above
+    the limit, then merge them together.
+
+    >>> combiner = lambda x, y:x+y
+    >>> merger = Merger(combiner, 10)
+    >>> N = 1
+    >>> merger.merge(zip(xrange(N), xrange(N)) * 10)
+    >>> merger.spills
+    100
+    >>> sum(1 for k,v in merger.iteritems())
+    1
+    """
+
+    PARTITIONS = 64
+    BATCH = 1000
+
+    def __init__(self, combiner, memory_limit=256, path="/tmp/pyspark", serializer=None):
+        self.combiner = combiner
+        self.path = os.path.join(path, str(os.getpid()))
+        self.memory_limit = memory_limit
+        self.serializer = serializer or BatchedSerializer(AutoSerializer(), 1024)
+        self.item_limit = None
+        self.data = {}
+        self.pdata = []
+        self.spills = 0
+
+    def used_memory(self):
+        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
+        if platform.system() == 'Linux':
+            rss >>= 10
+        elif platform.system() == 'Darwin':
+            rss >>= 20
+        return rss
+
+    def merge(self, iterator):
+        iterator = iter(iterator)
+        d = self.data
+        comb = self.combiner
+        c = 0
+        for k, v in iterator:
+            if k in d:
+                d[k] = comb(d[k], v)
+            else:
+                d[k] = v
+
+            if self.item_limit is not None:
+                continue
+
+            c += 1
+            if c % self.BATCH == 0 and self.used_memory() > self.memory_limit:
+                self.item_limit = c
+                self._first_spill()
+                self._partitioned_merge(iterator)
+                return
+
+    def _partitioned_merge(self, iterator):
+        comb = self.combiner
+        c = 0
+        for k, v in iterator:
+            d = self.pdata[hash(k) % self.PARTITIONS]
+            if k in d:
+                d[k] = comb(d[k], v)
+            else:
+                d[k] = v
+            c += 1
+            if c >= self.item_limit:
+                self._spill()
+                c = 0
+
+    def _first_spill(self):
+        path = os.path.join(self.path, str(self.spills))
+        if not os.path.exists(path):
+            os.makedirs(path)
+        streams = [open(os.path.join(path, str(i)), 'w')
+                   for i in range(self.PARTITIONS)]
--- End diff --

In the future we might want to compress these with GzipFile, but for now just add a TODO.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215 [MLLIB]: Clustering: Index out of b...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/1407#discussion_r15073983

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LocalKMeans.scala ---
@@ -59,6 +59,11 @@ private[mllib] object LocalKMeans extends Logging {
         cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
         j += 1
       }
+      if (j == 0) {
+        logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
+          s" Using duplicate point for center k = $i.")
+        j = 1
--- End diff --

I'll go with the second suggestion.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2542] Exit Code Class should be renamed...
GitHub user sarutak opened a pull request: https://github.com/apache/spark/pull/1467 [SPARK-2542] Exit Code Class should be renamed and placed package properly You can merge this pull request into a Git repository by running: $ git pull https://github.com/sarutak/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1467.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1467 commit 01acd8e9033045d9d5f9d5c230c427a007535cdb Author: Kousuke Saruta saru...@oss.nttdata.co.jp Date: 2014-07-17T18:11:34Z Renamed and repackaged ExecutorExitCode.scala properly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2542] Exit Code Class should be renamed...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1467#issuecomment-49343702 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/1361#issuecomment-49344174

Looks like the basic test for correct final params passes, but not the stricter test for improvement on every update. Both pass locally. My guess is that it's running a bit slower on Jenkins, so the updates don't complete fast enough (I can create a failure locally by making the test data rate too high). I'll play with this; it might work to just slow down the data rate.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49346100

QA results for PR 1460:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class Merger(object):
  class AutoSerializer(FramedSerializer):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16782/consoleFull

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215 [MLLIB]: Clustering: Index out of b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-49347076

QA tests have started for PR 1407. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16785/consoleFull

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215 [MLLIB]: Clustering: Index out of b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-49347163

QA results for PR 1407:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16785/consoleFull

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/1362#discussion_r15076386

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1107,7 +1106,6 @@ class DAGScheduler(
             case shufDep: ShuffleDependency[_, _, _] =>
               val mapStage = getShuffleMapStage(shufDep, stage.jobId)
               if (!mapStage.isAvailable) {
-                visitedStages += mapStage
--- End diff --

@mateiz It looks to me like this is what was always intended, but the `visitedStages` check has actually been missing for the past couple of years:

```scala
if (!mapStage.isAvailable && !visitedStages(mapStage)) {
  visitedStages += mapStage
  visit(mapStage.rdd)
}  // Otherwise there's no need to follow the dependency back
```

Making that change works for me in the sense that all of the test suites continue to pass, but I haven't yet got any relative performance numbers. Probably needs to go in a separate JIRA & PR.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-49348179

Bringing the discussion back online. Thanks for all the input so far. Ran a few experiments yesterday and today. The number of executors (which was the other main handle we wanted to factor in) doesn't seem to have any noticeable impact. Tried a few other parameters such as num_partitions and default_parallelism, but nothing sticks. Confirmed the proportionality with container size. Have also been trying to tune the multiplier to minimize potential waste, and I think 6% (as opposed to 7% as we currently have) is the lowest we should go. Modifying the PR accordingly.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1478.2 Fix incorrect NioServerSocketChan...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/1466#discussion_r15076631

--- Diff: external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala ---
@@ -153,15 +153,15 @@ class FlumeReceiver(
   private def initServer() = {
     if (enableDecompression) {
-      val channelFactory = new NioServerSocketChannelFactory
-        (Executors.newCachedThreadPool(), Executors.newCachedThreadPool());
-      val channelPipelieFactory = new CompressionChannelPipelineFactory()
+      val channelFactory = new NioServerSocketChannelFactory(Executors.newCachedThreadPool(),
+        Executors.newCachedThreadPool())
+      val channelPipelineFactory = new CompressionChannelPipelineFactory()
--- End diff --

Good catch on the spelling mistake!

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1478.2 Fix incorrect NioServerSocketChan...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/1466#discussion_r15076673

--- Diff: external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala ---
@@ -153,15 +153,15 @@ class FlumeReceiver(
   private def initServer() = {
     if (enableDecompression) {
-      val channelFactory = new NioServerSocketChannelFactory
-        (Executors.newCachedThreadPool(), Executors.newCachedThreadPool());
-      val channelPipelieFactory = new CompressionChannelPipelineFactory()
+      val channelFactory = new NioServerSocketChannelFactory(Executors.newCachedThreadPool(),
+        Executors.newCachedThreadPool())
+      val channelPipelineFactory = new CompressionChannelPipelineFactory()
       new NettyServer(
         responder,
         new InetSocketAddress(host, port),
-        channelFactory,
-        channelPipelieFactory,
+        channelFactory,
+        channelPipelineFactory,
--- End diff --

There was a typo in the variable name. The other line was just the IDE stripping a trailing space, I think.

On Jul 17, 2014 7:52 PM, Tathagata Das notificati...@github.com wrote:

> In external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:
>
>        new NettyServer(
>          responder,
>          new InetSocketAddress(host, port),
> -        channelFactory,
> -        channelPipelieFactory,
> +        channelFactory,
> +        channelPipelineFactory,
>
> What changed here? Was there a formatting issue?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1362#discussion_r15076921 --- Diff: core/src/main/scala/org/apache/spark/Dependency.scala --- @@ -32,8 +32,6 @@ abstract class Dependency[T](val rdd: RDD[T]) extends Serializable /** * :: DeveloperApi :: - * Base class for dependencies where each partition of the parent RDD is used by at most one - * partition of the child RDD. Narrow dependencies allow for pipelined execution. --- End diff -- Sounds good. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1362#discussion_r15076891 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1107,7 +1106,6 @@ class DAGScheduler( case shufDep: ShuffleDependency[_, _, _] = val mapStage = getShuffleMapStage(shufDep, stage.jobId) if (!mapStage.isAvailable) { -visitedStages += mapStage --- End diff -- Hmm, yeah, it seems that should be there. On the other hand, will tracking the visited RDDs be enough? I guess we need to look into it, but it doesn't seem urgent. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49350307 Merged in master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2340] Resolve event logging and History...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1280#issuecomment-49350379

QA results for PR 1280:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16784/consoleFull

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1450 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1478.2 Fix incorrect NioServerSocketChan...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1466#issuecomment-49350436

QA results for PR 1466:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16783/consoleFull

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15078093

--- Diff: python/pyspark/rdd.py ---
@@ -168,6 +170,123 @@ def _replaceRoot(self, value):
             self._sink(1)
 
 
+class Merger(object):
+
+    """
+    External merger will dump the aggregated data into disks when memory usage is above
+    the limit, then merge them together.
+
+    >>> combiner = lambda x, y:x+y
+    >>> merger = Merger(combiner, 10)
+    >>> N = 1
+    >>> merger.merge(zip(xrange(N), xrange(N)) * 10)
+    >>> merger.spills
+    100
+    >>> sum(1 for k,v in merger.iteritems())
+    1
+    """
+
+    PARTITIONS = 64
+    BATCH = 1000
+
+    def __init__(self, combiner, memory_limit=256, path="/tmp/pyspark", serializer=None):
+        self.combiner = combiner
+        self.path = os.path.join(path, str(os.getpid()))
+        self.memory_limit = memory_limit
+        self.serializer = serializer or BatchedSerializer(AutoSerializer(), 1024)
+        self.item_limit = None
+        self.data = {}
+        self.pdata = []
+        self.spills = 0
+
+    def used_memory(self):
+        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
+        if platform.system() == 'Linux':
+            rss >>= 10
+        elif platform.system() == 'Darwin':
+            rss >>= 20
+        return rss
+
+    def merge(self, iterator):
+        iterator = iter(iterator)
+        d = self.data
+        comb = self.combiner
+        c = 0
+        for k, v in iterator:
+            if k in d:
+                d[k] = comb(d[k], v)
+            else:
+                d[k] = v
+
+            if self.item_limit is not None:
+                continue
+
+            c += 1
+            if c % self.BATCH == 0 and self.used_memory() > self.memory_limit:
+                self.item_limit = c
+                self._first_spill()
+                self._partitioned_merge(iterator)
+                return
+
+    def _partitioned_merge(self, iterator):
+        comb = self.combiner
+        c = 0
+        for k, v in iterator:
+            d = self.pdata[hash(k) % self.PARTITIONS]
+            if k in d:
+                d[k] = comb(d[k], v)
+            else:
+                d[k] = v
+            c += 1
+            if c >= self.item_limit:
+                self._spill()
+                c = 0
+
+    def _first_spill(self):
+        path = os.path.join(self.path, str(self.spills))
+        if not os.path.exists(path):
+            os.makedirs(path)
+        streams = [open(os.path.join(path, str(i)), 'w')
+                   for i in range(self.PARTITIONS)]
--- End diff --

Because the data is compressed and decompressed only once each way, it's better to use a lightweight compression method such as LZ4/Snappy/LZO. Personally, I prefer LZ4 because it has a similar compression ratio to Snappy, LZO, and LZF but much higher performance. I will add them as part of BatchedSerializer.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
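As a dependency-free baseline for the compression idea discussed above, the spill streams could be wrapped in stdlib `gzip` at a low compression level; LZ4/Snappy/LZO would require third-party bindings. This is a sketch of the approach, not the code that was merged:

```python
import gzip
import os

def open_spill_streams(path, partitions, compress=True):
    """Open one (optionally gzip-compressed) write stream per partition."""
    streams = []
    for i in range(partitions):
        name = os.path.join(path, str(i))
        if compress:
            # compresslevel=1 trades ratio for speed, in the spirit of LZ4;
            # GzipFile objects support write()/close() like plain files, so
            # the merger's serializer can use them unchanged.
            streams.append(gzip.open(name, 'wb', compresslevel=1))
        else:
            streams.append(open(name, 'wb'))
    return streams
```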