[GitHub] spark pull request: [SPARK-4651] Adding -Phadoop-2.4+ to compile S...

2014-11-29 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3512#issuecomment-64971933 This has come up before, and I thought we resolved that the hadoop-2.4 profile would be used for all Hadoop versions above 2.5, until a Hadoop version changes dependencies

[GitHub] spark pull request: [SPARK-4584] [yarn] Remove security manager fr...

2014-11-26 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3484#issuecomment-64699567 If this is behavior that we encourage users to avoid, I would think that defaulting to failure when System.exit is called would be preferable. Also, I'm guessing

[GitHub] spark pull request: SPARK-4628: Put all external projects behind a...

2014-11-26 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3485#discussion_r20958933 --- Diff: pom.xml --- @@ -1201,6 +1196,18 @@ </dependencies> </profile> + <!-- External projects are not built unless this flag

[GitHub] spark pull request: SPARK-4628: Put all external projects behind a...

2014-11-26 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3485#issuecomment-64700676 external/kafka will still end up being built without the flag, right? Also, will this not make the default build fail, because the examples depend on the external

[GitHub] spark pull request: SPARK-4628: Put all external projects behind a...

2014-11-26 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3485#issuecomment-64728995 I missed the context for why we would just put mqtt behind the flag. Something in its dependency graph is breaking the build?

[GitHub] spark pull request: [SPARK-4632][Streaming] Upgrade MQTT dependenc...

2014-11-26 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3487#discussion_r20974521 --- Diff: external/mqtt/pom.xml --- @@ -43,8 +43,8 @@ </dependency> <dependency> <groupId>org.eclipse.paho</groupId>

[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....

2014-11-25 Thread sryza
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/3471 SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period. You can merge this pull request into a Git repository by running: $ git pull https

[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....

2014-11-25 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3471#issuecomment-64516335 @tgravescs

[GitHub] spark pull request: SPARK-4567. Make SparkJobInfo and SparkStageIn...

2014-11-24 Thread sryza
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/3426 SPARK-4567. Make SparkJobInfo and SparkStageInfo serializable You can merge this pull request into a Git repository by running: $ git pull https://github.com/sryza/spark sandy-spark-4567
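For context, serializability matters because callers may want to snapshot these status objects and ship them off the driver. A minimal sketch, assuming the SparkStatusTracker API of that era; the example job and object name are illustrative:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.spark.{SparkConf, SparkContext}

object JobInfoSnapshot {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-info-snapshot").setMaster("local[2]"))
    sc.parallelize(1 to 1000).count()  // run a job so there is something to report

    // Once SparkJobInfo/SparkStageInfo are java.io.Serializable, a monitoring
    // process can snapshot them and move them around with plain Java serialization.
    for {
      jobId <- sc.statusTracker.getJobIdsForGroup(null).headOption
      info  <- sc.statusTracker.getJobInfo(jobId)
    } {
      val bytes = new ByteArrayOutputStream()
      new ObjectOutputStream(bytes).writeObject(info)
      println(s"job ${info.jobId()} status=${info.status()} serialized to ${bytes.size()} bytes")
    }
    sc.stop()
  }
}
```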

[GitHub] spark pull request: [WIP][SPARK-2926][Shuffle]Add MR style sort-me...

2014-11-24 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3438#discussion_r20845166 --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala --- @@ -17,6 +17,7 @@ package org.apache.spark.rdd +import

[GitHub] spark pull request: [WIP][SPARK-2926][Shuffle]Add MR style sort-me...

2014-11-24 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3438#issuecomment-64317326 The main changes we implemented here are: * When a shuffle operation has a key ordering, sort records by key on the map side in addition to sorting by partition
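The map-side behavior described here can already be observed through the public API when an ordering is supplied explicitly. A minimal sketch, not the PR's internals:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // OrderedRDDFunctions implicits for (K, V) RDDs

object MapSideSortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-side-sort").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 4), ("a", 3)))

    // Each output partition comes back sorted by key, so the reduce side can merge
    // sorted runs instead of hashing every record -- the same idea the PR applies
    // automatically whenever the shuffle dependency carries a key ordering.
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
    sorted.glom().collect().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString(", ")}")
    }
    sc.stop()
  }
}
```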

[GitHub] spark pull request: [SPARK-4534][Core]JavaSparkContext create new ...

2014-11-21 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3403#issuecomment-64018115 preferredNodeLocalityData is currently broken (see SPARK-2089), and we're discussing changing the API for it. I think it would be best to hold off on this change until

[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...

2014-11-21 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3409#discussion_r20752652 --- Diff: conf/spark-env.sh.template --- @@ -40,6 +40,7 @@ # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y

[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...

2014-11-21 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3409#issuecomment-64062102 Adding the ability to specify java options for the ExecutorLauncher AM in yarn-client mode sounds reasonable. I think we should use a config property name that's more
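For illustration, a sketch of how such a setting would look to a user. The property name spark.yarn.am.extraJavaOptions is an assumption here (the thread is still debating naming); spark.driver.extraJavaOptions is the existing driver-side setting:

```scala
import org.apache.spark.SparkConf

// In yarn-client mode the AM is only an ExecutorLauncher, so its JVM options are
// configured separately from the driver's. AM property name assumed, per the lead-in.
val conf = new SparkConf()
  .setAppName("am-java-options-example")
  .set("spark.yarn.am.extraJavaOptions", "-XX:+UseG1GC")      // assumed AM-side property
  .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC")       // existing driver-side property
```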

[GitHub] spark pull request: Documentation: add description for repartition...

2014-11-20 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3390#issuecomment-63915538 +1

[GitHub] spark pull request: [SPARK-4472][Shell] Print Spark context avail...

2014-11-19 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3341#discussion_r20563523 --- Diff: repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala --- @@ -61,11 +61,14 @@ class SparkILoop(in0: Option[BufferedReader], protected

[GitHub] spark pull request: [Spark-4484] Treat maxResultSize as unlimited ...

2014-11-19 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3360#discussion_r20563746 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -536,7 +536,7 @@ private[spark] class TaskSetManager

[GitHub] spark pull request: [Spark-4484] Treat maxResultSize as unlimited ...

2014-11-19 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3360#issuecomment-63609847 Is maxResultSize documented anywhere? I couldn't find it. If not, can we add it to the configuration page?

[GitHub] spark pull request: [SPARK-4472][Shell] Print Spark context avail...

2014-11-19 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3341#discussion_r20564849 --- Diff: repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala --- @@ -61,11 +61,14 @@ class SparkILoop(in0: Option[BufferedReader], protected

[GitHub] spark pull request: [Spark-4484] Treat maxResultSize as unlimited ...

2014-11-19 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3360#issuecomment-63720314 I think just referencing the property by its full name and allowing the user to look it up should be sufficient. Recommending boosting the limit is not right in all
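For reference, the property under discussion is spark.driver.maxResultSize. A short example of setting it; the values are just examples:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.driver.maxResultSize caps the total serialized size of results that
// actions like collect() may bring back to the driver; "0" means unlimited,
// which is the special case SPARK-4484 is about.
val conf = new SparkConf()
  .setAppName("max-result-size-example")
  .set("spark.driver.maxResultSize", "2g")   // or "0" for unlimited
val sc = new SparkContext(conf)
```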

[GitHub] spark pull request: [SPARK-4495] Fix memory leak in JobProgressLis...

2014-11-19 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3372#discussion_r20628831 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala --- @@ -40,41 +40,108 @@ class JobProgressListener(conf: SparkConf) extends

[GitHub] spark pull request: [SPARK-4495] Fix memory leak in JobProgressLis...

2014-11-19 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3372#discussion_r20628990 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala --- @@ -40,41 +40,108 @@ class JobProgressListener(conf: SparkConf) extends

[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-19 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63765566 This seems like probably a great idea. Do you know what the overhead of including a classtag is? Does it mean an extra pointer per object?
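To make the overhead question concrete: a ClassTag is an extra implicit argument at the call site, not a field on every element, as this small standalone example shows:

```scala
import scala.reflect.ClassTag

// The ClassTag travels as one implicit parameter per call; the elements
// themselves carry no additional pointer.
def buildArray[T: ClassTag](n: Int)(f: Int => T): Array[T] = {
  val arr = new Array[T](n)   // constructing Array[T] is what requires the ClassTag
  var i = 0
  while (i < n) { arr(i) = f(i); i += 1 }
  arr
}

println(buildArray(5)(i => i * i).mkString(", "))  // 0, 1, 4, 9, 16
```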

[GitHub] spark pull request: [DOCS][BUILD] Add instruction to use change-ve...

2014-11-19 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3361#issuecomment-63766299 +1

[GitHub] spark pull request: Corrected build instructions for scala 2.11 in...

2014-11-18 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3339#discussion_r20490017 --- Diff: docs/building-spark.md --- @@ -113,7 +113,17 @@ mvn -Pyarn -Phive -Phive-thriftserver-0.12.0 -Phadoop-2.4 -Dhadoop.version=2.4.0 {% endhighlight

[GitHub] spark pull request: Corrected build instructions for scala 2.11 in...

2014-11-18 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3339#discussion_r20490313 --- Diff: docs/building-spark.md --- @@ -113,7 +113,17 @@ mvn -Pyarn -Phive -Phive-thriftserver-0.12.0 -Phadoop-2.4 -Dhadoop.version=2.4.0 {% endhighlight

[GitHub] spark pull request: Corrected build instructions for scala 2.11 in...

2014-11-18 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3339#discussion_r20490479 --- Diff: docs/building-spark.md --- @@ -113,7 +113,17 @@ mvn -Pyarn -Phive -Phive-thriftserver-0.12.0 -Phadoop-2.4 -Dhadoop.version=2.4.0 {% endhighlight

[GitHub] spark pull request: [SPARK-4472][Shell] Print Spark context avail...

2014-11-18 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3341#discussion_r20523279 --- Diff: repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala --- @@ -61,11 +61,14 @@ class SparkILoop(in0: Option[BufferedReader], protected

[GitHub] spark pull request: [SPARK-4429][BUILD] Build for Scala 2.11 using...

2014-11-18 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3342#discussion_r20523465 --- Diff: project/SparkBuild.scala --- @@ -101,14 +101,10 @@ object SparkBuild extends PomBuild { v.split((\\s+|,)).filterNot(_.isEmpty).map

[GitHub] spark pull request: [SPARK-4429][BUILD] Build for Scala 2.11 using...

2014-11-18 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3342#discussion_r20523647 --- Diff: project/SparkBuild.scala --- @@ -101,14 +101,10 @@ object SparkBuild extends PomBuild { v.split((\\s+|,)).filterNot(_.isEmpty).map

[GitHub] spark pull request: [SPARK-4470] Validate number of threads in loc...

2014-11-18 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3337#discussion_r20523815 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -1805,6 +1805,9 @@ object SparkContext extends Logging { def localCpuCount
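A rough sketch of the kind of validation being reviewed, assuming a local[N] master string; the names and regex are illustrative, not the actual SparkContext code:

```scala
// Reject non-positive thread counts for local[N] masters instead of silently
// creating a scheduler with no threads. localCpuCount stands in for
// Runtime.getRuntime.availableProcessors().
val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r

def numLocalThreads(master: String, localCpuCount: Int): Int = master match {
  case "local" => 1
  case LOCAL_N_REGEX(threads) =>
    val n = if (threads == "*") localCpuCount else threads.toInt
    require(n > 0, s"Asked to run locally with $n threads")
    n
  case other => throw new IllegalArgumentException(s"Unrecognized local master: $other")
}
```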

[GitHub] spark pull request: SPARK-4457. Document how to build for Hadoop v...

2014-11-18 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3322#issuecomment-63551793 Sounds good. Updated the patch.

[GitHub] spark pull request: [SPARK-4472][Shell] Print Spark context avail...

2014-11-18 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3341#discussion_r20557393 --- Diff: repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala --- @@ -61,11 +61,14 @@ class SparkILoop(in0: Option[BufferedReader], protected

[GitHub] spark pull request: fix elements read count for ExternalSorter

2014-11-17 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3302#issuecomment-63274293 As far as I can tell, `elementsRead` isn't used for anything. Would we be able to just remove it entirely?

[GitHub] spark pull request: fix elements read count for ExternalSorter

2014-11-17 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3302#issuecomment-63274353 Also, mind filing a JIRA for this, or, if one already exists, including the name in the title here?

[GitHub] spark pull request: SPARK-4338. Ditch yarn-alpha.

2014-11-17 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3215#issuecomment-63275953 The patch does remove the yarn/stable directory. Updated patch includes the doc fix. Currently testing it on a pseudo-distributed cluster.

[GitHub] spark pull request: SPARK-4338. Ditch yarn-alpha.

2014-11-17 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3215#issuecomment-63339779 Successfully ran spark-shell in yarn-client mode and an app in yarn-cluster mode.

[GitHub] spark pull request: SPARK-4445, Don't display storage level in toD...

2014-11-17 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3310#discussion_r20450826 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1309,7 +1309,7 @@ abstract class RDD[T: ClassTag]( def debugSelf (rdd: RDD

[GitHub] spark pull request: SPARK-4338. Ditch yarn-alpha.

2014-11-17 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3215#discussion_r20454469 --- Diff: docs/building-spark.md --- @@ -95,8 +74,11 @@ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package # Apache Hadoop 2.4.X

[GitHub] spark pull request: SPARK-4445, Don't display storage level in toD...

2014-11-17 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3310#discussion_r20454820 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1309,7 +1309,7 @@ abstract class RDD[T: ClassTag]( def debugSelf (rdd: RDD

[GitHub] spark pull request: [SPARK-4452] fix elements read count for Extre...

2014-11-17 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3302#issuecomment-63355366 That makes sense; my IDE for some reason didn't show me the usage in Spillable.scala. In that case, this change looks reasonable. Spilling is also based on the amount

[GitHub] spark pull request: [SPARK-4452] fix elements read count for Extre...

2014-11-17 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3302#issuecomment-63362140 I don't entirely understand that line of argument. Why would we want to place a lower bound if the data structure is pushing the memory threshold? I filed https

[GitHub] spark pull request: SPARK-4457. Document how to build for Hadoop v...

2014-11-17 Thread sryza
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/3322 SPARK-4457. Document how to build for Hadoop versions greater than 2.4 You can merge this pull request into a Git repository by running: $ git pull https://github.com/sryza/spark sandy-spark

[GitHub] spark pull request: SPARK-4338. Ditch yarn-alpha.

2014-11-17 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3215#discussion_r20459377 --- Diff: docs/building-spark.md --- @@ -95,8 +74,11 @@ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package # Apache Hadoop 2.4.X

[GitHub] spark pull request: SPARK-4445, Don't display storage level in toD...

2014-11-17 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3310#issuecomment-63364931 Not a big deal at all, but was there some reason my other nit didn't apply? One other thing: debugSelf returns s"$rdd [$persistence]". Unless this is some Scala
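A tiny illustration of the interpolation point, assuming debugSelf builds its string roughly as quoted:

```scala
// s"$rdd [$persistence]" always emits the brackets, so when the storage-level
// string is empty the output keeps a dangling " []".
val rdd = "MappedRDD[2] at map"
val persistence = ""                  // e.g. nothing to show for an uncached RDD
println(s"$rdd [$persistence]")       // prints: MappedRDD[2] at map []
```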

[GitHub] spark pull request: SPARK-4457. Document how to build for Hadoop v...

2014-11-17 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3322#issuecomment-63384470 Updated the patch to warn against Hadoop versions greater than 2.5.

[GitHub] spark pull request: [SPARK-4459] Change groupBy type parameter fro...

2014-11-17 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3327#issuecomment-63392980 `partitionBy` as well?

[GitHub] spark pull request: [SPARK-4452] fix elements read count for Extre...

2014-11-17 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3302#issuecomment-63399594 after a while we unconditionally try to spill every 32 elements regardless of whether the in-memory buffer has exceeded the spill threshold. The code still
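For readers without the code open, a condensed sketch of the spill-check pattern being debated; the names and starting threshold are assumptions, not a copy of Spillable:

```scala
// The size check only runs every 32nd record read since the last spill, and both
// the 32-element stride and the memory estimate must agree before spilling.
// Keeping elementsRead accurate (and resetting it after a spill) is what this
// thread is about.
class SpillCheckSketch(spill: () => Unit) {
  private var elementsRead = 0L
  private val memoryThreshold = 5L * 1024 * 1024   // assumed starting threshold

  def addElement(currentMemoryEstimate: Long): Unit = {
    elementsRead += 1
    if (elementsRead % 32 == 0 && currentMemoryEstimate >= memoryThreshold) {
      spill()
      elementsRead = 0
    }
  }
}
```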

[GitHub] spark pull request: SPARK-4375. no longer require -Pscala-2.10

2014-11-14 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3239#discussion_r20347403 --- Diff: docs/building-spark.md --- @@ -113,9 +113,9 @@ mvn -Pyarn -Phive -Phive-thriftserver-0.12.0 -Phadoop-2.4 -Dhadoop.version=2.4.0 {% endhighlight

[GitHub] spark pull request: SPARK-4375. no longer require -Pscala-2.10

2014-11-14 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3239#discussion_r20348576 --- Diff: docs/building-spark.md --- @@ -113,9 +113,9 @@ mvn -Pyarn -Phive -Phive-thriftserver-0.12.0 -Phadoop-2.4 -Dhadoop.version=2.4.0 {% endhighlight

[GitHub] spark pull request: SPARK-4214. With dynamic allocation, avoid out...

2014-11-14 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3204#discussion_r20390334 --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala --- @@ -360,6 +382,10 @@ private[spark] class ExecutorAllocationManager(sc

[GitHub] spark pull request: SPARK-4214. With dynamic allocation, avoid out...

2014-11-14 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3204#issuecomment-63138957 Cool, just added that as well

[GitHub] spark pull request: SPARK-3642. Document the nuances of shared var...

2014-11-14 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/2490#issuecomment-63148833 Thanks for the review @Ishiihara. Updated the PR to clarify these points.

[GitHub] spark pull request: SPARK-4375. maven rejiggering

2014-11-13 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3239#issuecomment-62869875 I had some additional conversation with @pwendell and we agreed that SPARK-4376 (putting external modules behind maven profiles) is worthwhile, so this PR implements both

[GitHub] spark pull request: SPARK-4375. maven rejiggering

2014-11-13 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3239#issuecomment-62871767 I think Sean is right. Have half a fix for this but am going to go to bed now.

[GitHub] spark pull request: SPARK-4375. maven rejiggering

2014-11-13 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3239#issuecomment-62934308 @ScrapCodes @pwendell I just tried mvn package -Pscala-2.11 without my patch and still got errors: The following artifacts could not be resolved

[GitHub] spark pull request: SPARK-4375. no longer require -Pscala-2.10 and...

2014-11-13 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3239#discussion_r20329299 --- Diff: sql/catalyst/pom.xml --- @@ -60,6 +60,11 @@ <artifactId>scalacheck_${scala.binary.version}</artifactId> <scope>test</scope>

[GitHub] spark pull request: SPARK-4375. no longer require -Pscala-2.10 and...

2014-11-13 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3239#issuecomment-62990268 Here's a patch with a simpler approach that relies on @vanzin 's suggestion of a -Dscala-2.11 property. I still like the idea of putting the external projects

[GitHub] spark pull request: SPARK-4214. With dynamic allocation, avoid out...

2014-11-13 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3204#issuecomment-63005126 Updated patch addresses review comments

[GitHub] spark pull request: SPARK-4375. no longer require -Pscala-2.10

2014-11-13 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3239#issuecomment-63020796 Updated the doc - it seems like there's actually not a ton more to say, but let me know if I missed anything.

[GitHub] spark pull request: SPARK-4338. Ditch yarn-alpha.

2014-11-12 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3215#issuecomment-62763548 I have a whole set of simplifying changes that I want to go in (e.g. YARN-1714), but thought it would probably be good to break things up a bit for easier review

[GitHub] spark pull request: SPARK-4338. Ditch yarn-alpha.

2014-11-12 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3215#issuecomment-62792878 @tgravescs

[GitHub] spark pull request: SPARK-4338. Ditch yarn-alpha.

2014-11-12 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3215#discussion_r20251737 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala --- @@ -178,21 +178,25 @@ private[spark] class ClientArguments(args: Array

[GitHub] spark pull request: [SPARK-4092] [CORE] Fix InputMetrics for coale...

2014-11-12 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3120#discussion_r20262822 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -252,7 +258,7 @@ class HadoopRDD[K, V]( bytesReadCallback.isDefined

[GitHub] spark pull request: [SPARK-4092] [CORE] Fix InputMetrics for coale...

2014-11-12 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3120#discussion_r20263076 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -252,7 +258,7 @@ class HadoopRDD[K, V]( bytesReadCallback.isDefined

[GitHub] spark pull request: SPARK-4338. Ditch yarn-alpha.

2014-11-12 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3215#issuecomment-62822018 @tgravescs it would delay other PRs, but not a huge deal if you think it's too soon.

[GitHub] spark pull request: [HOTFIX]: Fix maven build missing some class

2014-11-12 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3228#issuecomment-62850540 That still requires the user to set scala.version, right?

[GitHub] spark pull request: SPARK-4375. maven rejiggering

2014-11-12 Thread sryza
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/3239 SPARK-4375. maven rejiggering It seems like the winds might have moved away from this approach, but wanted to post the PR anyway because I got it working and to show what it would look like. You

[GitHub] spark pull request: SPARK-4375. maven rejiggering

2014-11-12 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3239#issuecomment-62852398 Right, this was before we came to that decision. Will update this to just do Kafka.

[GitHub] spark pull request: SPARK-4214. With dynamic allocation, avoid out...

2014-11-11 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3204#discussion_r20137693 --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala --- @@ -217,14 +223,24 @@ private[spark] class ExecutorAllocationManager(sc

[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

2014-11-11 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3000#issuecomment-62588894 Just noticed this. I'd been working on something similar a little while ago on SPARK-1216 / #304. One difference is that I had aimed to accept categorical features

[GitHub] spark pull request: SPARK-4214. With dynamic allocation, avoid out...

2014-11-11 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3204#discussion_r20172344 --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala --- @@ -110,6 +110,12 @@ private[spark] class ExecutorAllocationManager(sc

[GitHub] spark pull request: SPARK-4338. Ditch yarn-alpha.

2014-11-11 Thread sryza
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/3215 SPARK-4338. Ditch yarn-alpha. Sorry if this is a little premature with 1.2 still not out the door, but it will make other work like SPARK-4136 and SPARK-2089 a lot easier. You can merge this pull

[GitHub] spark pull request: SPARK-3837. Warn when YARN kills containers fo...

2014-11-11 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/2744#issuecomment-62662666 @jdanbrown that seems reasonable. Mind filing a JIRA for it?

[GitHub] spark pull request: SPARK-3461. Support external groupByKey using ...

2014-11-10 Thread sryza
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/3198 SPARK-3461. Support external groupByKey using repartitionAndSortWithinPartitions. This is a WIP. It still needs tests and probably a better name for the transformation, but I wanted
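The shape of the transformation can be sketched against the existing public API. This is a hypothetical helper, not the PR's code, and for simplicity it still buffers each single group's values:

```scala
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // OrderedRDDFunctions implicits
import org.apache.spark.rdd.RDD

// Shuffle with a key ordering so each partition arrives sorted by key, then walk
// every partition once, emitting one (key, values) group at a time. Because the
// sort can spill to disk, all groups never have to fit in memory at once.
def externalGroupByKey[K : Ordering : ClassTag, V : ClassTag](
    rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, Seq[V])] = {
  rdd.repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
    .mapPartitions { iter =>
      val buffered = iter.buffered
      new Iterator[(K, Seq[V])] {
        def hasNext: Boolean = buffered.hasNext
        def next(): (K, Seq[V]) = {
          val key = buffered.head._1
          val values = ArrayBuffer[V]()   // one group's values; the PR aims for a one-pass iterator here
          while (buffered.hasNext && buffered.head._1 == key) {
            values += buffered.next()._2
          }
          (key, values)
        }
      }
    }
}
```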

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-11-10 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1977#discussion_r20132241 --- Diff: python/pyspark/shuffle.py --- @@ -520,6 +505,295 @@ def sorted(self, iterator, key=None, reverse=False): return heapq.merge(chunks, key

[GitHub] spark pull request: SPARK-3461. Support external groupByKey using ...

2014-11-10 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3198#issuecomment-62502932 Will take a look at #1977. I believe that the most common uses for groupByKey, like writing out partitioned tables, involve iterating over each group a single time

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-11-10 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1977#discussion_r20132424 --- Diff: python/pyspark/rdd.py --- @@ -1579,21 +1577,34 @@ def createZero(): return self.combineByKey(lambda v: func(createZero(), v), func

[GitHub] spark pull request: SPARK-3461. Support external groupByKey using ...

2014-11-10 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3198#issuecomment-62508663 All good points. Will close this for now. Longer term, it worries me that Spark wouldn't be able to provide an operator that gives comparable performance to what

[GitHub] spark pull request: SPARK-3461. Support external groupByKey using ...

2014-11-10 Thread sryza
Github user sryza closed the pull request at: https://github.com/apache/spark/pull/3198

[GitHub] spark pull request: SPARK-4214. With dynamic allocation, avoid out...

2014-11-10 Thread sryza
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/3204 SPARK-4214. With dynamic allocation, avoid outstanding requests for more executors than pending tasks need. WIP. Still need to add and fix tests. You can merge this pull request
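The core arithmetic of the cap can be summarized in a few lines; the names are illustrative rather than the actual ExecutorAllocationManager fields:

```scala
// Cap the target number of executors at what the current task backlog can use:
// ceil((pending + running tasks) / tasks per executor). Requests beyond that cap
// would sit idle, which is the waste this PR avoids.
def maxExecutorsNeeded(pendingTasks: Int, runningTasks: Int, tasksPerExecutor: Int): Int = {
  require(tasksPerExecutor > 0, "tasksPerExecutor must be positive")
  val totalTasks = pendingTasks + runningTasks
  (totalTasks + tasksPerExecutor - 1) / tasksPerExecutor   // ceiling division
}

// e.g. 7 pending + 2 running tasks, 4 task slots per executor
// => at most 3 executors are worth requesting.
println(maxExecutorsNeeded(pendingTasks = 7, runningTasks = 2, tasksPerExecutor = 4))  // 3
```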

[GitHub] spark pull request: SPARK-4230. Doc for spark.default.parallelism ...

2014-11-09 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3107#discussion_r20060629 --- Diff: docs/configuration.md --- @@ -556,6 +556,9 @@ Apart from these, the following properties are also available, and may be useful tr

[GitHub] spark pull request: SPARK-4230. Doc for spark.default.parallelism ...

2014-11-09 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3107#discussion_r20060674 --- Diff: docs/configuration.md --- @@ -563,8 +566,8 @@ Apart from these, the following properties are also available, and may be useful /ul

[GitHub] spark pull request: SPARK-4230. Doc for spark.default.parallelism ...

2014-11-09 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3107#issuecomment-62339705 Test failure looks unrelated

[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

2014-11-08 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/2968#discussion_r20055135 --- Diff: core/src/test/scala/org/apache/spark/metrics/InputOutputMetricsSuite.scala --- @@ -73,4 +78,32 @@ class InputMetricsSuite extends FunSuite

[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

2014-11-06 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/2968#discussion_r19931259 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -249,7 +248,7 @@ class HadoopRDD[K, V]( bytesReadCallback.isDefined

[GitHub] spark pull request: SPARK-3179. Add task OutputMetrics.

2014-11-06 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/2968#issuecomment-61941116 Thanks for taking a look @kayousterhout. I'll add in an output type.

[GitHub] spark pull request: SPARK-1714. Take advantage of AMRMClient APIs ...

2014-11-06 Thread sryza
Github user sryza closed the pull request at: https://github.com/apache/spark/pull/655

[GitHub] spark pull request: SPARK-1216. Add a OneHotEncoder for handling c...

2014-11-06 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/304#issuecomment-62084497 Definitely. Have been waiting for the Pipelines and Parameters PR to go in.

[GitHub] spark pull request: [SPARK-4291][Build] Rename network module proj...

2014-11-06 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3148#issuecomment-62097234 +1. Was about to suggest such a change.

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61853890 If maxIter is a constant, would it be clearer to use MAX_ITER?

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61858097 Both the reference and the class internals are immutable, no? Typical Java conventions would put such a variable in all caps, though maybe in Scala it's different

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19897343 --- Diff: mllib/src/main/scala/org/apache/spark/ml/parameters.scala --- @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] spark pull request: [SPARK-4092] [CORE] Fix InputMetrics for coale...

2014-11-05 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3120#discussion_r19922230 --- Diff: core/src/test/scala/org/apache/spark/metrics/InputMetricsSuite.scala --- @@ -17,8 +17,11 @@ package org.apache.spark.metrics

[GitHub] spark pull request: [SPARK-4092] [CORE] Fix InputMetrics for coale...

2014-11-05 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3120#discussion_r19922310 --- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala --- @@ -44,7 +44,14 @@ private[spark] class CacheManager(blockManager: BlockManager) extends

[GitHub] spark pull request: [SPARK-4092] [CORE] Fix InputMetrics for coale...

2014-11-05 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3120#discussion_r19922345 --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala --- @@ -173,12 +179,12 @@ class NewHadoopRDD[K, V]( // Update metrics

[GitHub] spark pull request: [SPARK-4092] [CORE] Fix InputMetrics for coale...

2014-11-05 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3120#issuecomment-61917343 Had a few nitpicks. Otherwise, this looks good to me.

[GitHub] spark pull request: [WIP][SPARK-3797] Run external shuffle service...

2014-11-04 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3082#discussion_r19834936 --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala --- @@ -124,6 +125,22 @@ private[spark] class ExecutorAllocationManager(sc

[GitHub] spark pull request: [WIP][SPARK-3797] Run external shuffle service...

2014-11-04 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3082#discussion_r19836489 --- Diff: network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java --- @@ -0,0 +1,112 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [WIP][SPARK-3797] Run external shuffle service...

2014-11-04 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3082#discussion_r19836790 --- Diff: network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java --- @@ -0,0 +1,112 @@ +/* + * Licensed to the Apache
