[GitHub] spark pull request: SPARK-1338 [scalastyle] Ensure or disallow spa...
Github user ScrapCodes closed the pull request at: https://github.com/apache/spark/pull/284 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Set configuration spark.history.retainedAppli...
GitHub user XuTingjun opened a pull request:

    https://github.com/apache/spark/pull/1509

Set configuration spark.history.retainedApplications be effective

When setting spark.history.retainedApplications=1, the history server web UI retains more than one application.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/XuTingjun/spark bug-fix1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1509.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1509

commit f4f1a41039bb8313e770394bcf9ddf8a3c0a4d66
Author: XuTingjun 1039320...@qq.com
Date: 2014-07-21T08:16:33Z

    Update FsHistoryProvider.scala
[GitHub] spark pull request: SPARK-2497 Included checks for module symbols ...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1463#issuecomment-49581362

QA tests have started for PR 1463. This patch merges cleanly.

View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16910/consoleFull
[GitHub] spark pull request: Set configuration spark.history.retainedAppli...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1509#issuecomment-49581609 Can one of the admins verify this patch?
[GitHub] spark pull request: SPARK-2565. Update ShuffleReadMetrics as block...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1507#issuecomment-49581839

QA results for PR 1507:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16901/consoleFull
[GitHub] spark pull request: [SPARK-2103][Streaming] Change to ClassTag for...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1508#issuecomment-49581911

QA results for PR 1508:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16900/consoleFull
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1497#discussion_r15158553

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -152,6 +155,34 @@ class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Bool
     }
 
     /**
+     * This rule finds expressions in HAVING clause filters that depend on
+     * unresolved attributes. It pushes these expressions down to the underlying
+     * aggregates and then projects them away above the filter.
+     */
+    object UnresolvedHavingClauseAttributes extends Rule[LogicalPlan] {
+      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+        case pl @ Filter(fexp, agg @ Aggregate(_, ae, _)) if !fexp.childrenResolved => {
+          val alias = Alias(fexp, makeTmp())()
+          val aggExprs = Seq(alias) ++ ae
+
+          val newCond = EqualTo(Cast(alias.toAttribute, BooleanType), Literal(true, BooleanType))
+
+          val newFilter = ResolveReferences(pl.copy(condition = newCond,
+            child = agg.copy(aggregateExpressions = aggExprs)))
+
+          Project(pl.output, newFilter)
+        }
+      }
+
+      private val curId = new java.util.concurrent.atomic.AtomicLong()
+
+      private def makeTmp() = {
+        val id = curId.getAndIncrement()
+        s"tmp_cond_$id"
--- End diff --

[Some more details about NamedExpressions](http://people.apache.org/~marmbrus/docs/catalyst/#org.apache.spark.sql.catalyst.expressions.package)
[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49582954

QA results for PR 1502:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  * This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
  class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
  class Sorter[K, Buffer] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16902/consoleFull
[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49584160

QA results for PR 1502:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  * This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
  class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
  class Sorter[K, Buffer] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16903/consoleFull
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-49584362 Hi @andrewor14 , that's ok. Thanks.
[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1371#issuecomment-49585327

QA results for PR 1371:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16906/consoleFull
[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49585784

QA results for PR 1502:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  * This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
  class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
  class Sorter[K, Buffer] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16905/consoleFull
[GitHub] spark pull request: [SPARK-2549] Functions defined inside of other...
GitHub user ScrapCodes opened a pull request:

    https://github.com/apache/spark/pull/1510

[SPARK-2549] Functions defined inside of other functions trigger failures

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ScrapCodes/spark-1 SPARK-2549/fun-in-fun

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1510.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1510

commit f24adc2b82ad77f64432c71350b88e8d34ffc8dc
Author: Prashant Sharma prashan...@imaginea.com
Date: 2014-07-21T09:56:56Z

    SPARK-2549 Functions defined inside of other functions trigger failures
[GitHub] spark pull request: [SPARK-2549] Functions defined inside of other...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1510#issuecomment-49589031

QA tests have started for PR 1510. This patch merges cleanly.

View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16911/consoleFull
[GitHub] spark pull request: SPARK-2497 Included checks for module symbols ...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1463#issuecomment-49589063

QA results for PR 1463:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16910/consoleFull
[GitHub] spark pull request: [SPARK-2103][Streaming] Change to ClassTag for...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1508#issuecomment-49590837 It's the MIMA test that fails, since the method signature is changed. It's possible to keep and deprecate the existing method of course. Should we just do that, or OK to remove the method on the grounds that the API doesn't quite work?
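The "keep and deprecate" option srowen mentions can be sketched as a deprecated forwarder: the old signature stays in the binary (so the MIMA compatibility check passes) and delegates to the new one. This is a minimal illustration with made-up names, not the actual Streaming API in question:

```scala
object CompatSketch {
  // Old signature, kept only for binary compatibility; forwards to the new one.
  @deprecated("Use transform(f: Seq[Int] => Seq[Int]) instead", "1.1.0")
  def transform(f: Int => Int): Seq[Int] =
    transform((xs: Seq[Int]) => xs.map(f))

  // New signature with the corrected type.
  def transform(f: Seq[Int] => Seq[Int]): Seq[Int] = f(Seq(1, 2, 3))
}
```

Callers of the old overload keep compiling (with a deprecation warning) until the method is removed in a major release.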
[GitHub] spark pull request: [SPARK-2549] Functions defined inside of other...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1510#issuecomment-49595816

QA results for PR 1510:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16911/consoleFull
[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...
Github user jayunit100 commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-49602573 Adding the drop function to a contrib library of functions (which requires manual import), as erik suggests, seems like a really good option. I could see such a contrib library also being useful for other esoteric but nevertheless important tasks, like dealing with binary data formats, etc.
[GitHub] spark pull request: [YARN] SPARK-2577: File upload to viewfs is br...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/1483#issuecomment-49605112 From my understanding, when using viewfs addDelegationTokens is supposed to get tokens for all the underlying filesystems, so it should already have a token for it. @gerashegalov did you test this on a secure cluster?
[GitHub] spark pull request: [YARN]In some cases, pages display incorrect i...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/1501#issuecomment-49605790 @witgo, nice catch. Can you please file a JIRA and put in the details of the issue? This looks good.
[GitHub] spark pull request: Fix NPE for JsonProtocol
GitHub user witgo opened a pull request:

    https://github.com/apache/spark/pull/1511

Fix NPE for JsonProtocol

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/witgo/spark JsonProtocol

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1511.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1511

commit b187cfe8dd1a7ce047f3a2b0e5f43da0630eec83
Author: GuoQiang Li wi...@qq.com
Date: 2014-07-21T14:41:32Z

    Fix NPE for JsonProtocol
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...
Github user willb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1497#discussion_r15172123

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -152,6 +155,34 @@ class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Bool
     }
 
     /**
+     * This rule finds expressions in HAVING clause filters that depend on
+     * unresolved attributes. It pushes these expressions down to the underlying
+     * aggregates and then projects them away above the filter.
+     */
+    object UnresolvedHavingClauseAttributes extends Rule[LogicalPlan] {
+      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+        case pl @ Filter(fexp, agg @ Aggregate(_, ae, _)) if !fexp.childrenResolved => {
+          val alias = Alias(fexp, makeTmp())()
+          val aggExprs = Seq(alias) ++ ae
+
+          val newCond = EqualTo(Cast(alias.toAttribute, BooleanType), Literal(true, BooleanType))
+
+          val newFilter = ResolveReferences(pl.copy(condition = newCond,
+            child = agg.copy(aggregateExpressions = aggExprs)))
+
+          Project(pl.output, newFilter)
+        }
+      }
+
+      private val curId = new java.util.concurrent.atomic.AtomicLong()
+
+      private def makeTmp() = {
+        val id = curId.getAndIncrement()
+        s"tmp_cond_$id"
--- End diff --

@marmbrus Thanks!
[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...
GitHub user tgravescs opened a pull request:

    https://github.com/apache/spark/pull/1512

SPARK-1680: use configs for specifying environment variables on YARN

Note that this also documents spark.executorEnv.*, which to me means it's public. If we don't want that, please speak up.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tgravescs/spark SPARK-1680

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1512.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1512

commit 0b6c045fde1410ef626dc6beff39795467e90079
Author: Thomas Graves tgra...@apache.org
Date: 2014-07-21T14:44:21Z

    use configs for specifying environment variables on YARN
[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1512#issuecomment-49614966

QA tests have started for PR 1512. This patch merges cleanly.

View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16912/consoleFull
[GitHub] spark pull request: Fix NPE for JsonProtocol
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1511#issuecomment-49614971

QA tests have started for PR 1511. This patch merges cleanly.

View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16913/consoleFull
[GitHub] spark pull request: [SPARK-2555] Support configuration spark.sched...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/1462#issuecomment-49616178 cc @kayousterhout as I think she is more familiar with standalone mode and scheduler details.
[GitHub] spark pull request: SPARK-1707. Remove unnecessary 3 second sleep ...
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/634#discussion_r15173511

--- Diff: yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -30,6 +30,11 @@ private[spark] class YarnClientSchedulerBackend(
   extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem)
   with Logging {
 
+  if (conf.getOption("spark.scheduler.minRegisteredExecutorsRatio").isEmpty) {
+    minRegisteredRatio = 0.8
--- End diff --

No real reason other than it might take longer to get 100%. It's just kind of a number we chose to hopefully give the user a good experience without having to wait too long if the cluster is really busy. The user can change it if they want.
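The defaulting behavior discussed in this review thread can be sketched as follows. This is an illustrative standalone function (not the actual YarnClientSchedulerBackend code): 0.8 applies only when the user has not set the ratio, and an explicit setting always wins:

```scala
// Resolve the minimum registered-executors ratio from a config map,
// falling back to the YARN-client default of 0.8 when the key is absent.
def resolveMinRegisteredRatio(conf: Map[String, String]): Double =
  conf.get("spark.scheduler.minRegisteredExecutorsRatio")
    .map(_.toDouble)
    .getOrElse(0.8) // chosen default; user-provided values take precedence
```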
[GitHub] spark pull request: SPARK-1707. Remove unnecessary 3 second sleep ...
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/634#discussion_r15173596

--- Diff: yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientClusterScheduler.scala ---
@@ -37,14 +37,4 @@ private[spark] class YarnClientClusterScheduler(sc: SparkContext, conf: Configur
     val retval = YarnAllocationHandler.lookupRack(conf, host)
     if (retval != null) Some(retval) else None
   }
-
-  override def postStartHook() {
-
-    super.postStartHook()
--- End diff --

YarnClientClusterScheduler extends TaskSchedulerImpl, so it will just fall through to TaskSchedulerImpl.postStartHook(), which calls waitBackendReady.
[GitHub] spark pull request: [Spark 2557] fix LOCAL_N_REGEX in createTaskSc...
Github user advancedxy commented on the pull request: https://github.com/apache/spark/pull/1464#issuecomment-49617743 Well, could someone review this pr? Or should I close this pr?
[GitHub] spark pull request: SPARK-2150: Provide direct link to finished ap...
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1094#discussion_r15177387

--- Diff: yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala ---
@@ -289,7 +289,7 @@ class ExecutorLauncher(args: ApplicationMasterArguments, conf: Configuration, sp
       .asInstanceOf[FinishApplicationMasterRequest]
     finishReq.setAppAttemptId(appAttemptId)
     finishReq.setFinishApplicationStatus(status)
-    finishReq.setTrackingUrl(sparkConf.get("spark.yarn.historyServer.address", ""))
+    finishReq.setTrackingUrl(sparkConf.get("spark.driver.appUIHistoryAddress", ""))
--- End diff --

typo, need the other in the default case.
[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...
GitHub user scwf opened a pull request:

    https://github.com/apache/spark/pull/1513

[SPARK-2608] fix executor backend launch command over mesos mode

The mesos scheduler backend uses spark-class/spark-executor to launch the executor backend. This leads to problems:

1. When spark.executor.extraJavaOptions is set, CoarseMesosSchedulerBackend throws errors because of the launch command:
   "./bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend %s %s %s %s %d".format(basename, extraOpts, driverUrl, offer.getSlaveId.getValue, offer.getHostname, numCores)
2. spark.executor.extraJavaOptions and spark.executor.extraLibraryPath set in SparkConf will not take effect.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/scwf/spark mesosfix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1513.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1513

commit 5b150a481a1ddb3dee4e94e14e737f2d20313f5c
Author: scwf wangfei1.huawei.com
Date: 2014-07-21T15:23:59Z

    fix executor backend launch commond over mesos mode

commit fdc3cb1efa931780f652dcb4d2cda0304ca50710
Author: scwf wangfei1.huawei.com
Date: 2014-07-21T15:32:09Z

    fix code format
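The failure mode described above can be illustrated in isolation: interpolating extraJavaOptions into one flat command string loses the boundaries between flags, whereas building an argument vector keeps each flag a separate argument. This is a hypothetical sketch (the function and its shape are illustrative, not the actual CoarseMesosSchedulerBackend code):

```scala
// Build the executor launch arguments as a vector, splitting the
// user-supplied extraJavaOptions on whitespace so each flag stays
// its own argument instead of being fused into one string.
def buildLaunchArgs(extraJavaOpts: String, driverUrl: String): Seq[String] =
  Seq("./bin/spark-class", "org.apache.spark.executor.CoarseGrainedExecutorBackend") ++
    extraJavaOpts.split("\\s+").filter(_.nonEmpty) ++
    Seq(driverUrl)
```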
[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1513#issuecomment-49627746 Can one of the admins verify this patch?
[GitHub] spark pull request: Fix NPE for JsonProtocol
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1511#issuecomment-49629586

QA results for PR 1511:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16913/consoleFull
[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1512#issuecomment-49629675

QA results for PR 1512:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16912/consoleFull
[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1371#issuecomment-49629961 Jenkins, test this please
[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1371#issuecomment-49630747

QA tests have started for PR 1371. This patch merges cleanly.

View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16914/consoleFull
[GitHub] spark pull request: Fix flakey HiveQuerySuite test
GitHub user aarondav opened a pull request:

    https://github.com/apache/spark/pull/1514

Fix flakey HiveQuerySuite test

Result may not be returned in the expected order, so relax that constraint.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark flakey

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1514.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1514

commit e5af823eef0fb42c5acbd071ddb9453534a3fd0c
Author: Aaron Davidson aa...@databricks.com
Date: 2014-07-21T16:58:36Z

    Fix flakey HiveQuerySuite test

    Result may not be returned in the expected order, so relax that constraint.
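The relaxation this PR describes can be sketched as an order-insensitive comparison: without an ORDER BY, a query's row order is not guaranteed, so a test should compare results as sorted sequences rather than positionally. A minimal standalone illustration (not the actual HiveQuerySuite helper):

```scala
// Compare two result sets while ignoring row order: sorting both sides
// makes the check stable across nondeterministic execution orders.
def sameRowsIgnoringOrder(actual: Seq[String], expected: Seq[String]): Boolean =
  actual.sorted == expected.sorted
```

Sorting (rather than converting to a Set) also preserves duplicate rows, so a result with a missing or extra duplicate still fails the check.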
[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1502#issuecomment-49634069 (This PR passed Jenkins 3 times and then failed inside HiveContext -- it's probably OK. I submitted https://github.com/apache/spark/pull/1514 to fix the flakey test.)
[GitHub] spark pull request: Fix flakey HiveQuerySuite test
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1514#issuecomment-49634232 QA tests have started for PR 1514. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16915/consoleFull
[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/1512#issuecomment-49634254 I need to update so the documentation shows up properly
[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1512#issuecomment-49635472 QA tests have started for PR 1512. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16916/consoleFull
[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1512#issuecomment-49636071 QA tests have started for PR 1512. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16917/consoleFull
[GitHub] spark pull request: SPARK-1707. Remove unnecessary 3 second sleep ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/634#issuecomment-49643611 Looks good. +1, Thanks @sryza.
[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1371#issuecomment-49643677 QA results for PR 1371:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16914/consoleFull
[GitHub] spark pull request: [SPARK-2555] Support configuration spark.sched...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/1462#issuecomment-49643818 If we change the name of the config you'll need to upmerge as https://github.com/apache/spark/pull/634 set some defaults on the yarn side.
[GitHub] spark pull request: SPARK-1707. Remove unnecessary 3 second sleep ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/634
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
GitHub user brkyvz opened a pull request: https://github.com/apache/spark/pull/1515 [SPARK-2434][MLlib]: Warning messages that point users to original MLlib implementations added to Examples [SPARK-2434][MLlib]: Warning messages that refer users to the original MLlib implementations of some popular example machine learning algorithms added both in the comments and the code. The following examples have been modified: Scala: * LocalALS * LocalFileLR * LocalKMeans * LocalLP * SparkALS * SparkHdfsLR * SparkKMeans * SparkLR Python: * kmeans.py * als.py * logistic_regression.py You can merge this pull request into a Git repository by running: $ git pull https://github.com/brkyvz/spark SPARK-2434 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1515.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1515 commit 2cb53011985161307a1064982ae26b6fe2e69130 Author: Burak brk...@gmail.com Date: 2014-07-17T01:06:52Z SPARK-2434: Warning messages redirecting to original implementaions added. commit 17d3d836de4c3441d37afca14567cc31a90fd1cc Author: Burak brk...@gmail.com Date: 2014-07-17T17:02:32Z SPARK-2434: Added warning messages to the naive implementations of the example algorithms commit 4762f3930659a6303304021b6ede23b9489bb2f6 Author: Burak brk...@gmail.com Date: 2014-07-18T00:14:19Z [SPARK-2434]: Warning messages added commit b6c35b783c80efee9a8a9f4db64ad6a5210455d8 Author: brkyvz brk...@gmail.com Date: 2014-07-21T17:34:40Z [SPARK-2434]: Warning messages added commit 5f84e4be6c07230afd11d255ce88ca0affc0677f Author: brkyvz brk...@gmail.com Date: 2014-07-21T18:18:00Z [SPARK-2434]: Warning messages added
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1515#issuecomment-49645784 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1371#issuecomment-49646830 The JVM forks one Python daemon (daemon.py), then the daemon forks all the workers.
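That fork model can be sketched in a few lines of Python (a simplified, hypothetical illustration of the daemon.py idea, not the real worker protocol): the long-lived daemon forks a short-lived child per task, so each worker inherits the already-initialized interpreter cheaply.

```python
import os

def run_in_forked_worker(fn):
    # Simplified sketch (hypothetical): the daemon forks one worker per
    # task and reads the result back over a pipe, roughly mirroring how
    # daemon.py forks workers on behalf of the JVM.
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: the worker process
        os.close(read_fd)
        os.write(write_fd, fn().encode())
        os._exit(0)                   # workers exit after their task
    os.close(write_fd)                # parent: the daemon keeps running
    result = os.read(read_fd, 4096).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    return result

print(run_in_forked_worker(lambda: "hello from worker"))
```

Because the daemon is forked once by the JVM, anything hashed or cached in the daemon (such as the hash seed this PR is concerned with) is shared by every worker it forks.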
[GitHub] spark pull request: Fix flakey HiveQuerySuite test
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1514#issuecomment-49647027 QA results for PR 1514:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16915/consoleFull
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49647944 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16918/consoleFull
[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1371
[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1512#issuecomment-49649190 QA results for PR 1512:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16917/consoleFull
[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1371#issuecomment-49650014 Ah right, that makes sense. I've merged this in now.
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15188798 --- Diff: python/pyspark/rdd.py --- @@ -168,6 +169,18 @@ def _replaceRoot(self, value): self._sink(1) +def _parse_memory(s): --- End diff -- Add a comment to this saying it returns a number in MB
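For context, a helper like the `_parse_memory` mentioned above typically converts a JVM-style memory string into megabytes, which is why the reviewer asks for a comment stating the unit. A hedged sketch of what such a helper might look like (the suffix table is an assumption, not copied from Spark):

```python
def parse_memory(s):
    # Convert a JVM-style memory string such as "512m" or "2g" into MB.
    # Hypothetical sketch of what _parse_memory might do.
    units = {"k": 1.0 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}
    suffix = s[-1].lower()
    if suffix not in units:
        raise ValueError("invalid memory string: " + s)
    return int(float(s[:-1]) * units[suffix])

print(parse_memory("512m"))  # 512
print(parse_memory("2g"))    # 2048
```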
[GitHub] spark pull request: SPARK-2269 Refactor mesos scheduler resourceOf...
Github user tnachen commented on the pull request: https://github.com/apache/spark/pull/1487#issuecomment-49651731 @pwendell The console said the test failed, but in a very unrelated way. Is CI failing in general?
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-49655673 @akopich Thanks for working on PLSA! This is a big feature and it introduces many public traits/classes. Could you please summarize the public methods? Some of them may be unnecessary to expose to end users and we should hide them. The other issue is about the code style. Please follow the guide at https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide and update the PR, for example:
1. remove "created by ..." comments generated by IntelliJ
2. organize imports into groups: java, scala, 3rd party, and spark
3. doc for every trait/class. Some only contain doc for parameters but miss the summary.
4. indentation
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49658257 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16919/consoleFull
[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15192915 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection + +import java.io._ +import java.util.Comparator + +import scala.collection.mutable.ArrayBuffer +import scala.collection.mutable + +import com.google.common.io.ByteStreams + +import org.apache.spark.{Aggregator, SparkEnv, Logging, Partitioner} +import org.apache.spark.serializer.Serializer +import org.apache.spark.storage.BlockId + +/** + * Sorts and potentially merges a number of key-value pairs of type (K, V) to produce key-combiner + * pairs of type (K, C). Uses a Partitioner to first group the keys into partitions, and then + * optionally sorts keys within each partition using a custom Comparator. Can output a single + * partitioned file with a different byte range for each partition, suitable for shuffle fetches. + * + * If combining is disabled, the type C must equal V -- we'll cast the objects at the end.
+ * + * @param aggregator optional Aggregator with combine functions to use for merging data + * @param partitioner optional partitioner; if given, sort by partition ID and then key + * @param ordering optional ordering to sort keys within each partition + * @param serializer serializer to use when spilling to disk + */ +private[spark] class ExternalSorter[K, V, C]( +aggregator: Option[Aggregator[K, V, C]] = None, +partitioner: Option[Partitioner] = None, +ordering: Option[Ordering[K]] = None, +serializer: Option[Serializer] = None) extends Logging { + + private val numPartitions = partitioner.map(_.numPartitions).getOrElse(1) + private val shouldPartition = numPartitions > 1 + + private val blockManager = SparkEnv.get.blockManager + private val diskBlockManager = blockManager.diskBlockManager + private val ser = Serializer.getSerializer(serializer) + private val serInstance = ser.newInstance() + + private val conf = SparkEnv.get.conf + private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024 + private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 1) + + private def getPartition(key: K): Int = { +if (shouldPartition) partitioner.get.getPartition(key) else 0 + } + + // Data structures to store in-memory objects before we spill. Depending on whether we have an + // Aggregator set, we either put objects into an AppendOnlyMap where we combine them, or we + // store them in an array buffer. + var map = new SizeTrackingAppendOnlyMap[(Int, K), C] + var buffer = new SizeTrackingBuffer[((Int, K), C)] + + // Track how many elements we've read before we try to estimate memory. Ideally we'd use + // map.size or buffer.size for this, but because users' Aggregators can potentially increase + // the size of a merged element when we add values with the same key, it's safer to track + // elements read from the input iterator.
+ private var elementsRead = 0L + private val trackMemoryThreshold = 1000 + + // Spilling statistics + private var spillCount = 0 + private var _memoryBytesSpilled = 0L + private var _diskBytesSpilled = 0L + + // Collective memory threshold shared across all running tasks + private val maxMemoryThreshold = { +val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.3) +val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8) +(Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong + } + + // A comparator for keys K that orders them within a partition to allow partial aggregation. + // Can be a partial ordering by hash code if a total ordering is not provided through by the
[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15192894 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection + +import java.io._ +import java.util.Comparator + +import scala.collection.mutable.ArrayBuffer +import scala.collection.mutable + +import com.google.common.io.ByteStreams + +import org.apache.spark.{Aggregator, SparkEnv, Logging, Partitioner} +import org.apache.spark.serializer.Serializer +import org.apache.spark.storage.BlockId + +/** + * Sorts and potentially merges a number of key-value pairs of type (K, V) to produce key-combiner + * pairs of type (K, C). Uses a Partitioner to first group the keys into partitions, and then + * optionally sorts keys within each partition using a custom Comparator. Can output a single + * partitioned file with a different byte range for each partition, suitable for shuffle fetches. + * + * If combining is disabled, the type C must equal V -- we'll cast the objects at the end.
+ * + * @param aggregator optional Aggregator with combine functions to use for merging data + * @param partitioner optional partitioner; if given, sort by partition ID and then key + * @param ordering optional ordering to sort keys within each partition + * @param serializer serializer to use when spilling to disk + */ +private[spark] class ExternalSorter[K, V, C]( +aggregator: Option[Aggregator[K, V, C]] = None, +partitioner: Option[Partitioner] = None, +ordering: Option[Ordering[K]] = None, +serializer: Option[Serializer] = None) extends Logging { + + private val numPartitions = partitioner.map(_.numPartitions).getOrElse(1) + private val shouldPartition = numPartitions > 1 + + private val blockManager = SparkEnv.get.blockManager + private val diskBlockManager = blockManager.diskBlockManager + private val ser = Serializer.getSerializer(serializer) + private val serInstance = ser.newInstance() + + private val conf = SparkEnv.get.conf + private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024 + private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 1) + + private def getPartition(key: K): Int = { +if (shouldPartition) partitioner.get.getPartition(key) else 0 + } + + // Data structures to store in-memory objects before we spill. Depending on whether we have an + // Aggregator set, we either put objects into an AppendOnlyMap where we combine them, or we + // store them in an array buffer. + var map = new SizeTrackingAppendOnlyMap[(Int, K), C] + var buffer = new SizeTrackingBuffer[((Int, K), C)] + + // Track how many elements we've read before we try to estimate memory. Ideally we'd use + // map.size or buffer.size for this, but because users' Aggregators can potentially increase + // the size of a merged element when we add values with the same key, it's safer to track + // elements read from the input iterator.
+ private var elementsRead = 0L + private val trackMemoryThreshold = 1000 + + // Spilling statistics + private var spillCount = 0 + private var _memoryBytesSpilled = 0L + private var _diskBytesSpilled = 0L + + // Collective memory threshold shared across all running tasks + private val maxMemoryThreshold = { +val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.3) +val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8) +(Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong + } + + // A comparator for keys K that orders them within a partition to allow partial aggregation. + // Can be a partial ordering by hash code if a total ordering is not provided through by the
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49660273 QA results for PR 1460:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class AutoSerializer(FramedSerializer):
  class Merger(object):
  class MapMerger(Merger):
  class ExternalHashMapMerger(Merger):

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16918/consoleFull
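The `Merger` hierarchy reported above hints at the design: a `MapMerger` keeps everything in an in-memory dict, while an `ExternalHashMapMerger` spills to disk. A much-simplified, hypothetical sketch of the hash-based spilling idea (the bucket layout and thresholds here are assumptions, not PySpark's actual implementation):

```python
import os
import pickle
import tempfile

def external_merge(pairs, combine, num_buckets=4, spill_threshold=1000):
    # Merge (key, value) pairs in memory until the map exceeds
    # spill_threshold entries, then spill every entry to one of
    # num_buckets files chosen by hash(key). Afterwards, merge the
    # buckets one at a time so memory is bounded by the largest bucket.
    spill_dir = tempfile.mkdtemp()
    files = [open(os.path.join(spill_dir, "bucket-%d" % i), "wb+")
             for i in range(num_buckets)]
    current = {}

    def spill():
        for k, v in current.items():
            pickle.dump((k, v), files[hash(k) % num_buckets])
        current.clear()

    for k, v in pairs:
        current[k] = combine(current[k], v) if k in current else v
        if len(current) > spill_threshold:
            spill()
    spill()  # flush whatever is left in memory

    merged = {}
    for f in files:
        f.seek(0)
        bucket = {}
        try:
            while True:
                k, v = pickle.load(f)
                bucket[k] = combine(bucket[k], v) if k in bucket else v
        except EOFError:
            pass
        merged.update(bucket)
        f.close()
    return merged

# Word count with a tiny threshold to force spilling:
counts = external_merge([("a", 1), ("b", 1), ("a", 1)],
                        lambda x, y: x + y, spill_threshold=1)
print(counts)  # {'a': 2, 'b': 1} (key order may vary)
```

Because every occurrence of a key hashes to the same bucket file, each bucket can be combined independently during the final pass.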
[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15193928 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection + +import java.io._ +import java.util.Comparator + +import scala.collection.mutable.ArrayBuffer +import scala.collection.mutable + +import com.google.common.io.ByteStreams + +import org.apache.spark.{Aggregator, SparkEnv, Logging, Partitioner} +import org.apache.spark.serializer.Serializer +import org.apache.spark.storage.BlockId + +/** + * Sorts and potentially merges a number of key-value pairs of type (K, V) to produce key-combiner + * pairs of type (K, C). Uses a Partitioner to first group the keys into partitions, and then + * optionally sorts keys within each partition using a custom Comparator. Can output a single + * partitioned file with a different byte range for each partition, suitable for shuffle fetches. + * + * If combining is disabled, the type C must equal V -- we'll cast the objects at the end.
+ * + * @param aggregator optional Aggregator with combine functions to use for merging data + * @param partitioner optional partitioner; if given, sort by partition ID and then key + * @param ordering optional ordering to sort keys within each partition + * @param serializer serializer to use when spilling to disk + */ +private[spark] class ExternalSorter[K, V, C]( +aggregator: Option[Aggregator[K, V, C]] = None, +partitioner: Option[Partitioner] = None, +ordering: Option[Ordering[K]] = None, +serializer: Option[Serializer] = None) extends Logging { + + private val numPartitions = partitioner.map(_.numPartitions).getOrElse(1) + private val shouldPartition = numPartitions > 1 + + private val blockManager = SparkEnv.get.blockManager + private val diskBlockManager = blockManager.diskBlockManager + private val ser = Serializer.getSerializer(serializer) + private val serInstance = ser.newInstance() + + private val conf = SparkEnv.get.conf + private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024 + private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 1) + + private def getPartition(key: K): Int = { +if (shouldPartition) partitioner.get.getPartition(key) else 0 + } + + // Data structures to store in-memory objects before we spill. Depending on whether we have an + // Aggregator set, we either put objects into an AppendOnlyMap where we combine them, or we + // store them in an array buffer. + var map = new SizeTrackingAppendOnlyMap[(Int, K), C] + var buffer = new SizeTrackingBuffer[((Int, K), C)] + + // Track how many elements we've read before we try to estimate memory. Ideally we'd use + // map.size or buffer.size for this, but because users' Aggregators can potentially increase + // the size of a merged element when we add values with the same key, it's safer to track + // elements read from the input iterator.
+ private var elementsRead = 0L + private val trackMemoryThreshold = 1000 + + // Spilling statistics + private var spillCount = 0 + private var _memoryBytesSpilled = 0L + private var _diskBytesSpilled = 0L + + // Collective memory threshold shared across all running tasks + private val maxMemoryThreshold = { +val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.3) +val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8) +(Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong + } + + // A comparator for keys K that orders them within a partition to allow partial aggregation. + // Can be a partial ordering by hash code if a total ordering is not provided through by the
[GitHub] spark pull request: [SPARK-2555] Support configuration spark.sched...
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/1462#issuecomment-49661602 @tgravescs I actually mentioned this race condition in the previous PR: https://github.com/apache/spark/pull/900#diff-for-comment-14205738 . In the future we should try to be more careful about merging things that have un-replied-to comments (I'm about to send an email to the dev list about this). @li-zhihui if someone points out a problem in a pull request you submit, the expectation is that it will be fixed when you reply to the comment. Can you please submit a new pull request that fixes the race condition with standalone mode, before we proceed with adding this functionality to Mesos mode?
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15194471
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -140,14 +145,36 @@ private[spark] class CacheManager(blockManager: BlockManager) extends Logging {
         throw new BlockException(key, s"Block manager failed to return cached value for $key!")
       }
     } else {
-      /* This RDD is to be cached in memory. In this case we cannot pass the computed values
+      /*
+       * This RDD is to be cached in memory. In this case we cannot pass the computed values
        * to the BlockManager as an iterator and expect to read it back later. This is because
-       * we may end up dropping a partition from memory store before getting it back, e.g.
-       * when the entirety of the RDD does not fit in memory. */
-      val elements = new ArrayBuffer[Any]
-      elements ++= values
-      updatedBlocks ++= blockManager.put(key, elements, storageLevel, tellMaster = true)
-      elements.iterator.asInstanceOf[Iterator[T]]
+       * we may end up dropping a partition from memory store before getting it back.
+       *
+       * In addition, we must be careful to not unroll the entire partition in memory at once.
+       * Otherwise, we may cause an OOM exception if the JVM does not have enough space for this
+       * single partition. Instead, we unroll the values cautiously, potentially aborting and
+       * dropping the partition to disk if applicable.
+       */
+      blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
+        case Left(arr) =>
+          // We have successfully unrolled the entire partition, so cache it in memory
+          updatedBlocks ++=
+            blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
+          arr.iterator.asInstanceOf[Iterator[T]]
+        case Right(it) =>
+          // There is not enough space to cache this partition in memory
+          logWarning(s"Not enough space to cache $key in memory! " +
+            s"Free memory is ${blockManager.memoryStore.freeMemory} bytes.")
+          var returnValues = it.asInstanceOf[Iterator[T]]
+          if (putLevel.useDisk) {
+            logWarning(s"Persisting $key to disk instead.")
+            val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
+              useOffHeap = false, deserialized = false, putLevel.replication)
+            returnValues =
+              putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
+          }
+          returnValues
+      }
--- End diff --
Yes. The existing BM interface does not return the values you just put (for good reasons), but once we add it, this entire method can be simplified to a wrapper, and there won't be duplicate logic between here and the `get` from disk case.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
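The cautious-unroll idea discussed in the comment above can be sketched outside Spark. This is a minimal, hypothetical model of what `unrollSafely` does, not Spark's actual API: the `Left`/`Right` result is mimicked with a tag string, and the byte estimate via `sys.getsizeof` is an assumption (Spark's MemoryStore accounts memory against a shared pool instead).

```python
import sys
from itertools import chain

def unroll_safely(values, max_bytes, check_every=16):
    """Materialize an iterator into a list unless its estimated size exceeds
    max_bytes. Returns ("left", list) on success, or ("right", iterator)
    carrying the already-read elements plus the rest, so the caller can fall
    back to writing the partition to disk. Hypothetical sketch, not Spark's API."""
    unrolled = []
    estimated = 0
    for i, v in enumerate(values):
        unrolled.append(v)
        estimated += sys.getsizeof(v)  # crude per-element size estimate
        if (i + 1) % check_every == 0 and estimated > max_bytes:
            # Abort unrolling: hand everything back without losing elements
            return "right", chain(iter(unrolled), values)
    return "left", unrolled

# A small partition fits and is fully unrolled
tag, result = unroll_safely(iter(range(10)), max_bytes=1 << 20)
assert tag == "left" and list(result) == list(range(10))

# A huge partition aborts unrolling, but no element is lost
tag, result = unroll_safely(iter(range(100000)), max_bytes=1024)
assert tag == "right"
assert list(result) == list(range(100000))
```

The key property, as in the PR, is that aborting never drops data: the partially unrolled prefix is re-chained with the unconsumed tail of the iterator.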
[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15194652
--- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -0,0 +1,573 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import java.io._
+import java.util.Comparator
+
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable
+
+import com.google.common.io.ByteStreams
+
+import org.apache.spark.{Aggregator, SparkEnv, Logging, Partitioner}
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.storage.BlockId
+
+/**
+ * Sorts and potentially merges a number of key-value pairs of type (K, V) to produce key-combiner
+ * pairs of type (K, C). Uses a Partitioner to first group the keys into partitions, and then
+ * optionally sorts keys within each partition using a custom Comparator. Can output a single
+ * partitioned file with a different byte range for each partition, suitable for shuffle fetches.
+ *
+ * If combining is disabled, the type C must equal V -- we'll cast the objects at the end.
+ *
+ * @param aggregator optional Aggregator with combine functions to use for merging data
+ * @param partitioner optional partitioner; if given, sort by partition ID and then key
+ * @param ordering optional ordering to sort keys within each partition
+ * @param serializer serializer to use when spilling to disk
+ */
+private[spark] class ExternalSorter[K, V, C](
+    aggregator: Option[Aggregator[K, V, C]] = None,
+    partitioner: Option[Partitioner] = None,
+    ordering: Option[Ordering[K]] = None,
+    serializer: Option[Serializer] = None) extends Logging {
+
+  private val numPartitions = partitioner.map(_.numPartitions).getOrElse(1)
+  private val shouldPartition = numPartitions > 1
+
+  private val blockManager = SparkEnv.get.blockManager
+  private val diskBlockManager = blockManager.diskBlockManager
+  private val ser = Serializer.getSerializer(serializer)
+  private val serInstance = ser.newInstance()
+
+  private val conf = SparkEnv.get.conf
+  private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024
+  private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 1)
+
+  private def getPartition(key: K): Int = {
+    if (shouldPartition) partitioner.get.getPartition(key) else 0
+  }
+
+  // Data structures to store in-memory objects before we spill. Depending on whether we have an
+  // Aggregator set, we either put objects into an AppendOnlyMap where we combine them, or we
+  // store them in an array buffer.
+  var map = new SizeTrackingAppendOnlyMap[(Int, K), C]
+  var buffer = new SizeTrackingBuffer[((Int, K), C)]
+
+  // Track how many elements we've read before we try to estimate memory. Ideally we'd use
+  // map.size or buffer.size for this, but because users' Aggregators can potentially increase
+  // the size of a merged element when we add values with the same key, it's safer to track
+  // elements read from the input iterator.
+  private var elementsRead = 0L
+  private val trackMemoryThreshold = 1000
+
+  // Spilling statistics
+  private var spillCount = 0
+  private var _memoryBytesSpilled = 0L
+  private var _diskBytesSpilled = 0L
+
+  // Collective memory threshold shared across all running tasks
+  private val maxMemoryThreshold = {
+    val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.3)
+    val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
+    (Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
+  }
+
+  // A comparator for keys K that orders them within a partition to allow partial aggregation.
+  // Can be a partial ordering by hash code if a total ordering is not provided through by the
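In miniature, the contract described in the scaladoc above — group keys into partitions, optionally merge values per key with an aggregator, optionally sort keys within each partition — looks like the sketch below. All names are illustrative and everything stays in memory; the real ExternalSorter spills to disk and writes one partitioned file.

```python
def sort_and_combine(records, num_partitions, create=None, merge=None, key_ordering=None):
    """Group (k, v) records into partitions by hash of the key, optionally
    merging values per key (aggregator) and sorting keys within each
    partition. Returns {partition_id: [(k, c), ...]}. In-memory sketch of
    the idea behind ExternalSorter, not its real implementation."""
    partitions = {}
    for k, v in records:
        pid = hash(k) % num_partitions          # the "Partitioner"
        bucket = partitions.setdefault(pid, {})
        if merge is not None:
            # Combine values for the same key as they arrive
            bucket[k] = merge(bucket[k], v) if k in bucket else create(v)
        else:
            # No aggregator: keep all values per key
            bucket.setdefault(k, []).append(v)
    out = {}
    for pid, bucket in partitions.items():
        items = list(bucket.items())
        if key_ordering is not None:            # the optional Comparator
            items.sort(key=key_ordering)
        out[pid] = items
    return out

counts = sort_and_combine(
    [("a", 1), ("b", 1), ("a", 1)],
    num_partitions=1,
    create=lambda v: v,
    merge=lambda c, v: c + v,
    key_ordering=lambda kv: kv[0],
)
assert counts == {0: [("a", 2), ("b", 1)]}
```

This also shows why "C must equal V when combining is disabled": without `create`/`merge` the output carries raw value lists rather than combiners.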
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15194738
--- Diff: core/src/main/scala/org/apache/spark/util/collection/SizeTrackingAppendOnlyBuffer.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.reflect.ClassTag
+
+/**
+ * An append-only buffer that keeps track of its estimated size in bytes.
+ */
+private[spark] class SizeTrackingAppendOnlyBuffer[T: ClassTag]
+  extends PrimitiveVector[T]
+  with SizeTracker {
+
+  override def +=(value: T): Unit = {
+    super.+=(value)
+    super.afterUpdate()
+  }
+
+  override def resize(newLength: Int): PrimitiveVector[T] = {
+    super.resize(newLength)
+    resetSamples()
+    this
+  }
+
+  override def array: Array[T] = {
--- End diff --
It was called `array` because this overrides `PrimitiveVector#array`, which returns the untrimmed version. It might be a little confusing for this class to have both a `toArray` method and an `array` method that do different things. ---
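The `SizeTracker` mixin this buffer combines with estimates its byte size by occasional sampling rather than measuring on every append. The sketch below illustrates that idea with invented names and a made-up sampling schedule; Spark's actual SizeTracker samples at exponentially growing intervals and extrapolates a bytes-per-update rate from the last two samples.

```python
import sys

class SizeTrackingBuffer:
    """Append-only buffer that estimates its size in bytes by sampling
    sys.getsizeof at exponentially spaced appends. Illustrative sketch only."""
    SAMPLE_GROWTH = 1.1

    def __init__(self):
        self._data = []
        self.reset_samples()

    def reset_samples(self):
        # Called after structural changes (cf. resetSamples() in resize above)
        self._next_sample = 1
        self._measured = 0       # bytes measured at the last sample
        self._measured_at = 0    # element count at the last sample
        self._per_element = 0.0  # extrapolation rate

    def append(self, value):
        self._data.append(value)
        n = len(self._data)
        if n >= self._next_sample:
            # Take an expensive exact measurement only occasionally
            measured = sum(sys.getsizeof(x) for x in self._data)
            grown = n - self._measured_at
            if grown > 0:
                self._per_element = (measured - self._measured) / grown
            self._measured, self._measured_at = measured, n
            self._next_sample = int(n * self.SAMPLE_GROWTH) + 1

    def estimate_size(self):
        # Cheap O(1) estimate: last measurement plus extrapolated growth
        extra = len(self._data) - self._measured_at
        return int(self._measured + extra * self._per_element)

buf = SizeTrackingBuffer()
for i in range(1000):
    buf.append(i)
true_size = sum(sys.getsizeof(x) for x in buf._data)
# The estimate should land within a reasonable factor of the true size
assert 0.5 * true_size <= buf.estimate_size() <= 2 * true_size
```

The payoff is that `estimate_size()` is O(1), so the caller (e.g. a spill check on every insert) stays cheap.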
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15195205
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -463,16 +463,15 @@ private[spark] class BlockManager(
           val values = dataDeserialize(blockId, bytes)
           if (level.deserialized) {
             // Cache the values before returning them
-            // TODO: Consider creating a putValues that also takes in a iterator?
-            val valuesBuffer = new ArrayBuffer[Any]
-            valuesBuffer ++= values
-            memoryStore.putValues(blockId, valuesBuffer, level, returnValues = true).data match {
-              case Left(values2) =>
-                return Some(new BlockResult(values2, DataReadMethod.Disk, info.size))
-              case _ =>
-                throw new SparkException("Memory store did not return back an iterator")
-            }
+            val putResult = memoryStore.putValues(
+              blockId, values, level, returnValues = true, allowPersistToDisk = false)
+            putResult.data match {
+              case Left(it) =>
+                return Some(new BlockResult(it, DataReadMethod.Disk, info.size))
+              case _ =>
+                // This only happens if we dropped the values back to disk (which is never)
+                throw new SparkException("Memory store did not return an iterator!")
+            }
--- End diff --
Ok, I added it in my latest commit. ---
[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15195282
--- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- (quotes the same ExternalSorter.scala context shown above) ---
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15195362
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -561,13 +562,14 @@ private[spark] class BlockManager(
     iter
   }
 
-  def put(
+  def putIterator(
--- End diff --
This is renamed to avoid method signature conflicts with the new `putArray`, since this PR introduced a new optional parameter `effectiveStorageLevel`. ---
[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...
Github user GregOwen commented on the pull request: https://github.com/apache/spark/pull/1364#issuecomment-49664457 Jenkins, retest this please ---
[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15196170
--- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- (quotes the same ExternalSorter.scala context shown above) ---
[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1364#issuecomment-49665190 QA tests have started for PR 1364. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16921/consoleFull ---
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1515#issuecomment-49665253 Jenkins, add to whitelist. ---
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1515#issuecomment-49665265 Jenkins, test this please. ---
[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1364#issuecomment-49665245 QA results for PR 1364:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16921/consoleFull ---
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1515#issuecomment-49665852 QA tests have started for PR 1515. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16922/consoleFull ---
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1515#issuecomment-49665933 QA results for PR 1515:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16922/consoleFull ---
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1515#discussion_r15196712
--- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala ---
@@ -117,7 +129,7 @@ object LocalALS {
       }
       case _ => {
         System.err.println("Usage: LocalALS M U F iters")
-        System.exit(1)
--- End diff --
Why is this removed? ---
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1515#discussion_r15196701
--- Diff: examples/src/main/python/als.py ---
@@ -17,6 +17,8 @@
 This example requires numpy (http://www.numpy.org/)
 
+This is an example implementation of ALS for learning how to use Spark. Please refer to
--- End diff --
It may be better to move this sentence before "This example requires numpy". ---
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1515#discussion_r15196704
--- Diff: examples/src/main/python/als.py ---
@@ -49,6 +51,9 @@ def update(i, vec, mat, ratings):
 
 if __name__ == "__main__":
+
+    print "WARNING: THIS IS A NAIVE IMPLEMENTATION OF ALS AND IS GIVEN AS AN EXAMPLE!"
--- End diff --
Use `print >> sys.stderr, "WARNING..."` (output to stderr). This block should be under the `Usage` block, because `print` is part of the execution, which should come after the doc. Only capitalizing `WARNING` (or the shorter `WARN`) should be sufficient; it is hard to read with all uppercase letters. ---
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1515#discussion_r15196708
--- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala ---
@@ -24,7 +24,8 @@ import cern.colt.matrix.linalg._
 import cern.jet.math._
 
 /**
- * Alternating least squares matrix factorization.
+ * Alternating least squares matrix factorization. This is an example implementation for learning how to use Spark.
--- End diff --
Line is too long. Please keep it under 100 characters. ---
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1515#discussion_r15196717
--- Diff: examples/src/main/scala/org/apache/spark/examples/LocalFileLR.scala ---
@@ -21,6 +21,10 @@ import java.util.Random
 
 import breeze.linalg.{Vector, DenseVector}
 
+/**
+ * Logistic regression based classification. This is an example implementation for learning how to use Spark.
--- End diff --
Line too wide. The default setting in IntelliJ uses 120 characters, but in Spark we use 100. You can adjust the number in the settings. ---
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1515#discussion_r15196747
--- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala ---
@@ -26,7 +26,8 @@ import cern.jet.math._
 
 import org.apache.spark._
 
 /**
- * Alternating least squares matrix factorization.
+ * Alternating least squares matrix factorization. This is an example implementation for learning how to use Spark.
--- End diff --
ditto: too long ---
[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1515#discussion_r15196775
--- Diff: examples/src/main/scala/org/apache/spark/examples/SparkLR.scala ---
@@ -47,12 +48,23 @@ object SparkLR {
     Array.tabulate(N)(generatePoint)
   }
 
+  def showWarning() {
+    System.err.println(
+      """WARNING: THIS IS A NAIVE IMPLEMENTATION OF LOGISTIC REGRESSION AND IS GIVEN AS AN EXAMPLE!
+        |PLEASE USE THE LogisticRegression METHOD FOUND IN org.apache.spark.mllib.classification FOR
+        |MORE CONVENTIONAL USE
+      """.stripMargin)
+  }
+
   def main(args: Array[String]) {
+    showWarning()
+
     val sparkConf = new SparkConf().setAppName("SparkLR")
     val sc = new SparkContext(sparkConf)
     val numSlices = if (args.length > 0) args(0).toInt else 2
     val points = sc.parallelize(generateData, numSlices).cache()
+
--- End diff --
remove extra empty line ---
[GitHub] spark pull request: [WIP][SPARK-2454] Do not assume drivers and ex...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1472#issuecomment-49666445 QA tests have started for PR 1472. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16923/consoleFull ---
[GitHub] spark pull request: [SPARK-2567] Resubmitted stage sometimes remai...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1516#issuecomment-49666932 Can one of the admins verify this patch? ---
[GitHub] spark pull request: [SPARK-2567] Resubmitted stage sometimes remai...
GitHub user tsudukim opened a pull request: https://github.com/apache/spark/pull/1516 [SPARK-2567] Resubmitted stage sometimes remains as active stage in the web UI
Moved the line that posts SparkListenerStageSubmitted to after the checks of task size and serializability.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tsudukim/spark feature/SPARK-2567
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1516.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1516
commit 79f4f98f5d08b7f52bb0216f2b412d959e64ad89 Author: Masayoshi TSUZUKI tsudu...@oss.nttdata.co.jp Date: 2014-07-21T21:05:42Z [SPARK-2567] Resubmitted stage sometimes remains as active stage in the web UI
Moved the line that posts SparkListenerStageSubmitted to after the checks of task size and serializability. ---
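The essence of this fix is an ordering of side effects: listeners should only hear "stage submitted" after the submission has passed its validation (task size, serializability), otherwise a failed submission leaves a ghost "active" stage in the UI. A generic sketch of that rule, with invented names (this is not Spark's DAGScheduler):

```python
class MiniScheduler:
    """Toy scheduler illustrating 'validate first, notify listeners second'.
    If validation raises, no stage-submitted event is ever posted, so a UI
    fed by these events never shows a permanently active ghost stage."""

    def __init__(self, max_task_bytes=100):
        self.max_task_bytes = max_task_bytes
        self.events = []  # stands in for the listener bus

    def submit_stage(self, stage_id, serialized_task):
        # 1. Validate before any externally visible side effect
        if len(serialized_task) > self.max_task_bytes:
            raise ValueError("task too large: %d bytes" % len(serialized_task))
        # 2. Only now post the event
        self.events.append(("stage_submitted", stage_id))

sched = MiniScheduler(max_task_bytes=10)
sched.submit_stage(1, b"tiny")
try:
    sched.submit_stage(2, b"x" * 1000)
except ValueError:
    pass
# Stage 2 failed validation, so it never appears as submitted
assert sched.events == [("stage_submitted", 1)]
```

Had the event been posted first (as before the patch), stage 2 would sit in `events` with no matching completion, which is exactly the stuck-active-stage symptom in the web UI.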
[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1505#discussion_r15197312

--- Diff: python/pyspark/statcounter.py ---
@@ -124,5 +125,5 @@ def sampleStdev(self):
         return math.sqrt(self.sampleVariance())

     def __repr__(self):
-        return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" % (self.count(), self.mean(), self.stdev(), self.max(), self.min())
-
+        return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" %
--- End diff --

Need `\` and indent here.
[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1505#discussion_r15197389

--- Diff: python/pyspark/statcounter.py ---
@@ -124,5 +125,5 @@ def sampleStdev(self):
         return math.sqrt(self.sampleVariance())

     def __repr__(self):
-        return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" % (self.count(), self.mean(), self.stdev(), self.max(), self.min())
-
+        return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" %
+            (self.count(), self.mean(), self.stdev(), self.max(), self.min())
--- End diff --

indent
[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1505#discussion_r15197383

--- Diff: python/pyspark/statcounter.py ---
@@ -124,5 +125,5 @@ def sampleStdev(self):
         return math.sqrt(self.sampleVariance())

     def __repr__(self):
-        return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" % (self.count(), self.mean(), self.stdev(), self.max(), self.min())
-
+        return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" %
--- End diff --

Need \
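[Editor's note] The three review comments above all concern the same point: once the `%`-formatting expression in `StatCounter.__repr__` is split across lines, it needs either a trailing backslash or enclosing parentheses to stay valid. A minimal sketch of both accepted forms, with placeholder values standing in for the real StatCounter methods:

```python
# Two PEP 8-compatible ways to split a long "%"-formatting expression.
# The values below are stand-ins for self.count(), self.mean(), etc.
count, mean, stdev, mx, mn = 5, 2.0, 1.1, 4, 1

# Option 1: explicit backslash continuation (what the reviewer asks for).
repr_a = "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" % \
    (count, mean, stdev, mx, mn)

# Option 2: parenthesize the whole expression; no backslash needed, and
# the "%" operator can then lead the continuation line.
repr_b = ("(count: %s, mean: %s, stdev: %s, max: %s, min: %s)"
          % (count, mean, stdev, mx, mn))

assert repr_a == repr_b
```

PEP 8 prefers the parenthesized form; the backslash form is what the reviewer's shorthand asks for in this diff.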
[GitHub] spark pull request: Fix flakey HiveQuerySuite test
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1514#issuecomment-49667765 Thanks for the fix. Looks good to me.
[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1505#discussion_r15197548

--- Diff: python/pyspark/shell.py ---
@@ -35,7 +35,8 @@
 from pyspark.storagelevel import StorageLevel

 # this is the equivalent of ADD_JARS
-add_files = os.environ.get("ADD_FILES").split(',') if os.environ.get("ADD_FILES") is not None else None
+add_files = (
+    os.environ.get("ADD_FILES").split(',') if os.environ.get("ADD_FILES") is not None else None)
--- End diff --

It's better to break this line before `if`.
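[Editor's note] The suggestion above is about where the implicit continuation breaks, not whether it is legal. A runnable sketch of the break-before-`if` layout, using a demo value for the `ADD_FILES` environment variable:

```python
import os

# Sketch of the reviewer's preferred layout: inside parentheses, break the
# conditional expression before "if" and "else" rather than mid-call.
os.environ["ADD_FILES"] = "extra1.py,extra2.py"  # demo value for this sketch

add_files = (
    os.environ.get("ADD_FILES").split(',')
    if os.environ.get("ADD_FILES") is not None
    else None)

print(add_files)  # ['extra1.py', 'extra2.py']
```

Breaking before the keywords keeps each branch of the conditional expression on its own line, which reads better than splitting the `os.environ.get(...)` call itself.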
[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1505#discussion_r15197618

--- Diff: python/pyspark/serializers.py ---
@@ -252,18 +251,20 @@ def load_stream(self, stream):
             yield pair

     def __eq__(self, other):
-        return isinstance(other, PairDeserializer) and \
-            self.key_ser == other.key_ser and self.val_ser == other.val_ser
+        return isinstance(other, PairDeserializer) and
+            self.key_ser == other.key_ser and self.val_ser == other.val_ser
--- End diff --

??
[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1505#discussion_r15197632

--- Diff: python/pyspark/serializers.py ---
@@ -229,8 +228,8 @@ def load_stream(self, stream):
             yield pair

     def __eq__(self, other):
-        return isinstance(other, CartesianDeserializer) and \
-            self.key_ser == other.key_ser and self.val_ser == other.val_ser
+        return isinstance(other, CartesianDeserializer) and
+            self.key_ser == other.key_ser and self.val_ser == other.val_ser
--- End diff --

??
[GitHub] spark pull request: [YARN][SPARK-2606]:In some cases,the spark UI ...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1501#issuecomment-49668001 I see, the environment variable / system property for `uiRoot` may not have been set yet by the time we load the page. LGTM.
[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1505#discussion_r15197643

--- Diff: python/pyspark/serializers.py ---
@@ -197,8 +196,8 @@ def _load_stream_without_unbatching(self, stream):
         return self.serializer.load_stream(stream)

     def __eq__(self, other):
-        return isinstance(other, BatchedSerializer) and \
-            other.serializer == self.serializer
+        return isinstance(other, BatchedSerializer) and
+            other.serializer == self.serializer
--- End diff --

??
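[Editor's note] The reviewer's `??` marks are well placed: once the trailing `\` is removed, a line ending in a bare `and` no longer continues, and the method body stops compiling. A small demonstration, using throwaway stand-in names rather than the real deserializer classes:

```python
# A bare trailing "and" does not continue a line in Python; the statement
# ends at the newline and the parser sees an incomplete expression.
key_ser, val_ser = "pickle", "pickle"
other_key, other_val = "pickle", "pickle"

# The broken form from the diff, checked via compile() so the SyntaxError
# can be caught instead of killing the script:
bad = "result = (key_ser == other_key) and\n    (val_ser == other_val)\n"
try:
    compile(bad, "<diff>", "exec")
    raised = False
except SyntaxError:
    raised = True
print(raised)  # True

# Working alternatives: a backslash continuation, or parentheses.
ok_backslash = key_ser == other_key and \
    val_ser == other_val
ok_parens = (key_ser == other_key and
             val_ser == other_val)
```

So the PEP 8 fix here is not to delete the backslashes but to wrap the whole `return` expression in parentheses, after which the `and` may trail freely.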
[GitHub] spark pull request: [SPARK-2609] Log thread ID when spilling Exter...
GitHub user andrewor14 opened a pull request: https://github.com/apache/spark/pull/1517 [SPARK-2609] Log thread ID when spilling ExternalAppendOnlyMap It's useful to know whether one thread is constantly spilling or multiple threads are spilling relatively infrequently. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewor14/spark external-log Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1517.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1517 commit 90e48bb52ac52d79e8d1a2ffcba02c9852d85dfc Author: Andrew Or andrewo...@gmail.com Date: 2014-07-21T21:14:27Z Log thread ID when spilling
[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1504#discussion_r15197742

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
@@ -239,9 +239,9 @@ private[sql] object JsonRDD extends Logging {
       // .map(identity) is used as a workaround of non-serializable Map
       // generated by .mapValues.
       // This issue is documented at https://issues.scala-lang.org/browse/SI-7005
-      map.toMap.mapValues(scalafy).map(identity)
+      JMapWrapper(map).mapValues(scalafy).map(identity)
--- End diff --

Should we be using the `.asScala` / `.asJava` methods in [`JavaConverters`](http://www.scala-lang.org/api/2.10.2/index.html#scala.collection.JavaConverters$) instead of creating these classes manually? It seems like that's more robust to changes in the Scala library, and they handle cases like going back and forth efficiently.
[GitHub] spark pull request: [SPARK-2609] Log thread ID when spilling Exter...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1517#issuecomment-49668361 QA tests have started for PR 1517. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16924/consoleFull
[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1505#discussion_r15197750

--- Diff: python/pyspark/context.py ---
@@ -192,15 +191,19 @@ def _ensure_initialized(cls, instance=None, gateway=None):
             SparkContext._writeToFile = SparkContext._jvm.PythonRDD.writeToFile

         if instance:
-            if SparkContext._active_spark_context and SparkContext._active_spark_context != instance:
+            if SparkContext._active_spark_context and
+                    SparkContext._active_spark_context != instance:
--- End diff --

??