[GitHub] spark pull request: SPARK-1338 [scalastyle] Ensure or disallow spa...

2014-07-21 Thread ScrapCodes
Github user ScrapCodes closed the pull request at:

https://github.com/apache/spark/pull/284


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Set configuration spark.history.retainedAppli...

2014-07-21 Thread XuTingjun
GitHub user XuTingjun opened a pull request:

https://github.com/apache/spark/pull/1509

Set configuration spark.history.retainedApplications to be effective

When spark.history.retainedApplications is set to 1, the history server web UI
retains more than one application.
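For context, a minimal sketch (not the PR's code; the application IDs and the default value are hypothetical) of the semantics the setting is supposed to have:

    val conf = new org.apache.spark.SparkConf()
    val retained = conf.getInt("spark.history.retainedApplications", 50) // illustrative default
    val loadedApps = Seq("app-003", "app-002", "app-001")                // hypothetical IDs, newest first
    val shown = loadedApps.take(retained)                                // the UI should keep only the newest N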

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/XuTingjun/spark bug-fix1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1509.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1509


commit f4f1a41039bb8313e770394bcf9ddf8a3c0a4d66
Author: XuTingjun <1039320...@qq.com>
Date:   2014-07-21T08:16:33Z

Update FsHistoryProvider.scala






[GitHub] spark pull request: SPARK-2497 Included checks for module symbols ...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1463#issuecomment-49581362
  
QA tests have started for PR 1463. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16910/consoleFull




[GitHub] spark pull request: Set configuration spark.history.retainedAppli...

2014-07-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1509#issuecomment-49581609
  
Can one of the admins verify this patch?




[GitHub] spark pull request: SPARK-2565. Update ShuffleReadMetrics as block...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1507#issuecomment-49581839
  
QA results for PR 1507:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16901/consoleFull




[GitHub] spark pull request: [SPARK-2103][Streaming] Change to ClassTag for...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1508#issuecomment-49581911
  
QA results for PR 1508:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16900/consoleFull




[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...

2014-07-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1497#discussion_r15158553
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -152,6 +155,34 @@ class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Bool
   }
 
   /**
+   * This rule finds expressions in HAVING clause filters that depend on
+   * unresolved attributes.  It pushes these expressions down to the underlying
+   * aggregates and then projects them away above the filter.
+   */
+  object UnresolvedHavingClauseAttributes extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case pl @ Filter(fexp, agg @ Aggregate(_, ae, _)) if !fexp.childrenResolved => {
+        val alias = Alias(fexp, makeTmp())()
+        val aggExprs = Seq(alias) ++ ae
+
+        val newCond = EqualTo(Cast(alias.toAttribute, BooleanType), Literal(true, BooleanType))
+
+        val newFilter = ResolveReferences(pl.copy(condition = newCond,
+          child = agg.copy(aggregateExpressions = aggExprs)))
+
+        Project(pl.output, newFilter)
+      }
+    }
+
+    private val curId = new java.util.concurrent.atomic.AtomicLong()
+
+    private def makeTmp() = {
+      val id = curId.getAndIncrement()
+      s"tmp_cond_$id"
--- End diff --

[Some more details about NamedExpressions](http://people.apache.org/~marmbrus/docs/catalyst/#org.apache.spark.sql.catalyst.expressions.package)
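For context, a hedged sketch of the rewrite the quoted rule performs (the query and the temporary name tmp_cond_0 are illustrative, following the diff's makeTmp() naming):

    // A HAVING condition such as:  SELECT key FROM src GROUP BY key HAVING COUNT(*) > 1
    // arrives as  Filter(cond, Aggregate(group, aggExprs, child))  where cond is unresolved.
    // The rule aliases the condition, pushes it into the aggregate list, filters on the
    // aliased attribute, and projects it away again:
    //   Project(originalOutput,
    //     Filter(tmp_cond_0 = true,
    //       Aggregate(group, Alias(cond, "tmp_cond_0") +: aggExprs, child)))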




[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1502#issuecomment-49582954
  
QA results for PR 1502:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
class Sorter[K, Buffer] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16902/consoleFull




[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1502#issuecomment-49584160
  
QA results for PR 1502:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
class Sorter[K, Buffer] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16903/consoleFull




[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-21 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1392#issuecomment-49584362
  
Hi @andrewor14 , that's ok. Thanks.




[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1371#issuecomment-49585327
  
QA results for PR 1371:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16906/consoleFull




[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1502#issuecomment-49585784
  
QA results for PR 1502:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
class Sorter[K, Buffer] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16905/consoleFull




[GitHub] spark pull request: [SPARK-2549] Functions defined inside of other...

2014-07-21 Thread ScrapCodes
GitHub user ScrapCodes opened a pull request:

https://github.com/apache/spark/pull/1510

[SPARK-2549] Functions defined inside of other functions trigger failures



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ScrapCodes/spark-1 SPARK-2549/fun-in-fun

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1510.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1510


commit f24adc2b82ad77f64432c71350b88e8d34ffc8dc
Author: Prashant Sharma <prashan...@imaginea.com>
Date:   2014-07-21T09:56:56Z

SPARK-2549 Functions defined inside of other functions trigger failures






[GitHub] spark pull request: [SPARK-2549] Functions defined inside of other...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1510#issuecomment-49589031
  
QA tests have started for PR 1510. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16911/consoleFull




[GitHub] spark pull request: SPARK-2497 Included checks for module symbols ...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1463#issuecomment-49589063
  
QA results for PR 1463:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16910/consoleFull




[GitHub] spark pull request: [SPARK-2103][Streaming] Change to ClassTag for...

2014-07-21 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1508#issuecomment-49590837
  
It's the MiMa test that fails, since the method signature is changed. It's
possible to keep and deprecate the existing method, of course. Should we just do
that, or is it OK to remove the method on the grounds that the API doesn't quite work?
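For illustration, a minimal sketch of the keep-and-deprecate option (the class and method names here are hypothetical, not the API under discussion):

    class Api {
      // New method with the changed signature.
      def process(input: String, strict: Boolean): Unit = { /* ... */ }

      // Old signature kept as a deprecated forwarder, so MiMa still sees a
      // binary-compatible method and existing callers keep working.
      @deprecated("Use process(input, strict) instead", "1.1.0")
      def process(input: String): Unit = process(input, strict = false)
    }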




[GitHub] spark pull request: [SPARK-2549] Functions defined inside of other...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1510#issuecomment-49595816
  
QA results for PR 1510:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16911/consoleFull




[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-07-21 Thread jayunit100
Github user jayunit100 commented on the pull request:

https://github.com/apache/spark/pull/1254#issuecomment-49602573
  
Adding the drop function to a contrib library of functions (which requires a
manual import), as Erik suggests, seems like a really good option. I could
see such a contrib library also being useful for other esoteric but
nevertheless important tasks, like dealing with binary data formats, etc.
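A minimal sketch of what such an explicitly imported helper might look like (the object name is hypothetical; implementing drop via zipWithIndex is one plausible approach, not necessarily the PR's):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    object RDDContribFunctions {
      // Skip the first n elements of an RDD by pairing each element with its index.
      def drop[T: ClassTag](rdd: RDD[T], n: Long): RDD[T] =
        rdd.zipWithIndex().filter { case (_, i) => i >= n }.map(_._1)
    }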




[GitHub] spark pull request: [YARN] SPARK-2577: File upload to viewfs is br...

2014-07-21 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/1483#issuecomment-49605112
  
From my understanding, when using viewfs, addDelegationTokens is supposed to
get tokens for all the underlying filesystems, so it should already have a token
for it. @gerashegalov did you test this on a secure cluster?
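For reference, a hedged sketch of the Hadoop API being discussed (the renewer name is illustrative; this assumes fs.defaultFS points at a viewfs:// mount table):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.security.Credentials

    val conf = new Configuration()
    val creds = new Credentials()
    val fs = FileSystem.get(conf) // a ViewFileSystem when fs.defaultFS is viewfs://
    // Expected to collect delegation tokens for every mounted underlying filesystem.
    val tokens = fs.addDelegationTokens("yarn" /* renewer, illustrative */, creds)
    tokens.foreach(t => println(t.getService))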




[GitHub] spark pull request: [YARN]In some cases, pages display incorrect i...

2014-07-21 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/1501#issuecomment-49605790
  
@witgo, nice catch. Can you please file a JIRA and put the details of the issue
in it? This looks good.




[GitHub] spark pull request: Fix NPE for JsonProtocol

2014-07-21 Thread witgo
GitHub user witgo opened a pull request:

https://github.com/apache/spark/pull/1511

Fix NPE for JsonProtocol



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/witgo/spark JsonProtocol

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1511.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1511


commit b187cfe8dd1a7ce047f3a2b0e5f43da0630eec83
Author: GuoQiang Li <wi...@qq.com>
Date:   2014-07-21T14:41:32Z

Fix NPE for JsonProtocol






[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...

2014-07-21 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1497#discussion_r15172123
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

[Quoted diff identical to the one in marmbrus's comment above.]
--- End diff --

@marmbrus Thanks!




[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...

2014-07-21 Thread tgravescs
GitHub user tgravescs opened a pull request:

https://github.com/apache/spark/pull/1512

SPARK-1680: use configs for specifying environment variables on YARN

Note that this also documents spark.executorEnv.*, which to me means it's
public. If we don't want that, please speak up.
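For context, a small sketch of the mechanism (the app name, variable, and path are illustrative): spark.executorEnv.[Name] sets the environment variable Name on the executors, and this PR carries the same config-based approach over to YARN.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("env-example") // illustrative
      .set("spark.executorEnv.JAVA_LIBRARY_PATH", "/opt/native/lib") // hypothetical path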

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tgravescs/spark SPARK-1680

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1512.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1512


commit 0b6c045fde1410ef626dc6beff39795467e90079
Author: Thomas Graves <tgra...@apache.org>
Date:   2014-07-21T14:44:21Z

use configs for specifying environment variables on YARN






[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1512#issuecomment-49614966
  
QA tests have started for PR 1512. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16912/consoleFull




[GitHub] spark pull request: Fix NPE for JsonProtocol

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1511#issuecomment-49614971
  
QA tests have started for PR 1511. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16913/consoleFull




[GitHub] spark pull request: [SPARK-2555] Support configuration spark.sched...

2014-07-21 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/1462#issuecomment-49616178
  
cc @kayousterhout as I think she is more familiar with standalone mode and 
scheduler details.




[GitHub] spark pull request: SPARK-1707. Remove unnecessary 3 second sleep ...

2014-07-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/634#discussion_r15173511
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -30,6 +30,11 @@ private[spark] class YarnClientSchedulerBackend(
   extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem)
   with Logging {
 
+  if (conf.getOption("spark.scheduler.minRegisteredExecutorsRatio").isEmpty) {
+    minRegisteredRatio = 0.8
--- End diff --

No real reason other than it might take longer to get to 100%. It's just kind
of a number we chose to hopefully give the user a good experience without
having to wait too long if the cluster is really busy. The user can change it
if they want.
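For example (a minimal sketch; the value 1.0 is illustrative), a user who prefers to wait for every executor can override the default that the getOption check above guards:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.scheduler.minRegisteredExecutorsRatio", "1.0") // wait for all executors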




[GitHub] spark pull request: SPARK-1707. Remove unnecessary 3 second sleep ...

2014-07-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/634#discussion_r15173596
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientClusterScheduler.scala ---
@@ -37,14 +37,4 @@ private[spark] class YarnClientClusterScheduler(sc: SparkContext, conf: Configur
     val retval = YarnAllocationHandler.lookupRack(conf, host)
     if (retval != null) Some(retval) else None
   }
-
-  override def postStartHook() {
-
-    super.postStartHook()
--- End diff --

YarnClientClusterScheduler extends TaskSchedulerImpl, so it will just fall
through to TaskSchedulerImpl.postStartHook(), which calls waitBackendReady.
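A tiny self-contained analogy of that fall-through (the class names are stand-ins, not the real scheduler classes):

    object FallThroughDemo extends App {
      class ParentScheduler { def postStartHook(): Unit = println("waitBackendReady()") }
      class ChildScheduler extends ParentScheduler // no override needed
      new ChildScheduler().postStartHook() // dispatches to the parent: prints waitBackendReady()
    }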




[GitHub] spark pull request: [Spark 2557] fix LOCAL_N_REGEX in createTaskSc...

2014-07-21 Thread advancedxy
Github user advancedxy commented on the pull request:

https://github.com/apache/spark/pull/1464#issuecomment-49617743
  
Well, could someone review this PR? Or should I close it?




[GitHub] spark pull request: SPARK-2150: Provide direct link to finished ap...

2014-07-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/1094#discussion_r15177387
  
--- Diff: yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala ---
@@ -289,7 +289,7 @@ class ExecutorLauncher(args: ApplicationMasterArguments, conf: Configuration, sp
   .asInstanceOf[FinishApplicationMasterRequest]
     finishReq.setAppAttemptId(appAttemptId)
     finishReq.setFinishApplicationStatus(status)
-    finishReq.setTrackingUrl(sparkConf.get("spark.yarn.historyServer.address", ""))
+    finishReq.setTrackingUrl(sparkConf.get("spark.driver.appUIHistoryAddress", ))
--- End diff --

typo, need the other "" in the default case.




[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-07-21 Thread scwf
GitHub user scwf opened a pull request:

https://github.com/apache/spark/pull/1513

[SPARK-2608] fix executor backend launch command in mesos mode

The mesos scheduler backend uses spark-class/spark-executor to launch the
executor backend, which leads to two problems:

1. When spark.executor.extraJavaOptions is set, CoarseMesosSchedulerBackend
throws errors because of the launch command ("./bin/spark-class
org.apache.spark.executor.CoarseGrainedExecutorBackend %s %s %s %s
%d").format(basename, extraOpts, driverUrl, offer.getSlaveId.getValue,
offer.getHostname, numCores)

2. spark.executor.extraJavaOptions and spark.executor.extraLibraryPath set
in SparkConf will not take effect
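A hedged illustration of the quoting hazard described above (the helper and option values are hypothetical, not the PR's fix): options containing spaces break a command built by plain string formatting unless each substituted value is shell-escaped first.

    // Naively formatting extraOpts into the command splits it at every space.
    def shellEscape(s: String): String = "\"" + s.replace("\"", "\\\"") + "\""
    val extraOpts = "-XX:+PrintGCDetails -Dfoo=some value" // hypothetical options
    val cmd = Seq(
      "./bin/spark-class",
      "org.apache.spark.executor.CoarseGrainedExecutorBackend",
      shellEscape(extraOpts)).mkString(" ")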

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/scwf/spark mesosfix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1513.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1513


commit 5b150a481a1ddb3dee4e94e14e737f2d20313f5c
Author: scwf <wangfei1.huawei.com>
Date:   2014-07-21T15:23:59Z

fix executor backend launch command in mesos mode

commit fdc3cb1efa931780f652dcb4d2cda0304ca50710
Author: scwf <wangfei1.huawei.com>
Date:   2014-07-21T15:32:09Z

fix code format






[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-07-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1513#issuecomment-49627746
  
Can one of the admins verify this patch?




[GitHub] spark pull request: Fix NPE for JsonProtocol

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1511#issuecomment-49629586
  
QA results for PR 1511:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16913/consoleFull




[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1512#issuecomment-49629675
  
QA results for PR 1512:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16912/consoleFull




[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...

2014-07-21 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1371#issuecomment-49629961
  
Jenkins, test this please




[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1371#issuecomment-49630747
  
QA tests have started for PR 1371. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16914/consoleFull




[GitHub] spark pull request: Fix flakey HiveQuerySuite test

2014-07-21 Thread aarondav
GitHub user aarondav opened a pull request:

https://github.com/apache/spark/pull/1514

Fix flakey HiveQuerySuite test

Results may not be returned in the expected order, so relax that constraint.
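A minimal sketch of the relaxation (the values are hypothetical): compare the result rows as sorted collections rather than in arrival order.

    val expected = Seq("a", "b", "c")
    val actual = Seq("c", "a", "b")          // rows may come back in any order
    assert(actual.sorted == expected.sorted) // order-insensitive comparison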

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aarondav/spark flakey

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1514.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1514


commit e5af823eef0fb42c5acbd071ddb9453534a3fd0c
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-07-21T16:58:36Z

Fix flakey HiveQuerySuite test

Result may not be returned in the expected order, so relax that constraint.






[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

2014-07-21 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1502#issuecomment-49634069
  
(This PR passed Jenkins 3 times and then failed inside HiveContext -- it's 
probably OK. I submitted https://github.com/apache/spark/pull/1514 to fix the 
flakey test.)




[GitHub] spark pull request: Fix flakey HiveQuerySuite test

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1514#issuecomment-49634232
  
QA tests have started for PR 1514. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16915/consoleFull




[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...

2014-07-21 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/1512#issuecomment-49634254
  
I need to update this so the documentation shows up properly.




[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1512#issuecomment-49635472
  
QA tests have started for PR 1512. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16916/consoleFull




[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1512#issuecomment-49636071
  
QA tests have started for PR 1512. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16917/consoleFull




[GitHub] spark pull request: SPARK-1707. Remove unnecessary 3 second sleep ...

2014-07-21 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/634#issuecomment-49643611
  
Looks good. +1, Thanks @sryza.






[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1371#issuecomment-49643677
  
QA results for PR 1371:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16914/consoleFull




[GitHub] spark pull request: [SPARK-2555] Support configuration spark.sched...

2014-07-21 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/1462#issuecomment-49643818
  
If we change the name of the config, you'll need to upmerge, as
https://github.com/apache/spark/pull/634 set some defaults on the YARN side.




[GitHub] spark pull request: SPARK-1707. Remove unnecessary 3 second sleep ...

2014-07-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/634




[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread brkyvz
GitHub user brkyvz opened a pull request:

https://github.com/apache/spark/pull/1515

[SPARK-2434][MLlib]: Warning messages that point users to original MLlib 
implementations added to Examples

[SPARK-2434][MLlib]: Warning messages that refer users to the original
MLlib implementations of some popular example machine learning algorithms have been
added, both in the comments and in the code. The following examples have been modified:
Scala:
* LocalALS
* LocalFileLR
* LocalKMeans
* LocalLR
* SparkALS
* SparkHdfsLR
* SparkKMeans
* SparkLR
Python:
 * kmeans.py
 * als.py
 * logistic_regression.py

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/brkyvz/spark SPARK-2434

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1515.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1515


commit 2cb53011985161307a1064982ae26b6fe2e69130
Author: Burak <brk...@gmail.com>
Date:   2014-07-17T01:06:52Z

SPARK-2434: Warning messages redirecting to original implementations added.

commit 17d3d836de4c3441d37afca14567cc31a90fd1cc
Author: Burak <brk...@gmail.com>
Date:   2014-07-17T17:02:32Z

SPARK-2434: Added warning messages to the naive implementations of the
example algorithms

commit 4762f3930659a6303304021b6ede23b9489bb2f6
Author: Burak <brk...@gmail.com>
Date:   2014-07-18T00:14:19Z

[SPARK-2434]: Warning messages added

commit b6c35b783c80efee9a8a9f4db64ad6a5210455d8
Author: brkyvz <brk...@gmail.com>
Date:   2014-07-21T17:34:40Z

[SPARK-2434]: Warning messages added

commit 5f84e4be6c07230afd11d255ce88ca0affc0677f
Author: brkyvz <brk...@gmail.com>
Date:   2014-07-21T18:18:00Z

[SPARK-2434]: Warning messages added






[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1515#issuecomment-49645784
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...

2014-07-21 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1371#issuecomment-49646830
  
The JVM forks one Python daemon (daemon.py), and then the daemon forks all the
workers.




[GitHub] spark pull request: Fix flakey HiveQuerySuite test

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1514#issuecomment-49647027
  
QA results for PR 1514:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16915/consoleFull




[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1460#issuecomment-49647944
  
QA tests have started for PR 1460. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16918/consoleFull




[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...

2014-07-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1371




[GitHub] spark pull request: SPARK-1680: use configs for specifying environ...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1512#issuecomment-49649190
  
QA results for PR 1512:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16917/consoleFull




[GitHub] spark pull request: [SPARK-2494] [PySpark] make hash of None consi...

2014-07-21 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1371#issuecomment-49650014
  
Ah right, that makes sense. I've merged this in now.




[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1460#discussion_r15188798
  
--- Diff: python/pyspark/rdd.py ---
@@ -168,6 +169,18 @@ def _replaceRoot(self, value):
 self._sink(1)
 
 
+def _parse_memory(s):
--- End diff --

Add a comment to this saying it returns a number in MB




[GitHub] spark pull request: SPARK-2269 Refactor mesos scheduler resourceOf...

2014-07-21 Thread tnachen
Github user tnachen commented on the pull request:

https://github.com/apache/spark/pull/1487#issuecomment-49651731
  
@pwendell The console said the test failed, but in a very unrelated way. Is
CI failing in general?




[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-07-21 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-49655673
  
@akopich Thanks for working on PLSA! This is a big feature and it
introduces many public traits/classes. Could you please summarize the public
methods? Some of them may be unnecessary to expose to end users, and we should
hide them. The other issue is the code style. Please follow the guide at
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide and
update the PR; for example:

1. remove "created by ..." comments generated by IntelliJ
2. organize imports into groups: java, scala, 3rd party, and spark (see the sketch after this list)
3. add doc for every trait/class; some only contain doc for parameters but miss the summary
4. fix indentation
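As a concrete illustration of item 2 (the specific imports are arbitrary examples), the style guide's grouping looks like:

    import java.util.Random

    import scala.collection.mutable.ArrayBuffer

    import com.google.common.io.ByteStreams // 3rd party

    import org.apache.spark.rdd.RDD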




[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1460#issuecomment-49658257
  
QA tests have started for PR 1460. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16919/consoleFull




[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1499#discussion_r15192915
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -0,0 +1,573 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import java.io._
+import java.util.Comparator
+
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable
+
+import com.google.common.io.ByteStreams
+
+import org.apache.spark.{Aggregator, SparkEnv, Logging, Partitioner}
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.storage.BlockId
+
+/**
+ * Sorts and potentially merges a number of key-value pairs of type (K, V) to produce key-combiner
+ * pairs of type (K, C). Uses a Partitioner to first group the keys into partitions, and then
+ * optionally sorts keys within each partition using a custom Comparator. Can output a single
+ * partitioned file with a different byte range for each partition, suitable for shuffle fetches.
+ *
+ * If combining is disabled, the type C must equal V -- we'll cast the objects at the end.
+ *
+ * @param aggregator optional Aggregator with combine functions to use for merging data
+ * @param partitioner optional partitioner; if given, sort by partition ID and then key
+ * @param ordering optional ordering to sort keys within each partition
+ * @param serializer serializer to use when spilling to disk
+ */
+private[spark] class ExternalSorter[K, V, C](
+    aggregator: Option[Aggregator[K, V, C]] = None,
+    partitioner: Option[Partitioner] = None,
+    ordering: Option[Ordering[K]] = None,
+    serializer: Option[Serializer] = None) extends Logging {
+
+  private val numPartitions = partitioner.map(_.numPartitions).getOrElse(1)
+  private val shouldPartition = numPartitions > 1
+
+  private val blockManager = SparkEnv.get.blockManager
+  private val diskBlockManager = blockManager.diskBlockManager
+  private val ser = Serializer.getSerializer(serializer)
+  private val serInstance = ser.newInstance()
+
+  private val conf = SparkEnv.get.conf
+  private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024
+  private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 1)
+
+  private def getPartition(key: K): Int = {
+    if (shouldPartition) partitioner.get.getPartition(key) else 0
+  }
+
+  // Data structures to store in-memory objects before we spill. Depending on whether we have an
+  // Aggregator set, we either put objects into an AppendOnlyMap where we combine them, or we
+  // store them in an array buffer.
+  var map = new SizeTrackingAppendOnlyMap[(Int, K), C]
+  var buffer = new SizeTrackingBuffer[((Int, K), C)]
+
+  // Track how many elements we've read before we try to estimate memory. Ideally we'd use
+  // map.size or buffer.size for this, but because users' Aggregators can potentially increase
+  // the size of a merged element when we add values with the same key, it's safer to track
+  // elements read from the input iterator.
+  private var elementsRead = 0L
+  private val trackMemoryThreshold = 1000
+
+  // Spilling statistics
+  private var spillCount = 0
+  private var _memoryBytesSpilled = 0L
+  private var _diskBytesSpilled = 0L
+
+  // Collective memory threshold shared across all running tasks
+  private val maxMemoryThreshold = {
+    val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.3)
+    val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
+    (Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
+  }
+
+  // A comparator for keys K that orders them within a partition to allow partial aggregation.
+  // Can be a partial ordering by hash code if a total ordering is not provided through by the
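A hedged sketch of the hash-code comparator the truncated comment above alludes to (an assumption about the approach, not the PR's exact code):

    import java.util.Comparator

    // Orders keys by hash code so that equal keys become adjacent within a
    // partition; a partial ordering only, since distinct keys may collide.
    def hashComparator[K]: Comparator[K] = new Comparator[K] {
      override def compare(a: K, b: K): Int = {
        val h1 = if (a == null) 0 else a.hashCode()
        val h2 = if (b == null) 0 else b.hashCode()
        Integer.compare(h1, h2)
      }
    }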

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1499#discussion_r15192894
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -0,0 +1,573 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import java.io._
+import java.util.Comparator
+
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable
+
+import com.google.common.io.ByteStreams
+
+import org.apache.spark.{Aggregator, SparkEnv, Logging, Partitioner}
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.storage.BlockId
+
+/**
+ * Sorts and potentially merges a number of key-value pairs of type (K, V) 
to produce key-combiner
+ * pairs of type (K, C). Uses a Partitioner to first group the keys into 
partitions, and then
+ * optionally sorts keys within each partition using a custom Comparator. 
Can output a single
+ * partitioned file with a different byte range for each partition, 
suitable for shuffle fetches.
+ *
+ * If combining is disabled, the type C must equal V -- we'll cast the 
objects at the end.
+ *
+ * @param aggregator optional Aggregator with combine functions to use for 
merging data
+ * @param partitioner optional partitioner; if given, sort by partition ID 
and then key
+ * @param ordering optional ordering to sort keys within each partition
+ * @param serializer serializer to use when spilling to disk
+ */
+private[spark] class ExternalSorter[K, V, C](
+aggregator: Option[Aggregator[K, V, C]] = None,
+partitioner: Option[Partitioner] = None,
+ordering: Option[Ordering[K]] = None,
+serializer: Option[Serializer] = None) extends Logging {
+
+  private val numPartitions = partitioner.map(_.numPartitions).getOrElse(1)
+  private val shouldPartition = numPartitions > 1
+
+  private val blockManager = SparkEnv.get.blockManager
+  private val diskBlockManager = blockManager.diskBlockManager
+  private val ser = Serializer.getSerializer(serializer)
+  private val serInstance = ser.newInstance()
+
+  private val conf = SparkEnv.get.conf
+  private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024
+  private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 1)
+
+  private def getPartition(key: K): Int = {
+if (shouldPartition) partitioner.get.getPartition(key) else 0
+  }
+
+  // Data structures to store in-memory objects before we spill. Depending 
on whether we have an
+  // Aggregator set, we either put objects into an AppendOnlyMap where we 
combine them, or we
+  // store them in an array buffer.
+  var map = new SizeTrackingAppendOnlyMap[(Int, K), C]
+  var buffer = new SizeTrackingBuffer[((Int, K), C)]
+
+  // Track how many elements we've read before we try to estimate memory. 
Ideally we'd use
+  // map.size or buffer.size for this, but because users' Aggregators can 
potentially increase
+  // the size of a merged element when we add values with the same key, 
it's safer to track
+  // elements read from the input iterator.
+  private var elementsRead = 0L
+  private val trackMemoryThreshold = 1000
+
+  // Spilling statistics
+  private var spillCount = 0
+  private var _memoryBytesSpilled = 0L
+  private var _diskBytesSpilled = 0L
+
+  // Collective memory threshold shared across all running tasks
+  private val maxMemoryThreshold = {
+val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.3)
+val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
+(Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
+  }
+
+  // A comparator for keys K that orders them within a partition to allow 
partial aggregation.
+  // Can be a partial ordering by hash code if a total ordering is not 
provided through by the
  
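
The comment cut off above refers to ordering keys by their hash codes when no
total ordering is supplied. A minimal sketch of such a comparator (an
illustration, not the PR's code):

    import java.util.Comparator

    // Orders keys by hashCode alone: equal keys always compare as 0, but some
    // unequal keys may too, so a later pass must still check real equality.
    val hashComparator: Comparator[Any] = new Comparator[Any] {
      override def compare(a: Any, b: Any): Int = {
        val h1 = if (a == null) 0 else a.hashCode()
        val h2 = if (b == null) 0 else b.hashCode()
        Integer.compare(h1, h2)
      }
    }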

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1460#issuecomment-49660273
  
QA results for PR 1460:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class AutoSerializer(FramedSerializer):
  class Merger(object):
  class MapMerger(Merger):
  class ExternalHashMapMerger(Merger):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16918/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1499#discussion_r15193928
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -0,0 +1,573 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import java.io._
+import java.util.Comparator
+
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable
+
+import com.google.common.io.ByteStreams
+
+import org.apache.spark.{Aggregator, SparkEnv, Logging, Partitioner}
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.storage.BlockId
+
+/**
+ * Sorts and potentially merges a number of key-value pairs of type (K, V) 
to produce key-combiner
+ * pairs of type (K, C). Uses a Partitioner to first group the keys into 
partitions, and then
+ * optionally sorts keys within each partition using a custom Comparator. 
Can output a single
+ * partitioned file with a different byte range for each partition, 
suitable for shuffle fetches.
+ *
+ * If combining is disabled, the type C must equal V -- we'll cast the 
objects at the end.
+ *
+ * @param aggregator optional Aggregator with combine functions to use for 
merging data
+ * @param partitioner optional partitioner; if given, sort by partition ID 
and then key
+ * @param ordering optional ordering to sort keys within each partition
+ * @param serializer serializer to use when spilling to disk
+ */
+private[spark] class ExternalSorter[K, V, C](
+aggregator: Option[Aggregator[K, V, C]] = None,
+partitioner: Option[Partitioner] = None,
+ordering: Option[Ordering[K]] = None,
+serializer: Option[Serializer] = None) extends Logging {
+
+  private val numPartitions = partitioner.map(_.numPartitions).getOrElse(1)
+  private val shouldPartition = numPartitions > 1
+
+  private val blockManager = SparkEnv.get.blockManager
+  private val diskBlockManager = blockManager.diskBlockManager
+  private val ser = Serializer.getSerializer(serializer)
+  private val serInstance = ser.newInstance()
+
+  private val conf = SparkEnv.get.conf
+  private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024
+  private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 1)
+
+  private def getPartition(key: K): Int = {
+if (shouldPartition) partitioner.get.getPartition(key) else 0
+  }
+
+  // Data structures to store in-memory objects before we spill. Depending 
on whether we have an
+  // Aggregator set, we either put objects into an AppendOnlyMap where we 
combine them, or we
+  // store them in an array buffer.
+  var map = new SizeTrackingAppendOnlyMap[(Int, K), C]
+  var buffer = new SizeTrackingBuffer[((Int, K), C)]
+
+  // Track how many elements we've read before we try to estimate memory. 
Ideally we'd use
+  // map.size or buffer.size for this, but because users' Aggregators can 
potentially increase
+  // the size of a merged element when we add values with the same key, 
it's safer to track
+  // elements read from the input iterator.
+  private var elementsRead = 0L
+  private val trackMemoryThreshold = 1000
+
+  // Spilling statistics
+  private var spillCount = 0
+  private var _memoryBytesSpilled = 0L
+  private var _diskBytesSpilled = 0L
+
+  // Collective memory threshold shared across all running tasks
+  private val maxMemoryThreshold = {
+val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.3)
+val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
+(Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
+  }
+
+  // A comparator for keys K that orders them within a partition to allow 
partial aggregation.
+  // Can be a partial ordering by hash code if a total ordering is not 
provided through by the
  
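
For context on `maxMemoryThreshold` in the quoted diff: it is a fixed fraction
of the JVM heap shared by all running tasks. A small sketch of the arithmetic,
assuming a hypothetical 1 GiB heap and the default fractions above:

    // 1 GiB * 0.3 (memoryFraction) * 0.8 (safetyFraction) ~= 246 MiB of
    // collective budget before tasks must start spilling to disk.
    val maxMemory = 1024L * 1024 * 1024            // stands in for Runtime.getRuntime.maxMemory
    val threshold = (maxMemory * 0.3 * 0.8).toLong // 257698037 bytes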

[GitHub] spark pull request: [SPARK-2555] Support configuration spark.sched...

2014-07-21 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/1462#issuecomment-49661602
  
@tgravescs I actually mentioned this race condition in the previous PR: 
https://github.com/apache/spark/pull/900#diff-for-comment-14205738 . In the 
future we should try to be more careful about merging things that have 
un-replied-to comments (I'm about to send an email to the dev list about this).

@li-zhihui if someone points out a problem in a pull request you submit, 
the expectation is that it will be fixed when you reply to the comment.  Can 
you please submit a new pull request that fixes the race condition with 
standalone mode, before we proceed with adding this functionality to Mesos mode?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-21 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15194471
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -140,14 +145,36 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
   throw new BlockException(key, s"Block manager failed to return cached value for $key!")
   }
 } else {
-  /* This RDD is to be cached in memory. In this case we cannot pass 
the computed values
+  /*
+   * This RDD is to be cached in memory. In this case we cannot pass 
the computed values
* to the BlockManager as an iterator and expect to read it back 
later. This is because
-   * we may end up dropping a partition from memory store before 
getting it back, e.g.
-   * when the entirety of the RDD does not fit in memory. */
-  val elements = new ArrayBuffer[Any]
-  elements ++= values
-  updatedBlocks ++= blockManager.put(key, elements, storageLevel, 
tellMaster = true)
-  elements.iterator.asInstanceOf[Iterator[T]]
+   * we may end up dropping a partition from memory store before 
getting it back.
+   *
+   * In addition, we must be careful to not unroll the entire 
partition in memory at once.
+   * Otherwise, we may cause an OOM exception if the JVM does not have 
enough space for this
+   * single partition. Instead, we unroll the values cautiously, 
potentially aborting and
+   * dropping the partition to disk if applicable.
+   */
+  blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) 
match {
+case Left(arr) =>
+  // We have successfully unrolled the entire partition, so cache 
it in memory
+  updatedBlocks ++=
+blockManager.putArray(key, arr, level, tellMaster = true, 
effectiveStorageLevel)
+  arr.iterator.asInstanceOf[Iterator[T]]
+case Right(it) =>
+  // There is not enough space to cache this partition in memory
+  logWarning(s"Not enough space to cache $key in memory! " +
+s"Free memory is ${blockManager.memoryStore.freeMemory} bytes.")
+  var returnValues = it.asInstanceOf[Iterator[T]]
+  if (putLevel.useDisk) {
+logWarning(s"Persisting $key to disk instead.")
+val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = 
false,
+  useOffHeap = false, deserialized = false, 
putLevel.replication)
+returnValues =
+  putInBlockManager[T](key, returnValues, level, 
updatedBlocks, Some(diskOnlyLevel))
+  }
+  returnValues
+  }
--- End diff --

Yes. The existing BM interface does not return the values you just put (for 
good reasons), but once we add it, this entire method can be simplified to a 
wrapper, and there won't be duplicate logic between here and the `get` from 
disk case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
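
The `Left`/`Right` cases in the quoted diff follow the usual Either convention
for partial success. A condensed sketch of the control flow (simplified types;
`unrollSafely` itself is the PR's):

    // Left(array): the whole partition fit in memory and can be cached;
    // Right(iterator): it did not, so the caller falls back (e.g. to disk).
    def handleUnroll[T](result: Either[Array[T], Iterator[T]]): Iterator[T] =
      result match {
        case Left(arr) => arr.iterator
        case Right(it) => it
      }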


[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1499#discussion_r15194652
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -0,0 +1,573 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import java.io._
+import java.util.Comparator
+
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable
+
+import com.google.common.io.ByteStreams
+
+import org.apache.spark.{Aggregator, SparkEnv, Logging, Partitioner}
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.storage.BlockId
+
+/**
+ * Sorts and potentially merges a number of key-value pairs of type (K, V) 
to produce key-combiner
+ * pairs of type (K, C). Uses a Partitioner to first group the keys into 
partitions, and then
+ * optionally sorts keys within each partition using a custom Comparator. 
Can output a single
+ * partitioned file with a different byte range for each partition, 
suitable for shuffle fetches.
+ *
+ * If combining is disabled, the type C must equal V -- we'll cast the 
objects at the end.
+ *
+ * @param aggregator optional Aggregator with combine functions to use for 
merging data
+ * @param partitioner optional partitioner; if given, sort by partition ID 
and then key
+ * @param ordering optional ordering to sort keys within each partition
+ * @param serializer serializer to use when spilling to disk
+ */
+private[spark] class ExternalSorter[K, V, C](
+aggregator: Option[Aggregator[K, V, C]] = None,
+partitioner: Option[Partitioner] = None,
+ordering: Option[Ordering[K]] = None,
+serializer: Option[Serializer] = None) extends Logging {
+
+  private val numPartitions = partitioner.map(_.numPartitions).getOrElse(1)
+  private val shouldPartition = numPartitions > 1
+
+  private val blockManager = SparkEnv.get.blockManager
+  private val diskBlockManager = blockManager.diskBlockManager
+  private val ser = Serializer.getSerializer(serializer)
+  private val serInstance = ser.newInstance()
+
+  private val conf = SparkEnv.get.conf
+  private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024
+  private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 1)
+
+  private def getPartition(key: K): Int = {
+if (shouldPartition) partitioner.get.getPartition(key) else 0
+  }
+
+  // Data structures to store in-memory objects before we spill. Depending 
on whether we have an
+  // Aggregator set, we either put objects into an AppendOnlyMap where we 
combine them, or we
+  // store them in an array buffer.
+  var map = new SizeTrackingAppendOnlyMap[(Int, K), C]
+  var buffer = new SizeTrackingBuffer[((Int, K), C)]
+
+  // Track how many elements we've read before we try to estimate memory. 
Ideally we'd use
+  // map.size or buffer.size for this, but because users' Aggregators can 
potentially increase
+  // the size of a merged element when we add values with the same key, 
it's safer to track
+  // elements read from the input iterator.
+  private var elementsRead = 0L
+  private val trackMemoryThreshold = 1000
+
+  // Spilling statistics
+  private var spillCount = 0
+  private var _memoryBytesSpilled = 0L
+  private var _diskBytesSpilled = 0L
+
+  // Collective memory threshold shared across all running tasks
+  private val maxMemoryThreshold = {
+val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.3)
+val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
+(Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
+  }
+
+  // A comparator for keys K that orders them within a partition to allow 
partial aggregation.
+  // Can be a partial ordering by hash code if a total ordering is not 
provided through by the
  

[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-21 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15194738
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/SizeTrackingAppendOnlyBuffer.scala
 ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.reflect.ClassTag
+
+/**
+ * An append-only buffer that keeps track of its estimated size in bytes.
+ */
+private[spark] class SizeTrackingAppendOnlyBuffer[T: ClassTag]
+  extends PrimitiveVector[T]
+  with SizeTracker {
+
+  override def +=(value: T): Unit = {
+super.+=(value)
+super.afterUpdate()
+  }
+
+  override def resize(newLength: Int): PrimitiveVector[T] = {
+super.resize(newLength)
+resetSamples()
+this
+  }
+
+  override def array: Array[T] = {
--- End diff --

It was called `array` because this overrides `PrimitiveVector#array`, which
returns the untrimmed version. It might be a little confusing for this class to
have both a `toArray` method and an `array` method that do different things.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
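
To illustrate the `array` vs `toArray` distinction discussed above: a growable
vector's backing array is usually longer than its logical size, so the two
accessors return different things. A simplified sketch (not Spark's
`PrimitiveVector`):

    import scala.reflect.ClassTag

    class MiniVector[T: ClassTag] {
      private var _size = 0
      private var data = new Array[T](8)

      def +=(value: T): Unit = {
        if (_size == data.length) {          // double the backing array when full
          val bigger = new Array[T](data.length * 2)
          Array.copy(data, 0, bigger, 0, _size)
          data = bigger
        }
        data(_size) = value
        _size += 1
      }

      def array: Array[T] = data               // untrimmed backing array (length >= _size)
      def toArray: Array[T] = data.take(_size) // trimmed copy of the live elements
    }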


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-21 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15195205
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -463,16 +463,15 @@ private[spark] class BlockManager(
   val values = dataDeserialize(blockId, bytes)
   if (level.deserialized) {
 // Cache the values before returning them
-// TODO: Consider creating a putValues that also takes in 
a iterator?
-val valuesBuffer = new ArrayBuffer[Any]
-valuesBuffer ++= values
-memoryStore.putValues(blockId, valuesBuffer, level, 
returnValues = true).data
-  match {
-case Left(values2) =>
-  return Some(new BlockResult(values2, DataReadMethod.Disk, info.size))
-case _ =>
-  throw new SparkException("Memory store did not return back an iterator")
-  }
+val putResult = memoryStore.putValues(
+  blockId, values, level, returnValues = true, allowPersistToDisk = false)
+putResult.data match {
+  case Left(it) =>
+return Some(new BlockResult(it, DataReadMethod.Disk, info.size))
+  case _ =>
+// This only happens if we dropped the values back to disk (which is never)
+throw new SparkException("Memory store did not return an iterator!")
+}
--- End diff --

Ok, I added it in my latest commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1499#discussion_r15195282
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -0,0 +1,573 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import java.io._
+import java.util.Comparator
+
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable
+
+import com.google.common.io.ByteStreams
+
+import org.apache.spark.{Aggregator, SparkEnv, Logging, Partitioner}
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.storage.BlockId
+
+/**
+ * Sorts and potentially merges a number of key-value pairs of type (K, V) 
to produce key-combiner
+ * pairs of type (K, C). Uses a Partitioner to first group the keys into 
partitions, and then
+ * optionally sorts keys within each partition using a custom Comparator. 
Can output a single
+ * partitioned file with a different byte range for each partition, 
suitable for shuffle fetches.
+ *
+ * If combining is disabled, the type C must equal V -- we'll cast the 
objects at the end.
+ *
+ * @param aggregator optional Aggregator with combine functions to use for 
merging data
+ * @param partitioner optional partitioner; if given, sort by partition ID 
and then key
+ * @param ordering optional ordering to sort keys within each partition
+ * @param serializer serializer to use when spilling to disk
+ */
+private[spark] class ExternalSorter[K, V, C](
+aggregator: Option[Aggregator[K, V, C]] = None,
+partitioner: Option[Partitioner] = None,
+ordering: Option[Ordering[K]] = None,
+serializer: Option[Serializer] = None) extends Logging {
+
+  private val numPartitions = partitioner.map(_.numPartitions).getOrElse(1)
+  private val shouldPartition = numPartitions > 1
+
+  private val blockManager = SparkEnv.get.blockManager
+  private val diskBlockManager = blockManager.diskBlockManager
+  private val ser = Serializer.getSerializer(serializer)
+  private val serInstance = ser.newInstance()
+
+  private val conf = SparkEnv.get.conf
+  private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024
+  private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 1)
+
+  private def getPartition(key: K): Int = {
+if (shouldPartition) partitioner.get.getPartition(key) else 0
+  }
+
+  // Data structures to store in-memory objects before we spill. Depending 
on whether we have an
+  // Aggregator set, we either put objects into an AppendOnlyMap where we 
combine them, or we
+  // store them in an array buffer.
+  var map = new SizeTrackingAppendOnlyMap[(Int, K), C]
+  var buffer = new SizeTrackingBuffer[((Int, K), C)]
+
+  // Track how many elements we've read before we try to estimate memory. 
Ideally we'd use
+  // map.size or buffer.size for this, but because users' Aggregators can 
potentially increase
+  // the size of a merged element when we add values with the same key, 
it's safer to track
+  // elements read from the input iterator.
+  private var elementsRead = 0L
+  private val trackMemoryThreshold = 1000
+
+  // Spilling statistics
+  private var spillCount = 0
+  private var _memoryBytesSpilled = 0L
+  private var _diskBytesSpilled = 0L
+
+  // Collective memory threshold shared across all running tasks
+  private val maxMemoryThreshold = {
+val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.3)
+val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
+(Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
+  }
+
+  // A comparator for keys K that orders them within a partition to allow 
partial aggregation.
+  // Can be a partial ordering by hash code if a total ordering is not 
provided through by the
  

[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-21 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15195362
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -561,13 +562,14 @@ private[spark] class BlockManager(
 iter
   }
 
-  def put(
+  def putIterator(
--- End diff --

This is renamed to avoid method signature conflicts with the new 
`putArray`, since this PR introduced a new optional parameter 
`effectiveStorageLevel`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
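
The rename matters because Scala rejects default arguments on more than one
overload of the same method name. A minimal sketch of the conflict
(hypothetical signatures, not BlockManager's):

    class Store {
      // def put(id: String, values: Iterator[Any], tellMaster: Boolean = true): Unit = ()
      // def put(id: String, values: Array[Any], tellMaster: Boolean = true): Unit = ()
      //   ^ rejected: "multiple overloaded alternatives of put define default arguments"
      def putIterator(id: String, values: Iterator[Any], tellMaster: Boolean = true): Unit = ()
      def putArray(id: String, values: Array[Any], tellMaster: Boolean = true): Unit = ()
    }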


[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...

2014-07-21 Thread GregOwen
Github user GregOwen commented on the pull request:

https://github.com/apache/spark/pull/1364#issuecomment-49664457
  
Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1499#discussion_r15196170
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -0,0 +1,573 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import java.io._
+import java.util.Comparator
+
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable
+
+import com.google.common.io.ByteStreams
+
+import org.apache.spark.{Aggregator, SparkEnv, Logging, Partitioner}
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.storage.BlockId
+
+/**
+ * Sorts and potentially merges a number of key-value pairs of type (K, V) 
to produce key-combiner
+ * pairs of type (K, C). Uses a Partitioner to first group the keys into 
partitions, and then
+ * optionally sorts keys within each partition using a custom Comparator. 
Can output a single
+ * partitioned file with a different byte range for each partition, 
suitable for shuffle fetches.
+ *
+ * If combining is disabled, the type C must equal V -- we'll cast the 
objects at the end.
+ *
+ * @param aggregator optional Aggregator with combine functions to use for 
merging data
+ * @param partitioner optional partitioner; if given, sort by partition ID 
and then key
+ * @param ordering optional ordering to sort keys within each partition
+ * @param serializer serializer to use when spilling to disk
+ */
+private[spark] class ExternalSorter[K, V, C](
+aggregator: Option[Aggregator[K, V, C]] = None,
+partitioner: Option[Partitioner] = None,
+ordering: Option[Ordering[K]] = None,
+serializer: Option[Serializer] = None) extends Logging {
+
+  private val numPartitions = partitioner.map(_.numPartitions).getOrElse(1)
+  private val shouldPartition = numPartitions > 1
+
+  private val blockManager = SparkEnv.get.blockManager
+  private val diskBlockManager = blockManager.diskBlockManager
+  private val ser = Serializer.getSerializer(serializer)
+  private val serInstance = ser.newInstance()
+
+  private val conf = SparkEnv.get.conf
+  private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024
+  private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 1)
+
+  private def getPartition(key: K): Int = {
+if (shouldPartition) partitioner.get.getPartition(key) else 0
+  }
+
+  // Data structures to store in-memory objects before we spill. Depending 
on whether we have an
+  // Aggregator set, we either put objects into an AppendOnlyMap where we 
combine them, or we
+  // store them in an array buffer.
+  var map = new SizeTrackingAppendOnlyMap[(Int, K), C]
+  var buffer = new SizeTrackingBuffer[((Int, K), C)]
+
+  // Track how many elements we've read before we try to estimate memory. 
Ideally we'd use
+  // map.size or buffer.size for this, but because users' Aggregators can 
potentially increase
+  // the size of a merged element when we add values with the same key, 
it's safer to track
+  // elements read from the input iterator.
+  private var elementsRead = 0L
+  private val trackMemoryThreshold = 1000
+
+  // Spilling statistics
+  private var spillCount = 0
+  private var _memoryBytesSpilled = 0L
+  private var _diskBytesSpilled = 0L
+
+  // Collective memory threshold shared across all running tasks
+  private val maxMemoryThreshold = {
+val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.3)
+val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
+(Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
+  }
+
+  // A comparator for keys K that orders them within a partition to allow 
partial aggregation.
+  // Can be a partial ordering by hash code if a total ordering is not 
provided through by the
  

[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1364#issuecomment-49665190
  
QA tests have started for PR 1364. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16921/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1515#issuecomment-49665253
  
Jenkins, add to whitelist.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1515#issuecomment-49665265
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1364#issuecomment-49665245
  
QA results for PR 1364:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16921/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1515#issuecomment-49665852
  
QA tests have started for PR 1515. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16922/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1515#issuecomment-49665933
  
QA results for PR 1515:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16922/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1515#discussion_r15196712
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala 
---
@@ -117,7 +129,7 @@ object LocalALS {
   }
  case _ => {
System.err.println("Usage: LocalALS M U F iters")
-System.exit(1)
--- End diff --

Why is this removed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1515#discussion_r15196701
  
--- Diff: examples/src/main/python/als.py ---
@@ -17,6 +17,8 @@
 
 
 This example requires numpy (http://www.numpy.org/)
+This is an example implementation of ALS for learning how to use Spark. 
Please refer to
--- End diff --

It may be better to move this sentence before "This example requires numpy".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1515#discussion_r15196704
  
--- Diff: examples/src/main/python/als.py ---
@@ -49,6 +51,9 @@ def update(i, vec, mat, ratings):
 
 
 if __name__ == "__main__":
+
+print "WARNING: THIS IS A NAIVE IMPLEMENTATION OF ALS AND IS GIVEN AS AN EXAMPLE!"
--- End diff --

`print >> sys.stderr, "WARNING"` (output to stderr). This block should be
under the `Usage` block because `print` is part of the execution, which
should come after the doc.

Only capitalizing `WARNING` or shorter `WARN` should be sufficient. It is 
hard to read with all uppercase letters.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1515#discussion_r15196708
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala 
---
@@ -24,7 +24,8 @@ import cern.colt.matrix.linalg._
 import cern.jet.math._
 
 /**
- * Alternating least squares matrix factorization.
+ * Alternating least squares matrix factorization. This is an example 
implementation for learning how to use Spark.
--- End diff --

Line is too long. Please make it under 100 characters.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1515#discussion_r15196717
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/LocalFileLR.scala ---
@@ -21,6 +21,10 @@ import java.util.Random
 
 import breeze.linalg.{Vector, DenseVector}
 
+/**
+ * Logistic regression based classification. This is an example 
implementation for learning how to use Spark.
--- End diff --

Line too wide. The default setting in IntelliJ uses 120, but in Spark we use
100. You can adjust the number in settings.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1515#discussion_r15196747
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala 
---
@@ -26,7 +26,8 @@ import cern.jet.math._
 import org.apache.spark._
 
 /**
- * Alternating least squares matrix factorization.
+ * Alternating least squares matrix factorization. This is an example 
implementation for learning how to use Spark.
--- End diff --

ditto: too long


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2434][MLlib]: Warning messages that poi...

2014-07-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1515#discussion_r15196775
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/SparkLR.scala 
---
@@ -47,12 +48,23 @@ object SparkLR {
 Array.tabulate(N)(generatePoint)
   }
 
+  def showWarning() {
+System.err.println(
+  """WARNING: THIS IS A NAIVE IMPLEMENTATION OF LOGISTIC REGRESSION AND IS GIVEN AS AN EXAMPLE!
+|PLEASE USE THE LogisticRegression METHOD FOUND IN org.apache.spark.mllib.classification FOR
+|MORE CONVENTIONAL USE
+  """.stripMargin)
+  }
+
   def main(args: Array[String]) {
+showWarning()
+
 val sparkConf = new SparkConf().setAppName("SparkLR")
 val sc = new SparkContext(sparkConf)
 val numSlices = if (args.length > 0) args(0).toInt else 2
 val points = sc.parallelize(generateData, numSlices).cache()
 
+
--- End diff --

remove extra empty line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
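
For reference, `stripMargin` in the quoted diff removes everything up to and
including the leading `|` on each line of a multi-line string literal. A tiny
sketch:

    val warning =
      """WARNING: first line
        |second line""".stripMargin
    // warning == "WARNING: first line\nsecond line"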


[GitHub] spark pull request: [WIP][SPARK-2454] Do not assume drivers and ex...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1472#issuecomment-49666445
  
QA tests have started for PR 1472. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16923/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2567] Resubmitted stage sometimes remai...

2014-07-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1516#issuecomment-49666932
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2567] Resubmitted stage sometimes remai...

2014-07-21 Thread tsudukim
GitHub user tsudukim opened a pull request:

https://github.com/apache/spark/pull/1516

[SPARK-2567] Resubmitted stage sometimes remains as active stage in the web 
UI

Moved the line that posts SparkListenerStageSubmitted to after the checks of
task size and serializability.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tsudukim/spark feature/SPARK-2567

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1516.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1516


commit 79f4f98f5d08b7f52bb0216f2b412d959e64ad89
Author: Masayoshi TSUZUKI tsudu...@oss.nttdata.co.jp
Date:   2014-07-21T21:05:42Z

[SPARK-2567] Resubmitted stage sometimes remains as active stage in the web 
UI

Moved the line that posts SparkListenerStageSubmitted to after the checks of
task size and serializability.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1505#discussion_r15197312
  
--- Diff: python/pyspark/statcounter.py ---
@@ -124,5 +125,5 @@ def sampleStdev(self):
 return math.sqrt(self.sampleVariance())
 
 def __repr__(self):
-return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" % (self.count(), self.mean(), self.stdev(), self.max(), self.min())
-
+return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" %
--- End diff --

Here we need a `\` and a tab.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1505#discussion_r15197389
  
--- Diff: python/pyspark/statcounter.py ---
@@ -124,5 +125,5 @@ def sampleStdev(self):
 return math.sqrt(self.sampleVariance())
 
 def __repr__(self):
-return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" % (self.count(), self.mean(), self.stdev(), self.max(), self.min())
-
+return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" %
+(self.count(), self.mean(), self.stdev(), self.max(), self.min())
--- End diff --

indent


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1505#discussion_r15197383
  
--- Diff: python/pyspark/statcounter.py ---
@@ -124,5 +125,5 @@ def sampleStdev(self):
 return math.sqrt(self.sampleVariance())
 
 def __repr__(self):
-return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" % (self.count(), self.mean(), self.stdev(), self.max(), self.min())
-
+return "(count: %s, mean: %s, stdev: %s, max: %s, min: %s)" %
--- End diff --

Need \


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Fix flakey HiveQuerySuite test

2014-07-21 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1514#issuecomment-49667765
  
Thanks for the fix. Looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1505#discussion_r15197548
  
--- Diff: python/pyspark/shell.py ---
@@ -35,7 +35,8 @@
 from pyspark.storagelevel import StorageLevel
 
 # this is the equivalent of ADD_JARS
-add_files = os.environ.get("ADD_FILES").split(',') if os.environ.get("ADD_FILES") is not None else None
+add_files = (
+os.environ.get("ADD_FILES").split(',') if os.environ.get("ADD_FILES") is not None else None)
--- End diff --

It's better to break this line before the `if`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1505#discussion_r15197618
  
--- Diff: python/pyspark/serializers.py ---
@@ -252,18 +251,20 @@ def load_stream(self, stream):
 yield pair
 
 def __eq__(self, other):
-return isinstance(other, PairDeserializer) and \
-   self.key_ser == other.key_ser and self.val_ser == 
other.val_ser
+return isinstance(other, PairDeserializer) and
+self.key_ser == other.key_ser and self.val_ser == other.val_ser
--- End diff --

??


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1505#discussion_r15197632
  
--- Diff: python/pyspark/serializers.py ---
@@ -229,8 +228,8 @@ def load_stream(self, stream):
 yield pair
 
 def __eq__(self, other):
-return isinstance(other, CartesianDeserializer) and \
-   self.key_ser == other.key_ser and self.val_ser == 
other.val_ser
+return isinstance(other, CartesianDeserializer) and
+self.key_ser == other.key_ser and self.val_ser == other.val_ser
--- End diff --

??


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [YARN][SPARK-2606]:In some cases,the spark UI ...

2014-07-21 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1501#issuecomment-49668001
  
I see, the environment variable / system property for `uiRoot` may not have 
been set yet by the time we load the page. LGTM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1505#discussion_r15197643
  
--- Diff: python/pyspark/serializers.py ---
@@ -197,8 +196,8 @@ def _load_stream_without_unbatching(self, stream):
 return self.serializer.load_stream(stream)
 
 def __eq__(self, other):
-return isinstance(other, BatchedSerializer) and \
-   other.serializer == self.serializer
+return isinstance(other, BatchedSerializer) and
+other.serializer == self.serializer
--- End diff --

??


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2609] Log thread ID when spilling Exter...

2014-07-21 Thread andrewor14
GitHub user andrewor14 opened a pull request:

https://github.com/apache/spark/pull/1517

[SPARK-2609] Log thread ID when spilling ExternalAppendOnlyMap

It's useful to know whether one thread is constantly spilling or multiple 
threads are spilling relatively infrequently.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewor14/spark external-log

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1517.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1517


commit 90e48bb52ac52d79e8d1a2ffcba02c9852d85dfc
Author: Andrew Or andrewo...@gmail.com
Date:   2014-07-21T21:14:27Z

Log thread ID when spilling




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
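
A minimal sketch of what such a spill log line could look like (the message
text here is assumed, not the PR's):

    // Tagging the message with the thread id shows whether one thread spills
    // constantly or many threads spill occasionally.
    val threadId = Thread.currentThread().getId
    println(s"Task thread $threadId spilling in-memory map to disk")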


[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

2014-07-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1504#discussion_r15197742
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala 
---
@@ -239,9 +239,9 @@ private[sql] object JsonRDD extends Logging {
   // .map(identity) is used as a workaround of non-serializable Map
   // generated by .mapValues.
   // This issue is documented at 
https://issues.scala-lang.org/browse/SI-7005
-  map.toMap.mapValues(scalafy).map(identity)
+  JMapWrapper(map).mapValues(scalafy).map(identity)
--- End diff --

Should we be using the `.asScala` / `.asJava` methods in
[`JavaConverters`](http://www.scala-lang.org/api/2.10.2/index.html#scala.collection.JavaConverters$)
instead of creating these classes manually? It seems like that's more robust
to changes in the Scala library, and they handle cases like going back and
forth efficiently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
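
A short sketch of the `JavaConverters` approach suggested above, converting a
`java.util.Map` without constructing wrapper classes by hand:

    import scala.collection.JavaConverters._

    val jmap = new java.util.HashMap[String, Int]()
    jmap.put("answer", 42)

    // asScala wraps the Java map in place; converting back with asJava
    // unwraps to the original object rather than re-wrapping it.
    val smap: scala.collection.mutable.Map[String, Int] = jmap.asScala
    println((smap.asJava: AnyRef) eq jmap)  // true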


[GitHub] spark pull request: [SPARK-2609] Log thread ID when spilling Exter...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1517#issuecomment-49668361
  
QA tests have started for PR 1517. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16924/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2470] PEP8 fixes to PySpark

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1505#discussion_r15197750
  
--- Diff: python/pyspark/context.py ---
@@ -192,15 +191,19 @@ def _ensure_initialized(cls, instance=None, 
gateway=None):
 SparkContext._writeToFile = 
SparkContext._jvm.PythonRDD.writeToFile
 
 if instance:
-if SparkContext._active_spark_context and 
SparkContext._active_spark_context != instance:
+if SparkContext._active_spark_context and
+SparkContext._active_spark_context != instance:
--- End diff --

??


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

