[GitHub] spark pull request: put 'curRequestSize = 0' after 'logDebug' it

2014-07-19 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1477#issuecomment-49501450
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Fixed the number of worker thread

2014-07-19 Thread fireflyc
Github user fireflyc commented on the pull request:

https://github.com/apache/spark/pull/1485#issuecomment-49501533
  
My program is a Spark Streaming job running on Hadoop YARN; it processes a user click 
stream.
From reading the code, how are the number of worker threads and the number of blocks related?




[GitHub] spark pull request: put 'curRequestSize = 0' after 'logDebug' it

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1477#issuecomment-49501569
  
QA tests have started for PR 1477. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16842/consoleFull




[GitHub] spark pull request: Improve scheduler delay tooltip.

2014-07-19 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/1488#issuecomment-49501577
  
Jenkins, retest this please




[GitHub] spark pull request: Improve scheduler delay tooltip.

2014-07-19 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/1488#issuecomment-49501586
  
Jenkins, retest this *pretty* please ?




[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-19 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1452#issuecomment-49501642
  
Thanks for taking a look. I'm merging this one as is, and will submit a 
small PR to fix the issues. 




[GitHub] spark pull request: Improve scheduler delay tooltip.

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1488#issuecomment-49501654
  
QA tests have started for PR 1488. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16843/consoleFull




[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1452#discussion_r15142372
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala ---
@@ -17,134 +17,68 @@
 
 package org.apache.spark.scheduler
 
-import scala.language.existentials
+import java.nio.ByteBuffer
 
 import java.io._
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
-
-import scala.collection.mutable.HashMap
 
 import org.apache.spark._
-import org.apache.spark.rdd.{RDD, RDDCheckpointData}
-
-private[spark] object ResultTask {
-
-  // A simple map between the stage id to the serialized byte array of a task.
-  // Served as a cache for task serialization because serialization can be
-  // expensive on the master node if it needs to launch thousands of tasks.
-  private val serializedInfoCache = new HashMap[Int, Array[Byte]]
-
-  def serializeInfo(stageId: Int, rdd: RDD[_], func: (TaskContext, Iterator[_]) => _): Array[Byte] =
-  {
-synchronized {
-  val old = serializedInfoCache.get(stageId).orNull
-  if (old != null) {
-old
-  } else {
-val out = new ByteArrayOutputStream
-val ser = SparkEnv.get.closureSerializer.newInstance()
-val objOut = ser.serializeStream(new GZIPOutputStream(out))
-objOut.writeObject(rdd)
-objOut.writeObject(func)
-objOut.close()
-val bytes = out.toByteArray
-serializedInfoCache.put(stageId, bytes)
-bytes
-  }
-}
-  }
-
-  def deserializeInfo(stageId: Int, bytes: Array[Byte]): (RDD[_], (TaskContext, Iterator[_]) => _) =
-  {
-val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
-val ser = SparkEnv.get.closureSerializer.newInstance()
-val objIn = ser.deserializeStream(in)
-val rdd = objIn.readObject().asInstanceOf[RDD[_]]
-val func = objIn.readObject().asInstanceOf[(TaskContext, Iterator[_]) => _]
-(rdd, func)
-  }
-
-  def removeStage(stageId: Int) {
-serializedInfoCache.remove(stageId)
-  }
-
-  def clearCache() {
-synchronized {
-  serializedInfoCache.clear()
-}
-  }
-}
-
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.rdd.RDD
 
 /**
  * A task that sends back the output to the driver application.
  *
- * See [[org.apache.spark.scheduler.Task]] for more information.
+ * See [[Task]] for more information.
  *
  * @param stageId id of the stage this task belongs to
- * @param rdd input to func
+ * @param rddBinary broadcast version of the serialized RDD
  * @param func a function to apply on a partition of the RDD
- * @param _partitionId index of the number in the RDD
+ * @param partition partition of the RDD this task is associated with
  * @param locs preferred task execution locations for locality scheduling
 * @param outputId index of the task in this job (a job can launch tasks on only a subset of the
  * input RDD's partitions).
  */
 private[spark] class ResultTask[T, U](
 stageId: Int,
-var rdd: RDD[T],
-var func: (TaskContext, Iterator[T]) => U,
-_partitionId: Int,
+val rddBinary: Broadcast[Array[Byte]],
+val func: (TaskContext, Iterator[T]) => U,
+val partition: Partition,
 @transient locs: Seq[TaskLocation],
-var outputId: Int)
-  extends Task[U](stageId, _partitionId) with Externalizable {
-
-  def this() = this(0, null, null, 0, null, 0)
-
-  var split = if (rdd == null) null else rdd.partitions(partitionId)
+val outputId: Int)
+  extends Task[U](stageId, partition.index) with Serializable {
--- End diff --

@mateiz and I looked and it seems so.




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-19 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49501954
  
As for benchmarks, the micro-benchmark code that comes with #758 may be helpful. 
And I feel that partitioning support for Parquet should be considered together 
with the refactoring @yhuai suggested.




[GitHub] spark pull request: SPARK-2372 [MLLIB] Grouped Optimization/Learni...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1292#issuecomment-49502313
  
QA tests have started for PR 1292. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16844/consoleFull




[GitHub] spark pull request: [SPARK-2495][MLLIB] remove private[mllib] from...

2014-07-19 Thread mengxr
GitHub user mengxr opened a pull request:

https://github.com/apache/spark/pull/1492

[SPARK-2495][MLLIB] remove private[mllib] from linear models' constructors

This is part of SPARK-2495 to allow users to construct linear models manually.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mengxr/spark public-constructor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1492.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1492


commit a48b766dd6c19e981e4af41f27abc8163d761083
Author: Xiangrui Meng m...@databricks.com
Date:   2014-07-19T08:20:44Z

remove private[mllib] from linear models' constructors






[GitHub] spark pull request: put 'curRequestSize = 0' after 'logDebug' it

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1477#issuecomment-49503177
  
QA results for PR 1477:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16842/consoleFull




[GitHub] spark pull request: [SPARK-2495][MLLIB] remove private[mllib] from...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1492#issuecomment-49503225
  
QA tests have started for PR 1492. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16845/consoleFull




[GitHub] spark pull request: put 'curRequestSize = 0' after 'logDebug' it

2014-07-19 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1477#issuecomment-49503209
  
Thanks. Merging in master.




[GitHub] spark pull request: put 'curRequestSize = 0' after 'logDebug' it

2014-07-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1477




[GitHub] spark pull request: Improve scheduler delay tooltip.

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1488#issuecomment-49503309
  
QA results for PR 1488:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16843/consoleFull




[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...

2014-07-19 Thread mengxr
GitHub user mengxr opened a pull request:

https://github.com/apache/spark/pull/1493

[SPARK-2552][MLLIB] stabilize logistic function in pyspark

to avoid overflow in `exp(x)` if `x` is large.
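
A common way to avoid the overflow is to branch on the sign of `x` so that `exp` is only ever called with a non-positive argument. A minimal Scala sketch of that trick (illustrative only; the actual fix is in PySpark, and the object name here is hypothetical):

```scala
object StableLogisticSketch {
  // For x >= 0 use 1 / (1 + exp(-x)); for x < 0 use the algebraically equivalent
  // exp(x) / (1 + exp(x)). Either way the argument passed to exp is <= 0, so the
  // intermediate value never overflows (in Python, math.exp raises OverflowError
  // for large inputs; on the JVM it silently becomes Infinity).
  def logistic(x: Double): Double = {
    if (x >= 0) {
      1.0 / (1.0 + math.exp(-x))
    } else {
      val ex = math.exp(x)
      ex / (1.0 + ex)
    }
  }

  def main(args: Array[String]): Unit = {
    println(logistic(1000.0))   // ~1.0
    println(logistic(-1000.0))  // ~0.0
  }
}
```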

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mengxr/spark py-logistic

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1493.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1493


commit 259e863d96bcc54fc3f74e41b35e4c7494d0476f
Author: Xiangrui Meng m...@databricks.com
Date:   2014-07-19T08:51:38Z

stabilize logistic function in pyspark






[GitHub] spark pull request: Required AM memory is amMem, not args.amMem...

2014-07-19 Thread maji2014
Github user maji2014 closed the pull request at:

https://github.com/apache/spark/pull/1457




[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1493#issuecomment-49504244
  
QA tests have started for PR 1493. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16847/consoleFull




[GitHub] spark pull request: Required AM memory is amMem, not args.amMem...

2014-07-19 Thread maji2014
GitHub user maji2014 opened a pull request:

https://github.com/apache/spark/pull/1494

Required AM memory is amMem, not args.amMemory

The message "ERROR yarn.Client: Required AM memory (1024) is above the max threshold 
(1048) of this cluster" appears if this code is not changed. Obviously, 1024 is 
less than 1048, so the message should report amMem (the value actually compared against the threshold) rather than args.amMemory.
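
To make the mismatch concrete, here is a hedged sketch of the kind of check involved; the variable names and numbers are assumptions for illustration, not the actual org.apache.spark.deploy.yarn.Client code:

```scala
object AmMemoryCheckSketch {
  def main(args: Array[String]): Unit = {
    // Assumed values: args.amMemory is what the user requested; an overhead is added
    // before comparing against the cluster's maximum container size.
    val requestedAmMemory = 1024                     // args.amMemory (MB)
    val amMemoryOverhead = 384                       // assumed overhead (MB)
    val amMem = requestedAmMemory + amMemoryOverhead // the value actually checked
    val maxMem = 1048                                // max threshold of this cluster (MB)

    if (amMem > maxMem) {
      // Report amMem (1408 here), not requestedAmMemory (1024); otherwise the message
      // contradicts itself ("1024 is above the max threshold 1048").
      println(s"Required AM memory ($amMem) is above the max threshold ($maxMem) of this cluster")
    }
  }
}
```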

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maji2014/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1494.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1494


commit b0f66400990befead3a6aaa1172112bd090272e8
Author: derek ma ma...@asiainfo-linkage.com
Date:   2014-07-19T10:53:08Z

Required AM memory is amMem, not args.amMemory

The message "ERROR yarn.Client: Required AM memory (1024) is above the max threshold 
(1048) of this cluster" appears if this code is not changed. Obviously, 1024 is 
less than 1048, so the message should report amMem (the value actually compared against the threshold) rather than args.amMemory.






[GitHub] spark pull request: Required AM memory is amMem, not args.amMem...

2014-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1494#issuecomment-49506353
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-49508061
  
QA tests have started for PR 940. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16848/consoleFull




[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()

2014-07-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1442




[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()

2014-07-19 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1442#issuecomment-49513281
  
Thanks!  I've merged this into master.




[GitHub] spark pull request: SPARK-1097: Do not introduce deadlock while fi...

2014-07-19 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1409#issuecomment-49513420
  
@aarondav  @pwendell 
In my tests, it seems that there is still a deadlock.

A possible reason can be found here: [Executor.scala#L189](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L189)
 




[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...

2014-07-19 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1493#issuecomment-49515870
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-2495][MLLIB] remove private[mllib] from...

2014-07-19 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1492#issuecomment-49515873
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-2495][MLLIB] remove private[mllib] from...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1492#issuecomment-49515905
  
QA tests have started for PR 1492. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16850/consoleFull




[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1493#issuecomment-49515915
  
QA tests have started for PR 1493. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16849/consoleFull




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-49516746
  
QA tests have started for PR 1165. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16851/consoleFull




[GitHub] spark pull request: Fixed the number of worker thread

2014-07-19 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1485#issuecomment-49526386
  
@fireflyc Spark should not be scheduling more than N concurrent tasks on an 
Executor. It appears that the tasks may be returning success but then not 
actually returning the thread to the thread pool. 

This is itself a bug -- could you run jstack on your Executor process to 
see where the threads are stuck?

Perhaps new tasks are just starting before the old threads finish cleaning 
up, and thus this solution is the right one, but I'd like to find out exactly 
why.




[GitHub] spark pull request: [WIP][SPARK-2595:]The driver run garbage colle...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1387#issuecomment-49527449
  
QA tests have started for PR 1387. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16852/consoleFull




[GitHub] spark pull request: [WIP][SPARK-2491]: Fix When an OOM is thrown,t...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1482#issuecomment-49527553
  
QA tests have started for PR 1482. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16853/consoleFull




[GitHub] spark pull request: [WIP][SPARK-2491]: Fix When an OOM is thrown,t...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1482#issuecomment-49527623
  
QA results for PR 1482:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16853/consoleFull




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-19 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15145476
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/SizeTrackingAppendOnlyBuffer.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.reflect.ClassTag
+
+/**
+ * An append-only buffer that keeps track of its estimated size in bytes.
+ */
+private[spark] class SizeTrackingAppendOnlyBuffer[T: ClassTag]
--- End diff --

Better to call this SizeTrackingVector




[GitHub] spark pull request: [WIP][SPARK-2491]: Fix When an OOM is thrown,t...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1482#issuecomment-49527896
  
QA tests have started for PR 1482. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16854/consoleFull




[GitHub] spark pull request: [SPARK-2583] ConnectionManager cannot distingu...

2014-07-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1490#discussion_r15145612
  
--- Diff: core/src/main/scala/org/apache/spark/network/MessageChunkHeader.scala ---
@@ -41,6 +42,13 @@ private[spark] class MessageChunkHeader(
   putInt(totalSize).
   putInt(chunkSize).
   putInt(other).
+  put{
--- End diff --

How about 
```scala
put(if (hasError) 1.asInstanceOf[Byte] else 0.asInstanceOf[Byte])
```




[GitHub] spark pull request: [SPARK-2583] ConnectionManager cannot distingu...

2014-07-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1490#discussion_r15145614
  
--- Diff: core/src/main/scala/org/apache/spark/network/MessageChunkHeader.scala ---
@@ -67,13 +75,20 @@ private[spark] object MessageChunkHeader {
 val totalSize = buffer.getInt()
 val chunkSize = buffer.getInt()
 val other = buffer.getInt()
+val hasError = {
--- End diff --

```scala
val hasError = buffer.get() != 0
```
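
Putting the two suggestions together, a minimal standalone sketch of encoding and decoding the flag as a single byte (hypothetical object name, not the actual MessageChunkHeader code):

```scala
import java.nio.ByteBuffer

object ErrorFlagSketch {
  def main(args: Array[String]): Unit = {
    val hasError = true

    // Write side: one byte, 1 for true and 0 for false, as suggested above.
    val buffer = ByteBuffer.allocate(1)
    buffer.put(if (hasError) 1.toByte else 0.toByte)
    buffer.flip()

    // Read side: any non-zero byte means the error flag was set.
    val decoded = buffer.get() != 0
    assert(decoded == hasError)
    println(s"round-tripped hasError = $decoded")
  }
}
```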




[GitHub] spark pull request: [SPARK-2583] ConnectionManager cannot distingu...

2014-07-19 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1490#issuecomment-49528521
  
Thanks @rxin I'll try it.




[GitHub] spark pull request: Typo fix to the programming guide in the docs

2014-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1495#issuecomment-49529878
  
Can one of the admins verify this patch?




[GitHub] spark pull request: Typo fix to the programming guide in the docs

2014-07-19 Thread cesararevalo
GitHub user cesararevalo opened a pull request:

https://github.com/apache/spark/pull/1495

Typo fix to the programming guide in the docs

Typo fix to the programming guide in the docs. Changed the word 
"distibuted" to "distributed".

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cesararevalo/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1495.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1495


commit 0c2e3a71d51705c6010f48261426fcf7392d8a86
Author: Cesar Arevalo ce...@zephyrhealthinc.com
Date:   2014-07-19T21:20:05Z

Typo fix to the programming guide in the docs






[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-19 Thread mburke13
Github user mburke13 commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-49531526
  
@bgreeven Are you continuing work on this pull request so that it passes 
all unit tests?




[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...

2014-07-19 Thread pwendell
GitHub user pwendell opened a pull request:

https://github.com/apache/spark/pull/1496

SPARK-2596 A tool for mirroring github pull requests on JIRA.

For a bunch of reasons we should automatically populate a JIRA with 
information about new pull requests when they arrive. I've written a small 
python script to do this that we can run from Jenkins every 5 or 10 minutes to 
keep things in sync.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pwendell/spark github-integration

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1496.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1496


commit 2087b693f3423c35044c3cc1dcca867d89076111
Author: Patrick Wendell pwend...@gmail.com
Date:   2014-07-19T22:21:29Z

SPARK-2596 A tool for mirroring github pull requests on JIRA.






[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1496#issuecomment-49532480
  
QA tests have started for PR 1496. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16855/consoleFull




[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-19 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1452#issuecomment-49532568
  
Apparently this broke the build. Reverting and will work on a fix.




[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1452#issuecomment-49532564
  
@rxin @mateiz this has broken the master build, so we should revert it. If 
you look here, there was never actually a success message from SparkQA - I 
think the tests are hanging.




[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1496#issuecomment-49532713
  
QA tests have started for PR 1496. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16856/consoleFull




[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1496#issuecomment-49533119
  
QA tests have started for PR 1496. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16857/consoleFull




[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1496#issuecomment-49533420
  
QA tests have started for PR 1496. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16858/consoleFull




[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1496#issuecomment-49533533
  
QA tests have started for PR 1496. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16859/consoleFull




[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...

2014-07-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1496




[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1496#issuecomment-49534890
  
QA results for PR 1496:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16859/consoleFull




[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-19 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1452#issuecomment-49534978
  
Hah, the new Spark QA messages are really confusing! Is there no timeout on 
the build?




[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with unre...

2014-07-19 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1497

SPARK-2226:  transform HAVING clauses with unresolvable attributes

This commit adds an analyzer rule to
  1. find expressions in `HAVING` clause filters that depend on unresolved 
attributes, 
  2. push these expressions down to the underlying aggregates, and then
  3. project them away above the filter.

It also enables the `HAVING` queries in the Hive compatibility suite.
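
As an illustration of the query shape the rule targets (hypothetical table and column names, not taken from the patch):

```scala
object HavingExampleSketch {
  def main(args: Array[String]): Unit = {
    // count(value) appears only in the HAVING clause, so the rule must (1) spot the
    // unresolved aggregate expression in the Filter, (2) push count(value) down into
    // the underlying Aggregate's output, and (3) add a Project above the Filter that
    // drops the extra column again.
    val query =
      """SELECT key
        |FROM src
        |GROUP BY key
        |HAVING count(value) > 3""".stripMargin
    println(query)
  }
}
```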

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-2226

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1497.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1497


commit 29a26e3ab6a21e6619f003d905bc7aa7d1cb2976
Author: William Benton wi...@redhat.com
Date:   2014-07-17T15:36:37Z

Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226)

commit c7f2b2c8a19b09ec095a316cb965f18d474d7144
Author: William Benton wi...@redhat.com
Date:   2014-07-17T17:16:18Z

Whitelist HAVING queries.

Also adds golden outputs for HAVING tests.

commit 5a12647c169ee06bba5355c3956a158699247e43
Author: William Benton wi...@redhat.com
Date:   2014-07-19T17:08:17Z

Explanatory comments and stylistic cleanups.






[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with unre...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49535141
  
QA tests have started for PR 1497. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16860/consoleFull




[GitHub] spark pull request: Typo fix to the programming guide in the docs

2014-07-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1495




[GitHub] spark pull request: SPARK-2587: Fix error message in make-distribu...

2014-07-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1489#issuecomment-49535984
  
Ah, good catch. This was my mistake. Thanks for this! I'll merge it.




[GitHub] spark pull request: SPARK-2587: Fix error message in make-distribu...

2014-07-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1489




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-19 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15147416
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/SizeTrackingAppendOnlyBuffer.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.reflect.ClassTag
+
+/**
+ * An append-only buffer that keeps track of its estimated size in bytes.
+ */
+private[spark] class SizeTrackingAppendOnlyBuffer[T: ClassTag]
+  extends PrimitiveVector[T]
+  with SizeTracker {
+
+  override def +=(value: T): Unit = {
+super.+=(value)
+super.afterUpdate()
+  }
+
+  override def resize(newLength: Int): PrimitiveVector[T] = {
+super.resize(newLength)
+resetSamples()
+this
+  }
+
+  override def array: Array[T] = {
--- End diff --

It should be documented that this trims the array to the right size, since 
it looks like a field access now. Also, I'd call it toArray instead to be more 
in line with other Scala collections.
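
A hedged sketch of what the suggested rename and documentation might look like, using a hypothetical stand-in for PrimitiveVector rather than the actual patch:

```scala
import scala.reflect.ClassTag

// Minimal stand-in for PrimitiveVector, just enough to show a documented toArray.
class GrowableVector[T: ClassTag](initialCapacity: Int = 16) {
  private var data = new Array[T](initialCapacity)
  private var numElements = 0

  def +=(value: T): Unit = {
    if (numElements == data.length) {
      val bigger = new Array[T](data.length * 2)
      Array.copy(data, 0, bigger, 0, numElements)
      data = bigger
    }
    data(numElements) = value
    numElements += 1
  }

  /**
   * Returns a copy of the contents trimmed to the number of elements actually written.
   * Named toArray (rather than array) so it reads like other Scala collections and
   * does not look like a cheap field access.
   */
  def toArray: Array[T] = {
    val trimmed = new Array[T](numElements)
    Array.copy(data, 0, trimmed, 0, numElements)
    trimmed
  }
}

object GrowableVectorDemo {
  def main(args: Array[String]): Unit = {
    val v = new GrowableVector[Int]()
    (1 to 5).foreach(v += _)
    println(v.toArray.mkString(", ")) // 1, 2, 3, 4, 5
  }
}
```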




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-19 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15147419
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -463,16 +463,15 @@ private[spark] class BlockManager(
   val values = dataDeserialize(blockId, bytes)
   if (level.deserialized) {
 // Cache the values before returning them
-// TODO: Consider creating a putValues that also takes in a iterator?
-val valuesBuffer = new ArrayBuffer[Any]
-valuesBuffer ++= values
-memoryStore.putValues(blockId, valuesBuffer, level, returnValues = true).data
-  match {
-case Left(values2) =>
-  return Some(new BlockResult(values2, DataReadMethod.Disk, info.size))
-case _ =>
-  throw new SparkException("Memory store did not return back an iterator")
-  }
+val putResult = memoryStore.putValues(
+  blockId, values, level, returnValues = true, allowPersistToDisk = false)
+putResult.data match {
+  case Left(it) =>
+return Some(new BlockResult(it, DataReadMethod.Disk, info.size))
+  case _ =>
+// This only happens if we dropped the values back to disk (which is never)
+throw new SparkException("Memory store did not return an iterator!")
+}
--- End diff --

Isn't it possible that as we unroll the partition here, it will be too 
large? It's certainly less common than it being too large the first time we 
read it, but I can see it happening. I'm thinking of the case where someone 
stores a block as MEMORY_AND_DISK.




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-19 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15147423
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -463,16 +463,15 @@ private[spark] class BlockManager(
   val values = dataDeserialize(blockId, bytes)
   if (level.deserialized) {
 // Cache the values before returning them
-// TODO: Consider creating a putValues that also takes in a iterator?
-val valuesBuffer = new ArrayBuffer[Any]
-valuesBuffer ++= values
-memoryStore.putValues(blockId, valuesBuffer, level, returnValues = true).data
-  match {
-case Left(values2) =>
-  return Some(new BlockResult(values2, DataReadMethod.Disk, info.size))
-case _ =>
-  throw new SparkException("Memory store did not return back an iterator")
-  }
+val putResult = memoryStore.putValues(
+  blockId, values, level, returnValues = true, allowPersistToDisk = false)
+putResult.data match {
+  case Left(it) =>
+return Some(new BlockResult(it, DataReadMethod.Disk, info.size))
+  case _ =>
+// This only happens if we dropped the values back to disk (which is never)
+throw new SparkException("Memory store did not return an iterator!")
+}
--- End diff --

Ah never mind, I see that this checks for it. Might be worthwhile to add a 
comment here.




[GitHub] spark pull request: SPARK-2564. ShuffleReadMetrics.totalBlocksRead...

2014-07-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1474#issuecomment-49536204
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-19 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15147427
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -140,14 +145,36 @@ private[spark] class CacheManager(blockManager: BlockManager) extends Logging {
   throw new BlockException(key, s"Block manager failed to return cached value for $key!")
   }
 } else {
-  /* This RDD is to be cached in memory. In this case we cannot pass the computed values
+  /*
+   * This RDD is to be cached in memory. In this case we cannot pass the computed values
 * to the BlockManager as an iterator and expect to read it back later. This is because
-   * we may end up dropping a partition from memory store before getting it back, e.g.
-   * when the entirety of the RDD does not fit in memory. */
-  val elements = new ArrayBuffer[Any]
-  elements ++= values
-  updatedBlocks ++= blockManager.put(key, elements, storageLevel, tellMaster = true)
-  elements.iterator.asInstanceOf[Iterator[T]]
+   * we may end up dropping a partition from memory store before getting it back.
+   *
+   * In addition, we must be careful to not unroll the entire partition in memory at once.
+   * Otherwise, we may cause an OOM exception if the JVM does not have enough space for this
+   * single partition. Instead, we unroll the values cautiously, potentially aborting and
+   * dropping the partition to disk if applicable.
+   */
+  blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
+case Left(arr) =>
+  // We have successfully unrolled the entire partition, so cache it in memory
+  updatedBlocks ++=
+blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
+  arr.iterator.asInstanceOf[Iterator[T]]
+case Right(it) =>
+  // There is not enough space to cache this partition in memory
+  logWarning(s"Not enough space to cache $key in memory! " +
+s"Free memory is ${blockManager.memoryStore.freeMemory} bytes.")
+  var returnValues = it.asInstanceOf[Iterator[T]]
+  if (putLevel.useDisk) {
+logWarning(s"Persisting $key to disk instead.")
+val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
+  useOffHeap = false, deserialized = false, putLevel.replication)
+returnValues =
+  putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
+  }
+  returnValues
+  }
--- End diff --

I wonder if we could move this logic into BlockManager later. It can probably be part of another PR, but it's pretty complicated to have this logic here and then similar logic in BlockManager when it's doing a get (not to mention in its put methods).
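
To make the pattern concrete, here is a minimal, self-contained sketch of the "unroll cautiously, fall back if it does not fit" idea discussed above. The names (`UnrollSketch`, `unrollSafely`, `maxElements`) and the element-count threshold are illustrative assumptions only; the real MemoryStore reserves estimated bytes incrementally rather than counting elements, and this is not the actual BlockManager API.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

object UnrollSketch {
  /**
   * Buffer up to `maxElements` values in memory. If the input fits, return the fully
   * materialized array (Left); otherwise hand back the buffered prefix plus the rest of
   * the stream (Right) so the caller can degrade to disk instead of OOMing.
   */
  def unrollSafely[T: ClassTag](values: Iterator[T], maxElements: Int): Either[Array[T], Iterator[T]] = {
    val buffer = new ArrayBuffer[T]
    while (values.hasNext && buffer.size < maxElements) {
      buffer += values.next()
    }
    if (values.hasNext) {
      Right(buffer.iterator ++ values)   // did not fit: let the caller spill to disk
    } else {
      Left(buffer.toArray)               // fit entirely: safe to cache in memory
    }
  }

  def main(args: Array[String]): Unit = {
    unrollSafely(Iterator.range(0, 10), maxElements = 100) match {
      case Left(arr) => println(s"cached ${arr.length} elements in memory")
      case Right(it) => println(s"would spill ${it.length} elements to disk instead")
    }
  }
}
```

The Left/Right split mirrors the match in the diff: Left means the whole partition unrolled safely, Right means the caller should fall back to a disk-only StorageLevel.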


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-19 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15147430
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/SizeTracker.scala ---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.util.SizeEstimator
+
+/**
+ * A general interface for collections to keep track of their estimated sizes in bytes.
+ * We sample with a slow exponential back-off using the SizeEstimator to amortize the time,
+ * as each call to SizeEstimator is somewhat expensive (order of a few milliseconds).
+ */
+private[spark] trait SizeTracker {
+
+  import SizeTracker._
+
+  /**
+   * Controls the base of the exponential which governs the rate of sampling.
+   * E.g., a value of 2 would mean we sample at 1, 2, 4, 8, ... elements.
+   */
+  private val SAMPLE_GROWTH_RATE = 1.1
+
+  /** Samples taken since last resetSamples(). Only the last two are kept for extrapolation. */
+  private val samples = new ArrayBuffer[Sample]
--- End diff --

Since we only use the last two samples, can we make this another data structure? You could even make it a Queue. It's kind of confusing to see an ArrayBuffer but then a comment saying that only two are used.
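
To see the sampling-and-extrapolation idea (and the Queue suggestion) in one place, here is a rough sketch under assumed names; it is not the real SizeTracker, and the caller supplies the byte estimate instead of SizeEstimator.

```scala
import scala.collection.mutable

class SizeTrackerSketch(growthRate: Double = 1.1) {
  private case class Sample(sizeBytes: Long, numUpdates: Long)

  // Only the two most recent samples are needed for a linear extrapolation.
  private val samples = mutable.Queue[Sample]()
  private var numUpdates = 0L
  private var nextSampleNum = 1L
  private var bytesPerUpdate = 0.0

  /** Call after each update; occasionally record a sample (exponential back-off). */
  def afterUpdate(estimatedSizeBytes: => Long): Unit = {
    numUpdates += 1
    if (numUpdates >= nextSampleNum) {
      samples.enqueue(Sample(estimatedSizeBytes, numUpdates))
      if (samples.size > 2) samples.dequeue()
      nextSampleNum = math.ceil(numUpdates * growthRate).toLong
      bytesPerUpdate = samples.toList match {
        case List(prev, latest) =>
          math.max(0.0,
            (latest.sizeBytes - prev.sizeBytes).toDouble / (latest.numUpdates - prev.numUpdates))
        case _ => 0.0
      }
    }
  }

  /** Extrapolate from the last sample plus the updates seen since then. */
  def estimateSize(): Long = samples.lastOption match {
    case Some(last) => last.sizeBytes + (bytesPerUpdate * (numUpdates - last.numUpdates)).toLong
    case None       => 0L
  }
}
```

The back-off means the (expensive) size measurement happens only O(log n) times over n updates, which is what amortizes its cost.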


[GitHub] spark pull request: SPARK-2564. ShuffleReadMetrics.totalBlocksRead...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1474#issuecomment-49536289
  
QA tests have started for PR 1474. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16861/consoleFull


[GitHub] spark pull request: [SPARK-2524] missing document about spark.depl...

2014-07-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1443#issuecomment-49536333
  
Thanks - I can merge this.


[GitHub] spark pull request: SPARK-2564. ShuffleReadMetrics.totalBlocksRead...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1474#issuecomment-49536330
  
QA results for PR 1474:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16861/consoleFull


[GitHub] spark pull request: [SPARK-2524] missing document about spark.depl...

2014-07-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1443


[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...

2014-07-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1364#issuecomment-49536424
  
Jenkins, retest this please.


[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1364#issuecomment-49536483
  
QA tests have started for PR 1364. This patch DID NOT merge cleanly!
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16862/consoleFull


[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1364#issuecomment-49536491
  
QA results for PR 1364:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16862/consoleFull


[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...

2014-07-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49536515
  
QA results for PR 1497:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16860/consoleFull


[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...

2014-07-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1253#issuecomment-49536672
  
Hey @sryza - I did a straw poll offline discussing this with a few other 
contributors. The consensus was that it might be better to have a `--conf` flag 
with an `=` sign instead of representing spark conf properties directly as 
flags.

I.e. `--conf spark.app.name=blah`

One admittedly bad thing about this approach is that if users have arguments with spaces in them, they will have to quote the entire thing:

```
./bin/spark-submit --conf "spark.app.name=My app"
```

That might not be intuitive, so it would be good to document it (provided you are generally okay with this proposed syntax).
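
For completeness, a tiny sketch of splitting such a flag value on the first `=` (illustrative only; not the actual SparkSubmit argument parser):

```scala
object ConfArgSketch {
  /** Split a --conf argument of the form key=value on the first '='. */
  def parseConf(arg: String): (String, String) = arg.split("=", 2) match {
    case Array(k, v) if k.nonEmpty => (k, v)
    case _ => throw new IllegalArgumentException(s"--conf expects key=value, got: $arg")
  }

  def main(args: Array[String]): Unit = {
    // A value containing spaces must be quoted on the shell command line,
    // e.g. --conf "spark.app.name=My app"
    println(parseConf("spark.app.name=My app"))      // (spark.app.name,My app)
    println(parseConf("spark.executor.memory=2g"))   // (spark.executor.memory,2g)
  }
}
```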

