[GitHub] spark pull request: put 'curRequestSize = 0' after 'logDebug' it
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1477#issuecomment-49501450 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Fixed the number of worker thread
Github user fireflyc commented on the pull request: https://github.com/apache/spark/pull/1485#issuecomment-49501533 My program is Spark Streaming over Hadoop YARN. It works on a user click stream. I read the code: the number of worker threads, and blocking?
[GitHub] spark pull request: put 'curRequestSize = 0' after 'logDebug' it
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1477#issuecomment-49501569 QA tests have started for PR 1477. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16842/consoleFull
[GitHub] spark pull request: Improve scheduler delay tooltip.
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/1488#issuecomment-49501577 Jenkins, retest this please
[GitHub] spark pull request: Improve scheduler delay tooltip.
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/1488#issuecomment-49501586 Jenkins, retest this *pretty* please ?
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1452#issuecomment-49501642 Thanks for taking a look. I'm merging this one as is, and will submit a small PR to fix the issues.
[GitHub] spark pull request: Improve scheduler delay tooltip.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1488#issuecomment-49501654 QA tests have started for PR 1488. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16843/consoleFull
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1452#discussion_r15142372
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala ---
@@ -17,134 +17,68 @@
 package org.apache.spark.scheduler
-import scala.language.existentials
+import java.nio.ByteBuffer
 import java.io._
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
-
-import scala.collection.mutable.HashMap
 import org.apache.spark._
-import org.apache.spark.rdd.{RDD, RDDCheckpointData}
-
-private[spark] object ResultTask {
-
-  // A simple map between the stage id to the serialized byte array of a task.
-  // Served as a cache for task serialization because serialization can be
-  // expensive on the master node if it needs to launch thousands of tasks.
-  private val serializedInfoCache = new HashMap[Int, Array[Byte]]
-
-  def serializeInfo(stageId: Int, rdd: RDD[_], func: (TaskContext, Iterator[_]) => _): Array[Byte] =
-  {
-    synchronized {
-      val old = serializedInfoCache.get(stageId).orNull
-      if (old != null) {
-        old
-      } else {
-        val out = new ByteArrayOutputStream
-        val ser = SparkEnv.get.closureSerializer.newInstance()
-        val objOut = ser.serializeStream(new GZIPOutputStream(out))
-        objOut.writeObject(rdd)
-        objOut.writeObject(func)
-        objOut.close()
-        val bytes = out.toByteArray
-        serializedInfoCache.put(stageId, bytes)
-        bytes
-      }
-    }
-  }
-
-  def deserializeInfo(stageId: Int, bytes: Array[Byte]): (RDD[_], (TaskContext, Iterator[_]) => _) =
-  {
-    val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
-    val ser = SparkEnv.get.closureSerializer.newInstance()
-    val objIn = ser.deserializeStream(in)
-    val rdd = objIn.readObject().asInstanceOf[RDD[_]]
-    val func = objIn.readObject().asInstanceOf[(TaskContext, Iterator[_]) => _]
-    (rdd, func)
-  }
-
-  def removeStage(stageId: Int) {
-    serializedInfoCache.remove(stageId)
-  }
-
-  def clearCache() {
-    synchronized {
-      serializedInfoCache.clear()
-    }
-  }
-}
-
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.rdd.RDD
 /**
  * A task that sends back the output to the driver application.
  *
- * See [[org.apache.spark.scheduler.Task]] for more information.
+ * See [[Task]] for more information.
  *
  * @param stageId id of the stage this task belongs to
- * @param rdd input to func
+ * @param rddBinary broadcast version of of the serialized RDD
  * @param func a function to apply on a partition of the RDD
- * @param _partitionId index of the number in the RDD
+ * @param partition partition of the RDD this task is associated with
  * @param locs preferred task execution locations for locality scheduling
  * @param outputId index of the task in this job (a job can launch tasks on only a subset of the
  *                 input RDD's partitions).
  */
 private[spark] class ResultTask[T, U](
     stageId: Int,
-    var rdd: RDD[T],
-    var func: (TaskContext, Iterator[T]) => U,
-    _partitionId: Int,
+    val rddBinary: Broadcast[Array[Byte]],
+    val func: (TaskContext, Iterator[T]) => U,
+    val partition: Partition,
     @transient locs: Seq[TaskLocation],
-    var outputId: Int)
-  extends Task[U](stageId, _partitionId) with Externalizable {
-
-  def this() = this(0, null, null, 0, null, 0)
-
-  var split = if (rdd == null) null else rdd.partitions(partitionId)
+    val outputId: Int)
+  extends Task[U](stageId, partition.index) with Serializable {
--- End diff --
@mateiz and I looked and it seems so.
[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49501954 As for benchmarks, the micro benchmark code that comes with #758 may be helpful. And I feel that partitioning support for Parquet should be considered together with the refactoring @yhuai suggested.
[GitHub] spark pull request: SPARK-2372 [MLLIB] Grouped Optimization/Learni...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1292#issuecomment-49502313 QA tests have started for PR 1292. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16844/consoleFull
[GitHub] spark pull request: [SPARK-2495][MLLIB] remove private[mllib] from...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/1492 [SPARK-2495][MLLIB] remove private[mllib] from linear models' constructors This is part of SPARK-2495 to allow users to construct linear models manually. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mengxr/spark public-constructor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1492.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1492 commit a48b766dd6c19e981e4af41f27abc8163d761083 Author: Xiangrui Meng m...@databricks.com Date: 2014-07-19T08:20:44Z remove private[mllib] from linear models' constructors
[GitHub] spark pull request: put 'curRequestSize = 0' after 'logDebug' it
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1477#issuecomment-49503177 QA results for PR 1477:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16842/consoleFull
[GitHub] spark pull request: [SPARK-2495][MLLIB] remove private[mllib] from...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1492#issuecomment-49503225 QA tests have started for PR 1492. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16845/consoleFull
[GitHub] spark pull request: put 'curRequestSize = 0' after 'logDebug' it
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1477#issuecomment-49503209 Thanks. Merging in master.
[GitHub] spark pull request: put 'curRequestSize = 0' after 'logDebug' it
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1477
[GitHub] spark pull request: Improve scheduler delay tooltip.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1488#issuecomment-49503309 QA results for PR 1488:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16843/consoleFull
[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/1493 [SPARK-2552][MLLIB] stabilize logistic function in pyspark to avoid overflow in `exp(x)` if `x` is large. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mengxr/spark py-logistic Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1493.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1493 commit 259e863d96bcc54fc3f74e41b35e4c7494d0476f Author: Xiangrui Meng m...@databricks.com Date: 2014-07-19T08:51:38Z stabilize logistic function in pyspark
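For readers unfamiliar with the overflow this pull request mentions, here is a minimal sketch of one common way to evaluate the logistic function without overflowing `exp`. It is not taken from the patch itself; the function name `stable_logistic` is hypothetical and the actual change in pyspark may differ.

```python
import math


def stable_logistic(x: float) -> float:
    """Evaluate 1 / (1 + exp(-x)) without overflowing math.exp.

    Hypothetical helper for illustration; not the code from PR 1493.
    """
    if x >= 0:
        # exp(-x) lies in (0, 1] here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) lies in (0, 1); use the equivalent
    # form exp(x) / (1 + exp(x)) so exp never sees a large argument.
    e = math.exp(x)
    return e / (1.0 + e)
```

The idea is that `exp` is only ever called on a non-positive argument, so the worst case is a harmless underflow to zero rather than an `OverflowError`.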
[GitHub] spark pull request: Required AM memory is amMem, not args.amMem...
Github user maji2014 closed the pull request at: https://github.com/apache/spark/pull/1457
[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1493#issuecomment-49504244 QA tests have started for PR 1493. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16847/consoleFull
[GitHub] spark pull request: Required AM memory is amMem, not args.amMem...
GitHub user maji2014 opened a pull request: https://github.com/apache/spark/pull/1494 Required AM memory is amMem, not args.amMemory. If this code is not changed, the following error appears: "ERROR yarn.Client: Required AM memory (1024) is above the max threshold (1048) of this cluster". Obviously, 1024 is less than 1048, so this changes the check. You can merge this pull request into a Git repository by running: $ git pull https://github.com/maji2014/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1494.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1494 commit b0f66400990befead3a6aaa1172112bd090272e8 Author: derek ma ma...@asiainfo-linkage.com Date: 2014-07-19T10:53:08Z Required AM memory is amMem, not args.amMemory ERROR yarn.Client: Required AM memory (1024) is above the max threshold (1048) of this cluster appears if this code is not changed. obviously, 1024 is less than 1048, so change this
[GitHub] spark pull request: Required AM memory is amMem, not args.amMem...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1494#issuecomment-49506353 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/940#issuecomment-49508061 QA tests have started for PR 940. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16848/consoleFull
[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1442
[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1442#issuecomment-49513281 Thanks! I've merged this into master.
[GitHub] spark pull request: SPARK-1097: Do not introduce deadlock while fi...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1409#issuecomment-49513420 @aarondav @pwendell In my tests, it seems that there is still a deadlock. A possible cause is here: [Executor.scala#L189](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L189)
[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1493#issuecomment-49515870 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-2495][MLLIB] remove private[mllib] from...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1492#issuecomment-49515873 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-2495][MLLIB] remove private[mllib] from...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1492#issuecomment-49515905 QA tests have started for PR 1492. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16850/consoleFull
[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1493#issuecomment-49515915 QA tests have started for PR 1493. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16849/consoleFull
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-49516746 QA tests have started for PR 1165. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16851/consoleFull
[GitHub] spark pull request: Fixed the number of worker thread
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1485#issuecomment-49526386 @fireflyc Spark should not be scheduling more than N concurrent tasks on an Executor. It appears that the tasks may be returning success but then don't actually return the thread to the thread pool. This is itself a bug -- could you run jstack on your Executor process to see where the threads are stuck? Perhaps new tasks are just starting before the old threads finish cleaning up, and thus this solution is the right one, but I'd like to find out exactly why.
[GitHub] spark pull request: [WIP][SPARK-2595:]The driver run garbage colle...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1387#issuecomment-49527449 QA tests have started for PR 1387. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16852/consoleFull
[GitHub] spark pull request: [WIP][SPARK-2491]: Fix When an OOM is thrown,t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1482#issuecomment-49527553 QA tests have started for PR 1482. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16853/consoleFull
[GitHub] spark pull request: [WIP][SPARK-2491]: Fix When an OOM is thrown,t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1482#issuecomment-49527623 QA results for PR 1482:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16853/consoleFull
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15145476
--- Diff: core/src/main/scala/org/apache/spark/util/collection/SizeTrackingAppendOnlyBuffer.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.reflect.ClassTag
+
+/**
+ * An append-only buffer that keeps track of its estimated size in bytes.
+ */
+private[spark] class SizeTrackingAppendOnlyBuffer[T: ClassTag]
--- End diff --
Better to call this SizeTrackingVector
[GitHub] spark pull request: [WIP][SPARK-2491]: Fix When an OOM is thrown,t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1482#issuecomment-49527896 QA tests have started for PR 1482. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16854/consoleFull
[GitHub] spark pull request: [SPARK-2583] ConnectionManager cannot distingu...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1490#discussion_r15145612
--- Diff: core/src/main/scala/org/apache/spark/network/MessageChunkHeader.scala ---
@@ -41,6 +42,13 @@ private[spark] class MessageChunkHeader(
       putInt(totalSize).
       putInt(chunkSize).
       putInt(other).
+      put{
--- End diff --
How about
```scala
put(if (hasError) 1.asInstanceOf[Byte] else 0.asInstanceOf[Byte])
```
[GitHub] spark pull request: [SPARK-2583] ConnectionManager cannot distingu...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1490#discussion_r15145614
--- Diff: core/src/main/scala/org/apache/spark/network/MessageChunkHeader.scala ---
@@ -67,13 +75,20 @@ private[spark] object MessageChunkHeader {
     val totalSize = buffer.getInt()
     val chunkSize = buffer.getInt()
     val other = buffer.getInt()
+    val hasError = {
--- End diff --
```scala
val hasError = buffer.get() != 0
```
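The two review suggestions above amount to writing a boolean as a single byte and reading it back with a `!= 0` test. As a rough Python analogue (not Spark code), the round trip might look like the sketch below. The three-int layout mirrors only the fields visible in the diff context; the real `MessageChunkHeader` has more fields, and `HEADER_FORMAT`, `pack_header`, and `unpack_header` are hypothetical names used for illustration.

```python
import struct

# Hypothetical layout: three big-endian 32-bit ints followed by a
# one-byte error flag. The real header contains additional fields.
HEADER_FORMAT = ">iiib"


def pack_header(total_size: int, chunk_size: int, other: int, has_error: bool) -> bytes:
    # Encode the boolean as 1 or 0 in a single byte, as in the first suggestion.
    return struct.pack(HEADER_FORMAT, total_size, chunk_size, other, 1 if has_error else 0)


def unpack_header(data: bytes):
    total_size, chunk_size, other, flag = struct.unpack(HEADER_FORMAT, data)
    # Decode with a != 0 test, as in the second suggestion.
    return total_size, chunk_size, other, flag != 0
```

Decoding with `!= 0` rather than `== 1` treats any non-zero byte as an error flag, which is the behaviour the second suggestion implies.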
[GitHub] spark pull request: [SPARK-2583] ConnectionManager cannot distingu...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/1490#issuecomment-49528521 Thanks @rxin I'll try it.
[GitHub] spark pull request: Typo fix to the programming guide in the docs
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1495#issuecomment-49529878 Can one of the admins verify this patch?
[GitHub] spark pull request: Typo fix to the programming guide in the docs
GitHub user cesararevalo opened a pull request: https://github.com/apache/spark/pull/1495 Typo fix to the programming guide in the docs Typo fix to the programming guide in the docs. Changed the word distibuted to distributed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cesararevalo/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1495.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1495 commit 0c2e3a71d51705c6010f48261426fcf7392d8a86 Author: Cesar Arevalo ce...@zephyrhealthinc.com Date: 2014-07-19T21:20:05Z Typo fix to the programming guide in the docs
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user mburke13 commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-49531526 @bgreeven Are you continuing work on this pull request so that it passes all unit tests?
[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...
GitHub user pwendell opened a pull request: https://github.com/apache/spark/pull/1496 SPARK-2596 A tool for mirroring github pull requests on JIRA. For a bunch of reasons we should automatically populate a JIRA with information about new pull requests when they arrive. I've written a small Python script to do this that we can run from Jenkins every 5 or 10 minutes to keep things in sync. You can merge this pull request into a Git repository by running: $ git pull https://github.com/pwendell/spark github-integration Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1496.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1496 commit 2087b693f3423c35044c3cc1dcca867d89076111 Author: Patrick Wendell pwend...@gmail.com Date: 2014-07-19T22:21:29Z SPARK-2596 A tool for mirroring github pull requests on JIRA.
[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1496#issuecomment-49532480 QA tests have started for PR 1496. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16855/consoleFull
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1452#issuecomment-49532568 Apparently this broke the build. Reverting and will work on a fix.
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1452#issuecomment-49532564 @rxin @mateiz this has broken the master build so we should revert it. If you look here there was never actually a success message from SparkQA - I think the tests are hanging.
[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1496#issuecomment-49532713 QA tests have started for PR 1496. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16856/consoleFull
[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1496#issuecomment-49533119 QA tests have started for PR 1496. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16857/consoleFull
[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1496#issuecomment-49533420 QA tests have started for PR 1496. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16858/consoleFull
[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1496#issuecomment-49533533 QA tests have started for PR 1496. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16859/consoleFull
[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1496
[GitHub] spark pull request: SPARK-2596 A tool for mirroring github pull re...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1496#issuecomment-49534890
QA results for PR 1496:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16859/consoleFull
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1452#issuecomment-49534978 Hah, the new Spark QA messages are really confusing! Is there no timeout on the build?
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with unre...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1497 SPARK-2226: transform HAVING clauses with unresolvable attributes This commit adds an analyzer rule to 1. find expressions in `HAVING` clause filters that depend on unresolved attributes, 2. push these expressions down to the underlying aggregates, and then 3. project them away above the filter. It also enables the `HAVING` queries in the Hive compatibility suite. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-2226 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1497.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1497 commit 29a26e3ab6a21e6619f003d905bc7aa7d1cb2976 Author: William Benton wi...@redhat.com Date: 2014-07-17T15:36:37Z Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226) commit c7f2b2c8a19b09ec095a316cb965f18d474d7144 Author: William Benton wi...@redhat.com Date: 2014-07-17T17:16:18Z Whitelist HAVING queries. Also adds golden outputs for HAVING tests. commit 5a12647c169ee06bba5355c3956a158699247e43 Author: William Benton wi...@redhat.com Date: 2014-07-19T17:08:17Z Explanatory comments and stylistic cleanups.
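The three steps in the description above can be illustrated with a minimal sketch over a toy plan model. Everything here is an assumption made for illustration: the case classes (`Attr`, `Agg`, `Filter`, `Aggregate`, `Project`, ...), the `HavingRewrite` object and the `_having0` alias names are hypothetical and are not Catalyst's classes or the rule in the pull request. The sketch also simplifies "depends on unresolved attributes" to "contains an aggregate call".

```scala
// Toy expression and plan model (hypothetical stand-ins, not Catalyst).
sealed trait Expr
case class Attr(name: String) extends Expr             // reference to a named column
case class Lit(value: Int) extends Expr                // integer literal
case class Agg(fn: String, child: Expr) extends Expr   // aggregate call, e.g. SUM(v)
case class Gt(left: Expr, right: Expr) extends Expr    // left > right
case class Alias(child: Expr, name: String) extends Expr

sealed trait Plan
case class Relation(columns: Seq[String]) extends Plan
case class Aggregate(grouping: Seq[Expr], output: Seq[Expr], child: Plan) extends Plan
case class Filter(condition: Expr, child: Plan) extends Plan
case class Project(output: Seq[Expr], child: Plan) extends Plan

object HavingRewrite {
  // Step 1: find the aggregate calls used inside the HAVING condition.
  private def aggregatesIn(e: Expr): Seq[Agg] = e match {
    case a: Agg      => Seq(a)
    case Gt(l, r)    => aggregatesIn(l) ++ aggregatesIn(r)
    case Alias(c, _) => aggregatesIn(c)
    case _           => Seq.empty
  }

  // Replace every sub-expression that has an entry in `mapping`.
  private def replace(e: Expr, mapping: Map[Expr, Expr]): Expr =
    mapping.getOrElse(e, e match {
      case Gt(l, r)    => Gt(replace(l, mapping), replace(r, mapping))
      case Alias(c, n) => Alias(replace(c, mapping), n)
      case other       => other
    })

  // Column names an output list exposes to the parent operator.
  private def outputNames(exprs: Seq[Expr]): Seq[String] = exprs.collect {
    case Attr(n)     => n
    case Alias(_, n) => n
  }

  def rewrite(plan: Plan): Plan = plan match {
    case Filter(cond, Aggregate(grouping, output, child)) if aggregatesIn(cond).nonEmpty =>
      val aliases = aggregatesIn(cond).distinct.zipWithIndex.map {
        case (agg, i) => Alias(agg, s"_having$i")
      }
      val mapping: Map[Expr, Expr] = aliases.map(al => al.child -> Attr(al.name)).toMap
      // Step 2: push the aggregates down into the Aggregate's output list.
      val widened = Aggregate(grouping, output ++ aliases, child)
      // Step 3: filter on the new columns, then project them away again.
      Project(outputNames(output).map(Attr(_)), Filter(replace(cond, mapping), widened))
    case other => other
  }
}
```

For a query shaped like `SELECT k FROM t GROUP BY k HAVING SUM(v) > 10`, the toy rule adds `SUM(v) AS _having0` to the aggregate, filters on `_having0 > 10`, and projects back to `k`.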
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with unre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49535141 QA tests have started for PR 1497. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16860/consoleFull
[GitHub] spark pull request: Typo fix to the programming guide in the docs
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1495
[GitHub] spark pull request: SPARK-2587: Fix error message in make-distribu...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1489#issuecomment-49535984 Ah, good catch. This was my mistake. Thanks for this! I'll merge it.
[GitHub] spark pull request: SPARK-2587: Fix error message in make-distribu...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1489
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15147416
--- Diff: core/src/main/scala/org/apache/spark/util/collection/SizeTrackingAppendOnlyBuffer.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.reflect.ClassTag
+
+/**
+ * An append-only buffer that keeps track of its estimated size in bytes.
+ */
+private[spark] class SizeTrackingAppendOnlyBuffer[T: ClassTag]
+  extends PrimitiveVector[T]
+  with SizeTracker {
+
+  override def +=(value: T): Unit = {
+    super.+=(value)
+    super.afterUpdate()
+  }
+
+  override def resize(newLength: Int): PrimitiveVector[T] = {
+    super.resize(newLength)
+    resetSamples()
+    this
+  }
+
+  override def array: Array[T] = {
--- End diff --
It should be documented that this trims the array to the right size, since it looks like a field access now. Also, I'd call it toArray instead to be more in line with other Scala collections.
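A minimal sketch of what the review comment asks for, assuming a simplified buffer: the method is named `toArray`, and its doc comment states that the result is trimmed to the number of stored elements. `GrowableBuffer` is a hypothetical stand-in and not Spark's `PrimitiveVector`. Returning a trimmed copy is an assumption here, since the diff is cut off before the method body.

```scala
import scala.reflect.ClassTag

// Hypothetical stand-in for the buffer under review; not Spark's PrimitiveVector.
class GrowableBuffer[T: ClassTag](initialCapacity: Int = 4) {
  require(initialCapacity > 0, "initialCapacity must be positive")

  private var data = new Array[T](initialCapacity)
  private var count = 0

  /** Appends a value, doubling the backing array when it is full. */
  def +=(value: T): Unit = {
    if (count == data.length) {
      val bigger = new Array[T](data.length * 2)
      System.arraycopy(data, 0, bigger, 0, count)
      data = bigger
    }
    data(count) = value
    count += 1
  }

  /** Number of elements appended so far. */
  def size: Int = count

  /** Length of the backing array, which may exceed `size`. */
  def capacity: Int = data.length

  /**
   * Returns a new array trimmed to exactly `size` elements.
   * This copies, so it is not a cheap field access.
   */
  def toArray: Array[T] = {
    val trimmed = new Array[T](count)
    System.arraycopy(data, 0, trimmed, 0, count)
    trimmed
  }
}
```

After five appends to a buffer with the default capacity, `capacity` is 8 while `toArray.length` is 5, which is the distinction the comment wants documented.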
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15147419
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -463,16 +463,15 @@ private[spark] class BlockManager(
   val values = dataDeserialize(blockId, bytes)
   if (level.deserialized) {
     // Cache the values before returning them
-    // TODO: Consider creating a putValues that also takes in a iterator?
-    val valuesBuffer = new ArrayBuffer[Any]
-    valuesBuffer ++= values
-    memoryStore.putValues(blockId, valuesBuffer, level, returnValues = true).data
-      match {
-        case Left(values2) =>
-          return Some(new BlockResult(values2, DataReadMethod.Disk, info.size))
-        case _ =>
-          throw new SparkException("Memory store did not return back an iterator")
-      }
+    val putResult = memoryStore.putValues(
+      blockId, values, level, returnValues = true, allowPersistToDisk = false)
+    putResult.data match {
+      case Left(it) =>
+        return Some(new BlockResult(it, DataReadMethod.Disk, info.size))
+      case _ =>
+        // This only happens if we dropped the values back to disk (which is never)
+        throw new SparkException("Memory store did not return an iterator!")
+    }
--- End diff --
Isn't it possible that as we unroll the partition here, it will be too large? It's certainly less common than it being too large the first time we read it, but I can see it happening. I'm thinking of the case where someone stores a block as MEMORY_AND_DISK.
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15147423
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -463,16 +463,15 @@ private[spark] class BlockManager(
   val values = dataDeserialize(blockId, bytes)
   if (level.deserialized) {
     // Cache the values before returning them
-    // TODO: Consider creating a putValues that also takes in a iterator?
-    val valuesBuffer = new ArrayBuffer[Any]
-    valuesBuffer ++= values
-    memoryStore.putValues(blockId, valuesBuffer, level, returnValues = true).data
-      match {
-        case Left(values2) =>
-          return Some(new BlockResult(values2, DataReadMethod.Disk, info.size))
-        case _ =>
-          throw new SparkException("Memory store did not return back an iterator")
-      }
+    val putResult = memoryStore.putValues(
+      blockId, values, level, returnValues = true, allowPersistToDisk = false)
+    putResult.data match {
+      case Left(it) =>
+        return Some(new BlockResult(it, DataReadMethod.Disk, info.size))
+      case _ =>
+        // This only happens if we dropped the values back to disk (which is never)
+        throw new SparkException("Memory store did not return an iterator!")
+    }
--- End diff --
Ah never mind, I see that this checks for it. Might be worthwhile to add a comment here.
[GitHub] spark pull request: SPARK-2564. ShuffleReadMetrics.totalBlocksRead...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1474#issuecomment-49536204 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15147427
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -140,14 +145,36 @@ private[spark] class CacheManager(blockManager: BlockManager) extends Logging {
       throw new BlockException(key, s"Block manager failed to return cached value for $key!")
   }
 } else {
-  /* This RDD is to be cached in memory. In this case we cannot pass the computed values
+  /*
+   * This RDD is to be cached in memory. In this case we cannot pass the computed values
    * to the BlockManager as an iterator and expect to read it back later. This is because
-   * we may end up dropping a partition from memory store before getting it back, e.g.
-   * when the entirety of the RDD does not fit in memory. */
-  val elements = new ArrayBuffer[Any]
-  elements ++= values
-  updatedBlocks ++= blockManager.put(key, elements, storageLevel, tellMaster = true)
-  elements.iterator.asInstanceOf[Iterator[T]]
+   * we may end up dropping a partition from memory store before getting it back.
+   *
+   * In addition, we must be careful to not unroll the entire partition in memory at once.
+   * Otherwise, we may cause an OOM exception if the JVM does not have enough space for this
+   * single partition. Instead, we unroll the values cautiously, potentially aborting and
+   * dropping the partition to disk if applicable.
+   */
+  blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
+    case Left(arr) =>
+      // We have successfully unrolled the entire partition, so cache it in memory
+      updatedBlocks ++=
+        blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
+      arr.iterator.asInstanceOf[Iterator[T]]
+    case Right(it) =>
+      // There is not enough space to cache this partition in memory
+      logWarning(s"Not enough space to cache $key in memory! " +
+        s"Free memory is ${blockManager.memoryStore.freeMemory} bytes.")
+      var returnValues = it.asInstanceOf[Iterator[T]]
+      if (putLevel.useDisk) {
+        logWarning(s"Persisting $key to disk instead.")
+        val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
+          useOffHeap = false, deserialized = false, putLevel.replication)
+        returnValues =
+          putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
+      }
+      returnValues
+  }
--- End diff --
I wonder if we could move this logic to BlockManager later. Probably can be part of another PR, but it's pretty complicated to have this here and then similar logic in there when it's doing a get (not to mention in its put methods).
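The "unroll cautiously, abort if it does not fit" idea from the diff can be sketched in isolation. This is a simplified illustration under stated assumptions, not the `unrollSafely` in the pull request: the `UnrollSketch` object, the `maxBytes` budget and the caller-supplied `sizeOf` function are hypothetical, and a `Vector` stands in for the array returned on success.

```scala
import scala.collection.mutable.ArrayBuffer

object UnrollSketch {
  /**
   * Pulls values from the iterator while their accumulated (estimated) size
   * stays within maxBytes.
   *  - Left(values):  everything fit and has been materialized.
   *  - Right(values): the budget was exceeded; the returned iterator yields the
   *                   elements already pulled followed by the unread remainder,
   *                   so the caller can still stream all of them elsewhere.
   */
  def unrollSafely[T](values: Iterator[T], maxBytes: Long)(
      sizeOf: T => Long): Either[Vector[T], Iterator[T]] = {
    val buffer = ArrayBuffer.empty[T]
    var used = 0L
    var fits = true
    while (fits && values.hasNext) {
      val next = values.next()
      buffer += next
      used += sizeOf(next)
      fits = used <= maxBytes
    }
    if (fits) Left(buffer.toVector) else Right(buffer.iterator ++ values)
  }
}
```

A caller would match on the result as the diff does: cache the `Left` values in memory, or pass the `Right` iterator on to a disk-backed path when the storage level allows it.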
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15147430
--- Diff: core/src/main/scala/org/apache/spark/util/collection/SizeTracker.scala ---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.util.SizeEstimator
+
+/**
+ * A general interface for collections to keep track of their estimated sizes in bytes.
+ * We sample with a slow exponential back-off using the SizeEstimator to amortize the time,
+ * as each call to SizeEstimator is somewhat expensive (order of a few milliseconds).
+ */
+private[spark] trait SizeTracker {
+
+  import SizeTracker._
+
+  /**
+   * Controls the base of the exponential which governs the rate of sampling.
+   * E.g., a value of 2 would mean we sample at 1, 2, 4, 8, ... elements.
+   */
+  private val SAMPLE_GROWTH_RATE = 1.1
+
+  /** Samples taken since last resetSamples(). Only the last two are kept for extrapolation. */
+  private val samples = new ArrayBuffer[Sample]
--- End diff --
Since we only use the last two, can we make this another data structure? You could even make it a Queue. It's kind of confusing to see an ArrayBuffer but then a comment that only two are used.
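One way the Queue suggestion could look, as a minimal sketch: a `mutable.Queue` that never holds more than two samples, with a linear extrapolation between them. `TwoSampleEstimator` and its `record` and `estimate` methods are hypothetical, the `Sample` fields are assumed, and the extrapolation formula is an illustration rather than the one in the pull request.

```scala
import scala.collection.mutable

// Assumed shape of a sample: estimated size in bytes after a given number of updates.
case class Sample(size: Long, numUpdates: Long)

class TwoSampleEstimator {
  // Holds at most the two most recent samples, oldest first.
  private val samples = mutable.Queue.empty[Sample]
  private var bytesPerUpdate = 0.0

  /** Records a new sample and refreshes the growth rate from the last two. */
  def record(size: Long, numUpdates: Long): Unit = {
    samples.enqueue(Sample(size, numUpdates))
    if (samples.size > 2) {
      samples.dequeue() // drop the oldest so only two remain
    }
    bytesPerUpdate = samples.toList match {
      case older :: newer :: Nil if newer.numUpdates != older.numUpdates =>
        val delta = (newer.size - older.size).toDouble / (newer.numUpdates - older.numUpdates)
        math.max(0.0, delta)
      case _ => 0.0
    }
  }

  /** Extrapolates the size at numUpdates from the latest sample; 0 if nothing was recorded. */
  def estimate(numUpdates: Long): Long = samples.lastOption match {
    case Some(last) => (last.size + bytesPerUpdate * (numUpdates - last.numUpdates)).toLong
    case None       => 0L
  }
}
```

The bounded queue makes the "only the last two are kept" invariant visible in the data structure itself instead of only in a comment.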
[GitHub] spark pull request: SPARK-2564. ShuffleReadMetrics.totalBlocksRead...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1474#issuecomment-49536289 QA tests have started for PR 1474. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16861/consoleFull
[GitHub] spark pull request: [SPARK-2524] missing document about spark.depl...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1443#issuecomment-49536333 Thanks - I can merge this.
[GitHub] spark pull request: SPARK-2564. ShuffleReadMetrics.totalBlocksRead...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1474#issuecomment-49536330
QA results for PR 1474:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16861/consoleFull
[GitHub] spark pull request: [SPARK-2524] missing document about spark.depl...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1443
[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1364#issuecomment-49536424 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1364#issuecomment-49536483 QA tests have started for PR 1364. This patch DID NOT merge cleanly! View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16862/consoleFull
[GitHub] spark pull request: [SPARK-2086] Improve output of toDebugString t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1364#issuecomment-49536491
QA results for PR 1364:
- This patch FAILED unit tests.
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16862/consoleFull
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49536515
QA results for PR 1497:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16860/consoleFull
[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1253#issuecomment-49536672 Hey @sryza - I did a straw poll offline discussing this with a few other contributors. The consensus was that it might be better to have a `--conf` flag with an `=` sign instead of representing Spark conf properties directly as flags, i.e. `--conf spark.app.name=blah`. One admittedly bad thing about this approach is that if users have arguments with spaces in them, they will have to quote the entire thing: ``` ./bin/spark-submit --conf "spark.app.name=My app" ``` This might not be intuitive, so it would be good to document it (provided you are generally okay with this proposed syntax).
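A minimal sketch of how the proposed `--conf key=value` syntax could be parsed once the shell has done its quoting. `ConfArgs.parse` is hypothetical and is not spark-submit's argument handling. It assumes the shell delivers `spark.app.name=My app` as a single argument, which is why the whole pair has to be quoted on the command line.

```scala
import scala.annotation.tailrec

object ConfArgs {
  /**
   * Collects every `--conf key=value` pair from the argument list.
   * Only the first '=' separates key from value, so values may contain '='.
   * Arguments other than `--conf` pairs are skipped.
   */
  @tailrec
  def parse(args: List[String], acc: Map[String, String] = Map.empty): Map[String, String] =
    args match {
      case "--conf" :: kv :: rest =>
        kv.split("=", 2) match {
          case Array(key, value) => parse(rest, acc + (key -> value))
          case _ =>
            throw new IllegalArgumentException(s"Expected key=value after --conf, got: $kv")
        }
      case _ :: rest => parse(rest, acc)
      case Nil       => acc
    }
}
```

Without the quotes, the shell would split the example into `spark.app.name=My` and `app`, and the parser would see `My` as the whole value with `app` as a stray argument.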