Re: File JIRAs for all flaky test failures

2017-03-27 Thread Kay Ousterhout
. Specifically, entries in this feed are test failures which a) occurred in the last week, b) were not part of a build which had 20 or more failed tests, and c) were not observed to fail during the previous week (i.e. no failures from [2 w

Re: planning & discussion for larger scheduler changes

2017-03-27 Thread Kay Ousterhout
(1) I'm pretty hesitant to merge these larger changes, even if they're feature flagged, because: (a) For some of these changes, it's not obvious that they'll always improve performance. e.g., for SPARK-14649, it's possible that the tasks that got re-started (and temporarily are running in two

File JIRAs for all flaky test failures

2017-02-15 Thread Kay Ousterhout
Hi all, I've noticed the Spark tests getting increasingly flaky -- it seems more common than not now that the tests need to be re-run at least once on PRs before they pass. This is both annoying and problematic because it makes it harder to tell when a PR is introducing new flakiness. To try to

Re: Tests failing with GC limit exceeded

2017-01-05 Thread Kay Ousterhout
08 builds ... 16 builds ... gc <--- failures. it's also happening across all workers at about the same rate. and best of all, there seems to be no pattern to which tests are failing (different each time). i'll look a l

Re: Tests failing with GC limit exceeded

2017-01-05 Thread Kay Ousterhout
what to do next. On Tue, Jan 3, 2017 at 6:49 PM, shane knapp <skn...@berkeley.edu> wrote: nope, no changes to jenkins in the past few months. ganglia graphs show higher, but not worrying, memory usage on the workers when the

Re: Why is spark.shuffle.sort.bypassMergeThreshold 200?

2017-01-04 Thread Kay Ousterhout
I believe that these two were indeed originally related. In the old hash-based shuffle, we wrote objects out immediately to disk as they were generated by an RDD's iterator. On the other hand, with the original version of the new sort-based shuffle, Spark buffered a bunch of objects before
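
A minimal sketch of where this threshold shows up in practice (my illustration, not text from the thread): in the sort-based shuffle, a shuffle whose number of reduce partitions is at or below spark.shuffle.sort.bypassMergeThreshold, and which needs no map-side aggregation, can skip the buffering/sorting described above and instead write one file per reduce partition.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: app name and numbers are made up for illustration.
    val conf = new SparkConf()
      .setAppName("bypass-merge-threshold-sketch")
      .setMaster("local[2]")
      .set("spark.shuffle.sort.bypassMergeThreshold", "200")  // the default discussed here

    val sc = new SparkContext(conf)

    // 100 reduce partitions (<= 200) and no map-side combine, so this shuffle is a
    // candidate for the bypass path rather than the buffering/sorting path.
    sc.parallelize(1 to 1000000)
      .map(i => (i % 100, i))
      .groupByKey(100)
      .count()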

Duplicate (?) code paths to handle Executor failures

2015-10-25 Thread Kay Ousterhout
Hi all, I noticed that when the JVM for an executor fails, in Standalone mode, we have two duplicate code paths that handle the failure, one via Akka, and the second via the Worker/ExecutorRunner: via Akka: (1) CoarseGrainedSchedulerBackend is notified that the remote Akka endpoint is

Re: Stages with non-arithmetic numbering Timing metrics in event logs

2015-06-11 Thread Kay Ousterhout
Here’s how the shuffle works. This explains what happens for a single task; this will happen in parallel for each task running on the machine, and as Imran said, Spark runs up to “numCores” tasks concurrently on each machine. There's also an answer to the original question about why CPU use is
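
As a small aside on the "numCores" point (my gloss, not part of the original message), the per-executor task concurrency is bounded by two configs:

    import org.apache.spark.SparkConf

    // Illustration: each executor runs up to (executor cores) / (cpus per task)
    // tasks at once, so 8 / 1 = 8 concurrent tasks per executor here.
    val conf = new SparkConf()
      .set("spark.executor.cores", "8")  // cores given to each executor
      .set("spark.task.cpus", "1")       // cores reserved per task (default 1)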

Re: Stages with non-arithmetic numbering Timing metrics in event logs

2015-06-11 Thread Kay Ousterhout
. This should be preserved for reference somewhere searchable. -Gerard. On Fri, Jun 12, 2015 at 1:19 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote: Here’s how the shuffle works. This explains what happens for a single task; this will happen in parallel for each task running on the machine

Re: Profiling Spark: MemoryStore

2015-03-17 Thread Kay Ousterhout
Hi Alexander, The stack trace is a little misleading here: all of the time is spent in MemoryStore, but that's because MemoryStore is unrolling an iterator (note the iterator.next() call) so that it can be stored in-memory. Essentially all of the computation for the tasks happens as part of that
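
A tiny sketch of why profiles look this way (my example, not Alexander's code): with a cached RDD, the user-defined work is pulled through the iterator while MemoryStore unrolls the partition, so the time is attributed to MemoryStore frames.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("memorystore-profiling-sketch").setMaster("local[2]"))

    def expensive(x: Int): Int = { Thread.sleep(1); x * 2 }  // stand-in for real work

    // map() is lazy; nothing runs yet.
    val cached = sc.parallelize(1 to 10000).map(expensive).cache()

    // On the first action, MemoryStore unrolls the mapped iterator to store each
    // partition in memory, so the time spent in expensive() appears under
    // MemoryStore / iterator.next() in a profiler.
    cached.count()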

Re: Spark Summit CFP - Tracks guidelines

2015-02-04 Thread Kay Ousterhout
Did you see the longer descriptions under the Learn More link? Developer This track will present technical deep dive content across a wide range of advanced/basic topics. Data Science This track will focus on the practice of data science using Spark. Sessions should cover innovative techniques,

Re: Can spark provide an option to start reduce stage early?

2015-02-03 Thread Kay Ousterhout
There's a JIRA tracking this here: https://issues.apache.org/jira/browse/SPARK-2387 On Mon, Feb 2, 2015 at 9:48 PM, Xuelin Cao xuelincao2...@gmail.com wrote: In hadoop MR, there is an option *mapred.reduce.slowstart.completed.maps* which can be used to start reducer stage when X% mappers are

Re: Semantics of LGTM

2015-01-17 Thread Kay Ousterhout
+1 to Patrick's proposal of strong LGTM semantics. On past projects, I've heard the semantics of LGTM expressed as I've looked at this thoroughly and take as much ownership as if I wrote the patch myself. My understanding is that this is the level of review we expect for all patches that

keeping PR titles / descriptions up to date

2014-12-02 Thread Kay Ousterhout
Hi all, I've noticed a bunch of times lately where a pull request ends up pretty different from the original pull request, and the title / description never get updated. Because the pull request title and description are used as the commit message, the incorrect description lives on

Re: Problems with spark.locality.wait

2014-11-13 Thread Kay Ousterhout
Hi, Shivaram and I stumbled across this problem a few weeks ago, and AFAIK there is no nice solution. We worked around it by avoiding jobs with tasks that have two locality levels. To fix this properly, we really need to fix the underlying problem in the scheduling code, which

Re: Problems with spark.locality.wait

2014-11-13 Thread Kay Ousterhout
Hi Mridul, In the case Shivaram and I saw, and based on my understanding of Ma chong's description, I don't think that completely fixes the problem. To be very concrete, suppose your job has two tasks, t1 and t2, and they each have input data (in HDFS) on h1 and h2, respectively, and that h1 and
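
One blunt mitigation people sometimes reach for (not necessarily what Shivaram and Kay did, and it trades the stall for worse locality) is shrinking the delay-scheduling timeout so a task like t2 falls back to a non-local host quickly:

    import org.apache.spark.SparkConf

    // Sketch of the workaround knobs; values are illustrative.
    val conf = new SparkConf()
      .set("spark.locality.wait", "0")       // default 3s; 0 disables the wait
      .set("spark.locality.wait.node", "0")
      .set("spark.locality.wait.rack", "0")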

Re: [VOTE] Designating maintainers for some Spark components

2014-11-07 Thread Kay Ousterhout
+1 (binding) I see this as a way to increase transparency and efficiency around a process that already informally exists, with benefits to both new contributors and committers. For new contributors, it makes clear who they should ping about a pending patch. For committers, it's a good reference

Re: Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Kay Ousterhout
Hi Nick, This hasn't yet been directly supported by Spark because of a lack of demand. The last time I ran a throughput test on the default Spark scheduler (~1 year ago, so this may have changed), it could launch approximately 1500 tasks / second. If, for example, you have a cluster of 100

Re: Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Kay Ousterhout
On Fri, Nov 7, 2014 at 6:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: If, for example, you have a cluster of 100 machines, this means the scheduler can launch 150 tasks per machine per second. Did you mean 15 tasks per machine per second here? Or alternatively, 10 machines?
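
(For reference, the arithmetic being questioned: 1500 tasks per second spread over 100 machines is 1500 / 100 = 15 tasks per machine per second; a rate of 150 tasks per machine per second would instead correspond to a 10-machine cluster.)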

Re: Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Kay Ousterhout
I don't have much more info than what Shivaram said. My sense is that, over time, task launch overhead with Spark has slowly grown as Spark supports more and more functionality. However, I haven't seen it be as high as the 100ms Michael quoted (maybe this was for jobs with tasks that have much

Re: Surprising Spark SQL benchmark

2014-11-01 Thread Kay Ousterhout
Hi Nick, No -- we're doing a much more constrained thing of just trying to get things set up to easily run TPC-DS on SparkSQL (which involves generating the data, storing it in HDFS, getting all the queries in the right format, etc.). Cloudera does have a repo here:

Re: Surprising Spark SQL benchmark

2014-10-31 Thread Kay Ousterhout
There's been an effort in the AMPLab at Berkeley to set up a shared codebase that makes it easy to run TPC-DS on SparkSQL, since it's something we do frequently in the lab to evaluate new research. Based on this thread, it sounds like making this more widely-available is something that would be

Re: Get attempt number in a closure

2014-10-20 Thread Kay Ousterhout
Are you guys sure this is a bug? In the task scheduler, we keep two identifiers for each task: the index, which uniquely identifies the computation+partition, and the taskId, which is unique across all tasks for that Spark context (See
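
For anyone landing here from the original question, a sketch of reading these identifiers inside a closure (my example; it assumes a Spark version with TaskContext.get, attemptNumber, and taskAttemptId, which postdates this 2014 thread):

    import org.apache.spark.{SparkConf, SparkContext, TaskContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("attempt-number-sketch").setMaster("local[2]"))

    sc.parallelize(1 to 4, 2).foreach { _ =>
      val ctx = TaskContext.get()
      // partitionId is the "index" (computation + partition); taskAttemptId is
      // unique across all tasks in this SparkContext; attemptNumber counts retries
      // of this particular partition, starting at 0.
      println(s"partition=${ctx.partitionId()} taskAttemptId=${ctx.taskAttemptId()} attempt=${ctx.attemptNumber()}")
    }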

Re: Get attempt number in a closure

2014-10-20 Thread Kay Ousterhout
). On Mon, Oct 20, 2014 at 1:45 PM, Kay Ousterhout k...@eecs.berkeley.edu wrote: Are you guys sure this is a bug? In the task scheduler, we keep two identifiers for each task: the index, which uniquely identifies the computation+partition, and the taskId which is unique across all tasks

Re: setting inputMetrics in HadoopRDD#compute()

2014-07-26 Thread Kay Ousterhout
Reynold you're totally right, as discussed offline -- I didn't think about the limit use case when I wrote this. Sandy, is it easy to fix this as part of your patch to use StatisticsData? If not, I can fix it in a separate patch. On Sat, Jul 26, 2014 at 12:12 PM, Reynold Xin

Re: Resource allocations

2014-07-16 Thread Kay Ousterhout
Hi Karthik, The resourceOffer() method is invoked from a class implementing the SchedulerBackend interface; in the case of a standalone cluster, it's invoked from a CoarseGrainedSchedulerBackend (in the makeOffers() method). If you look in TaskSchedulerImpl.submitTasks(), it calls

CPU/Disk/network performance instrumentation

2014-07-09 Thread Kay Ousterhout
Hi all, I've been doing a bunch of performance measurement of Spark and, as part of doing this, added metrics that record the average CPU utilization, disk throughput and utilization for each block device, and network throughput while each task is running. These metrics are collected by reading
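
The message is cut off, but on Linux this kind of instrumentation typically samples /proc counters. A minimal sketch of the idea (my illustration, not the actual patch) for aggregate CPU utilization from /proc/stat:

    import scala.io.Source

    // Read total and idle jiffies from the aggregate "cpu" line of /proc/stat.
    def cpuCounters(): (Long, Long) = {
      val fields = Source.fromFile("/proc/stat").getLines().next()
        .split("\\s+").drop(1).map(_.toLong)
      (fields.sum, fields(3) + fields(4))  // (total, idle + iowait)
    }

    val (total1, idle1) = cpuCounters()
    Thread.sleep(1000)
    val (total2, idle2) = cpuCounters()
    val utilization = 1.0 - (idle2 - idle1).toDouble / (total2 - total1)
    println(f"CPU utilization over the last second: ${utilization * 100}%.1f%%")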

Re: ExecutorState.LOADING?

2014-07-09 Thread Kay Ousterhout
Git history to the rescue! It seems to have been added by Matei way back in July 2012: https://github.com/apache/spark/commit/5d1a887bed8423bd6c25660910d18d91880e01fe and then was removed a few months later (replaced by RUNNING) by the same Mr. Zaharia:

FYI -- javax.servlet dependency issue workaround

2014-05-27 Thread Kay Ousterhout
Hi all, I had some trouble compiling an application (Shark) against Spark 1.0, where Shark had a runtime exception (at the bottom of this message) because it couldn't find the javax.servlet classes. SBT seemed to have trouble downloading the servlet APIs that are dependencies of Jetty (used by
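
The rest of the message is truncated, but a commonly used SBT-side workaround from that era (not necessarily the exact fix from this email) was to exclude the hard-to-resolve Jetty "orbit" artifact and depend on the plain servlet API jar instead, e.g. in build.sbt:

    // build.sbt sketch -- version numbers are illustrative.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" exclude("org.eclipse.jetty.orbit", "javax.servlet")

    libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1"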

Flaky streaming tests

2014-04-07 Thread Kay Ousterhout
Hi all, The InputStreamsSuite seems to have some serious flakiness issues -- I've seen the file input stream fail many times and now I'm seeing some actor input stream test failures ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull) on what I think is an

Re: Spark 0.9.1 release

2014-03-25 Thread Kay Ousterhout
I don't think the blacklisting is a priority, and CPUS_PER_TASK was still broken after this patch (so broken that I'm convinced no one actually uses this feature!!), so I agree with TD's sentiment that this shouldn't go into 0.9.1. On Tue, Mar 25, 2014 at 10:23 PM, Tathagata Das