Re: Tests failing with GC limit exceeded

2017-01-06 Thread shane knapp
(adding michael armbrust and josh rosen for visibility) ok. roughly 9% of all spark test builds (including both PRB builds) are failing due to GC overhead limits.

$ wc -l SPARK_TEST_BUILDS GC_FAIL
  1350 SPARK_TEST_BUILDS
   125 GC_FAIL

here are the affected builds (over the past ~2 weeks): $

Re: Tests failing with GC limit exceeded

2017-01-06 Thread shane knapp
On Fri, Jan 6, 2017 at 12:20 PM, shane knapp wrote:
> FYI, this is happening across all spark builds... not just the PRB.

s/all/almost all/

Re: Tests failing with GC limit exceeded

2017-01-06 Thread shane knapp
FYI, this is happening across all spark builds... not just the PRB. i'm compiling a report now and will email that out this afternoon. :(

On Thu, Jan 5, 2017 at 9:00 PM, shane knapp wrote:
> unsurprisingly, we had another GC:
>

Re: Tests failing with GC limit exceeded

2017-01-05 Thread shane knapp
unsurprisingly, we had another GC:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70949/console

so, definitely not the system (everything looks hunky dory on the build node).

> It can always be some memory leak; if we increase the memory settings
> and OOMs still happen,
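A note on the leak-vs.-heap-size question in the quote just above: the JVM can capture a heap dump when a forked test JVM dies with an OutOfMemoryError ("GC overhead limit exceeded" is one of its flavors), which makes it possible to see whether old SparkContexts or other test state are being retained. A minimal sketch of how that might be wired in on the sbt side, with a hypothetical dump path (Spark's real build centralizes its test JVM flags, so an actual change would go there):

    // Hypothetical build.sbt fragment (sbt 0.13 syntax). With a forked test
    // JVM, -XX:+HeapDumpOnOutOfMemoryError writes a .hprof file on OOM that
    // can be inspected with jhat or Eclipse MAT to tell a genuine leak from
    // a live set that has simply outgrown the heap.
    fork in Test := true
    javaOptions in Test ++= Seq(
      "-XX:+HeapDumpOnOutOfMemoryError",
      "-XX:HeapDumpPath=/tmp/spark-test-heapdumps"  // hypothetical directory
    )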

Re: Tests failing with GC limit exceeded

2017-01-05 Thread Kay Ousterhout
But is there any non-memory-leak reason why the tests should need more memory? In theory each test should be cleaning up its own SparkContext etc. right? My memory is that OOM issues in the tests in the past have been indicative of memory leaks somewhere. I do agree that it doesn't seem likely
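For reference, the per-test cleanup Kay describes is typically enforced with a ScalaTest fixture so that stop() runs even when a test fails. A minimal sketch of the pattern, with hypothetical trait and app names (Spark's own test code has a similar shared trait for this):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterEach, Suite}

    // Each test gets a fresh SparkContext, and afterEach() stops it even on
    // failure, so contexts (and their caches, executors, and listeners)
    // cannot accumulate across a suite and leak memory.
    trait FreshSparkContext extends BeforeAndAfterEach { self: Suite =>
      @transient var sc: SparkContext = _

      override def beforeEach(): Unit = {
        super.beforeEach()
        sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("test"))
      }

      override def afterEach(): Unit = {
        try {
          if (sc != null) sc.stop()  // release driver-side state
          sc = null
        } finally {
          super.afterEach()
        }
      }
    }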

Re: Tests failing with GC limit exceeded

2017-01-05 Thread Marcelo Vanzin
On Thu, Jan 5, 2017 at 4:58 PM, Kay Ousterhout wrote:
> But is there any non-memory-leak reason why the tests should need more
> memory? In theory each test should be cleaning up its own SparkContext
> etc. right? My memory is that OOM issues in the tests in the past

Re: Tests failing with GC limit exceeded

2017-01-05 Thread Marcelo Vanzin
Seems like the OOM is coming from tests, which most probably means it's not an infrastructure issue. Maybe tests just need more memory these days and we need to update the maven / sbt scripts.

On Thu, Jan 5, 2017 at 1:19 PM, shane knapp wrote:
> as of first thing this morning,
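A sketch of what "update maven / sbt scripts" could look like on the sbt side, assuming the suites have genuinely outgrown the current heap rather than sprung a leak (the values are illustrative, and Spark's build sets these options centrally rather than per-module):

    // Hypothetical build.sbt fragment (sbt 0.13 syntax). Note that "GC
    // overhead limit exceeded" means >98% of time went to GC while less
    // than 2% of the heap was reclaimed, so a larger -Xmx only helps if
    // the tests' live set really has grown; with a leak, the OOMs would
    // just take longer to appear.
    fork in Test := true
    javaOptions in Test ++= Seq(
      "-Xmx4g",                          // illustrative bump
      "-XX:ReservedCodeCacheSize=512m"
    )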

Re: Tests failing with GC limit exceeded

2017-01-05 Thread Kay Ousterhout
Thanks for looking into this Shane!

On Thu, Jan 5, 2017 at 1:19 PM, shane knapp wrote:
> as of first thing this morning, here's the list of recent GC overhead
> build failures:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70891/console
>

Re: Tests failing with GC limit exceeded

2017-01-04 Thread shane knapp
preliminary findings: seems to be transient, and affecting 4% of builds from late december until now (which is as far back as we keep build records for the PRB builds).

  408 builds
   16 builds.gc  <--- failures

it's also happening across all workers at about the same rate. and best of all,

Re: Tests failing with GC limit exceeded

2017-01-03 Thread shane knapp
nope, no changes to jenkins in the past few months. ganglia graphs show higher, but not worrying, memory usage on the workers when the jobs failed... i'll take a closer look later tonite/first thing tomorrow morning.

shane

On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout