Also, take a look at the driver logs -- if there is overhead before the first task is launched, the driver logs would likely reveal this.
On Tue, Apr 8, 2014 at 9:21 AM, Aaron Davidson <ilike...@gmail.com> wrote: > Off the top of my head, the most likely cause would be driver GC issues. > You can diagnose this by enabling GC printing at the driver and you can fix > this by increasing the amount of memory your driver program has (see > http://spark.apache.org/docs/0.9.0/tuning.html#garbage-collection-tuning). > > The "launch time" statistic would also be useful -- if all tasks are > launched at around the same time and complete within 300ms, yet the total > time is 10s, that strongly suggests that the overhead is coming before the > first task is launched. Similarly, it would be useful to know if there was > a large gap between launch times, or if it appeared that the launch times > were serial with respect to the durations. If for some reason Spark started > using only one executor, say, each task would take the same duration but > would be executed one after another. > > > On Tue, Apr 8, 2014 at 8:11 AM, Yana Kadiyska <yana.kadiy...@gmail.com>wrote: > >> Hi Spark users, I'm very much hoping someone can help me out. >> >> I have a strict performance requirement on a particular query. One of >> the stages shows great variance in duration -- from 300ms to 10sec. >> >> The stage is mapPartitionsWithIndex at Operator.scala:210 (running Spark >> 0.8) >> >> I have run the job quite a few times -- the details within the stage >> do not account for the overall duration shown for the stage. What >> could be taking up time that's not showing within the stage breakdown >> UI? Im thinking that reading the data in is reflected in the Duration >> column before, so caching should not be a reason(I'm not caching >> explicitly)? >> >> The details within the stage always show roughly the following (both >> for the 10second and 600ms query -- very little variation, nothing >> over 500ms, ShuffleWrite size is pretty comparable): >> >> StatusLocality LevelExecutor Launch Time DurationGC TimeShuffle Write >> 1864 SUCCESSNODE_LOCAL ####### 301 ms 8 ms 111.0 B >> 1863 SUCCESSNODE_LOCAL ####### 273 ms 102.0 B >> 1862 SUCCESSNODE_LOCAL ####### 245 ms 111.0 B >> 1861 SUCCESSNODE_LOCAL ####### 326 ms 4 ms 102.0 B >> 1860 SUCCESSNODE_LOCAL ####### 217 ms 6 ms 102.0 B >> 1859 SUCCESSNODE_LOCAL ####### 277 ms 111.0 B >> 1858 SUCCESSNODE_LOCAL ####### 262 ms 108.0 B >> 1857 SUCCESSNODE_LOCAL ####### 217 ms 14 ms 112.0 B >> 1856 SUCCESSNODE_LOCAL ####### 208 ms 109.0 B >> 1855 SUCCESSNODE_LOCAL ####### 242 ms 74.0 B >> 1854 SUCCESSNODE_LOCAL ####### 218 ms 3 ms 58.0 B >> 1853 SUCCESSNODE_LOCAL ####### 254 ms 12 ms 102.0 B >> 1852 SUCCESSNODE_LOCAL ####### 274 ms 8 ms 77.0 B >> > >