Re: Slow stage?
Hi Simone,

I'm afraid I don't have an answer to your question. However, I noticed the DAG figures in the attachment. How did you generate these? I am working on a project in which I am trying to generate visual representations of the Spark scheduler DAG. If such a tool already exists, I would greatly appreciate any pointers.

thanks,
--Jakob

On 9 November 2015 at 13:52, Simone Franzini wrote:
> Hi all,
>
> I have a complex Spark job that is broken up into many stages.
> I have a couple of stages that are particularly slow: each task takes
> around 6 - 7 minutes. This stage is fairly complex, as you can see from the
> attached DAG. However, by construction each of the outer joins will have
> only 0 or 1 record on each side.
> It seems to me that this stage is really slow. However, the execution
> timeline shows that almost 100% of the time is spent in actual execution
> time, not in reading/writing to/from disk or in other overheads.
> Does this make any sense? I.e. is it just that these operations are slow
> (and notice that the task size in terms of data seems small)?
> Is the pattern of operations in the DAG good or is it terribly suboptimal?
> If so, how could it be improved?
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
Re: Slow stage?
Those are from the Application Web UI -- look for the "DAG Visualization" and "Event Timeline" elements on Job and Stage pages.

On Wed, Nov 11, 2015 at 10:58 AM, Jakob Odersky wrote:
> Hi Simone,
> I'm afraid I don't have an answer to your question. However I noticed the
> DAG figures in the attachment. How did you generate these? I am myself
> working on a project in which I am trying to generate visual
> representations of the spark scheduler DAG. If such a tool already exists,
> I would greatly appreciate any pointers.
>
> thanks,
> --Jakob
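For anyone who wants a scripted picture of stage dependencies rather than the Web UI view, one possible approach is to emit Graphviz DOT text from a stage-to-parents mapping. The sketch below does not use any Spark API: the function name `stages_to_dot`, the `stage_parents` dictionary, and the stage labels are all hypothetical, hand-written stand-ins for whatever dependency information you collect yourself.

```python
def stages_to_dot(stage_parents):
    """Render a {stage: [parent stages]} mapping as Graphviz DOT text.

    The mapping is a hypothetical, hand-built structure; it is not
    something Spark produces in this form.
    """
    lines = ["digraph stages {", "  rankdir=LR;"]
    for stage in sorted(stage_parents):
        # Declare the node so stages without parents still appear.
        lines.append(f'  "{stage}";')
        for parent in sorted(stage_parents[stage]):
            # Edge direction: parent stage feeds the child stage.
            lines.append(f'  "{parent}" -> "{stage}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical example: two independent stages feeding a join stage.
stage_parents = {
    "stage 0": [],
    "stage 1": [],
    "stage 2": ["stage 0", "stage 1"],
}

dot_text = stages_to_dot(stage_parents)
print(dot_text)
```

The printed text can be saved to a file and rendered with Graphviz, for example with `dot -Tpng`.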
Re: Slow stage?
I am a person who usually hates UIs, and I have to say I love these. Very useful.

On Wed, Nov 11, 2015 at 3:23 PM, Mark Hamstra wrote:
> Those are from the Application Web UI -- look for the "DAG Visualization"
> and "Event Timeline" elements on Job and Stage pages.