Re: Slow stage?

2015-11-11 Thread Jakob Odersky
Hi Simone,
I'm afraid I don't have an answer to your question. However, I noticed the
DAG figures in the attachment. How did you generate them? I am myself
working on a project to generate visual representations of the Spark
scheduler DAG. If such a tool already exists, I would greatly appreciate
any pointers.

thanks,
--Jakob
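
For anyone building such a tool: Spark exposes the scheduler's stage graph
through the SparkListener developer API, so a visualizer can subscribe to
job and stage events instead of scraping logs. A minimal sketch, assuming
the RDD-era listener interface; the class name and println output here are
illustrative only, not an existing tool:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted}

// Records enough scheduler events to reconstruct the stage DAG of each job.
class DagCaptureListener extends SparkListener {

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // Each StageInfo names its parent stages, which gives the DAG edges.
    jobStart.stageInfos.foreach { stage =>
      println(s"stage ${stage.stageId} '${stage.name}' <- parents [${stage.parentIds.mkString(", ")}]")
    }
  }

  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    println(s"stage ${event.stageInfo.stageId} completed")
  }
}

// Register with a running context: sc.addSparkListener(new DagCaptureListener())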

On 9 November 2015 at 13:52, Simone Franzini  wrote:

> Hi all,
>
> I have a complex Spark job that is broken up into many stages.
> A couple of stages are particularly slow: each task takes around 6-7
> minutes. One such stage is fairly complex, as you can see from the
> attached DAG. However, by construction each of the outer joins will have
> only 0 or 1 record on each side.
> This stage seems really slow to me, yet the execution timeline shows that
> almost 100% of the time is spent in actual execution, not in
> reading/writing to/from disk or in other overheads.
> Does this make any sense? That is, is it just that these operations are
> slow (note that task size in terms of data seems small)?
> Is the pattern of operations in the DAG good, or is it terribly
> suboptimal? If so, how could it be improved?
>
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
>
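
On Simone's last question: when several pair RDDs share the same key and,
as here, each side holds at most one record per key, a chain of
fullOuterJoins can often be collapsed into a single cogroup, which groups
all inputs by key in one pass rather than one join at a time. A sketch of
the idea against the RDD API; the RDD names and value types are made up
for illustration, not taken from the job in the thread:

import org.apache.spark.{SparkConf, SparkContext}

object JoinPatternSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-pattern").setMaster("local[*]"))

    // Three keyed datasets; by construction each key appears at most once per RDD.
    val a = sc.parallelize(Seq(1 -> "a1", 2 -> "a2"))
    val b = sc.parallelize(Seq(2 -> "b2", 3 -> "b3"))
    val c = sc.parallelize(Seq(1 -> "c1", 3 -> "c3"))

    // The chained form: every fullOuterJoin is its own step, and the
    // tuple type deepens with each join.
    val chained = a.fullOuterJoin(b).fullOuterJoin(c)

    // The cogroup form: all three inputs are grouped by key at once.
    // With at most one record per key per side, headOption recovers the
    // same Option-shaped values the outer joins would produce.
    val combined = a.cogroup(b, c).mapValues {
      case (as, bs, cs) => (as.headOption, bs.headOption, cs.headOption)
    }

    combined.collect().foreach(println)
    sc.stop()
  }
}

Whether this helps in practice depends on what the DAG shows: if the
joined RDDs already share a partitioner, Spark avoids re-shuffling them,
and the chained form may cost little more than the cogroup.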


Re: Slow stage?

2015-11-11 Thread Mark Hamstra
Those are from the Application Web UI -- look for the "DAG Visualization"
and "Event Timeline" elements on Job and Stage pages.

On Wed, Nov 11, 2015 at 10:58 AM, Jakob Odersky  wrote:

> Hi Simone,
> I'm afraid I don't have an answer to your question. However, I noticed the
> DAG figures in the attachment. How did you generate them? I am myself
> working on a project to generate visual representations of the Spark
> scheduler DAG. If such a tool already exists, I would greatly appreciate
> any pointers.
>
> thanks,
> --Jakob


Re: Slow stage?

2015-11-11 Thread Koert Kuipers
I am someone who usually hates UIs, and I have to say I love these. Very
useful.

On Wed, Nov 11, 2015 at 3:23 PM, Mark Hamstra 
wrote:

> Those are from the Application Web UI -- look for the "DAG Visualization"
> and "Event Timeline" elements on Job and Stage pages.