Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
I think it does, because users don't see their application logic and flow
the way Spark internals do. Of course we follow general guidelines for
performance, but we shouldn't really have to care how exactly Spark decides
to execute the DAG; the Spark scheduler or core can keep changing over time
to optimize that. Optimizing from the user's perspective means looking at
which transformations they are using and what they are doing inside those
transformations. If users had some transparency from the framework on how
those transformations are using resources over time, or where they are
failing, we could optimize better. That way we stay focused on our
application logic rather than on what the framework is doing underneath.

About a solution: doesn't the Spark driver (SparkContext + event listener)
already have knowledge of every job, task set and task and their current
state? The Spark UI can relate a job to its stages and tasks, so why not a
stage to its transformations?
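
For illustration, a minimal sketch of what that listener machinery already
exposes to an application today; it assumes a Scala job with an existing
SparkContext `sc`, and the listener class name is made up:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    // Hypothetical listener that reports each stage's outcome as the driver sees it.
    class StageVisibilityListener extends SparkListener {
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        val info = event.stageInfo
        info.failureReason match {
          case Some(reason) => println(s"Stage ${info.stageId} '${info.name}' FAILED: $reason")
          case None         => println(s"Stage ${info.stageId} '${info.name}' completed OK")
        }
      }
      override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
        println(s"Task ${event.taskInfo.taskId} of stage ${event.stageId} ended with: ${event.reason}")
      }
    }

    sc.addSparkListener(new StageVisibilityListener)

What this cannot do, per the discussion below, is tell you which
transformation inside a pipelined stage a failing task was executing.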

Again, my real point is to assess this as a requirement from the users' and
stakeholders' perspective, regardless of the technical challenge.

Thanks
Nirav


Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
But when you talk about optimizing the DAG, it really doesn't make sense to
also talk about transformation steps as separate entities. The DAGScheduler
knows about Jobs, Stages, TaskSets and Tasks. The TaskScheduler knows about
TaskSets and Tasks. Neither of them understands the transformation steps
that you used to define your RDD -- at least not as separable, distinct
steps. To give the kind of transformation-step-oriented information that you
want would require parts of Spark that don't currently concern themselves at
all with RDD transformation steps to start tracking them and how they map to
Jobs, Stages, TaskSets and Tasks -- and when you start talking about Datasets
and Spark SQL, you then need to start talking about tracking and mapping
concepts like Plans, Schemas and Queries. It would introduce significant
new complexity.
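
For context, the closest user-facing hooks that exist today are job groups,
job descriptions and RDD names, which let the UI relate jobs and cached RDDs
back to application-level steps, though not to individual transformations.
A rough sketch, assuming an existing SparkContext `sc`; the path, group id
and names are made up:

    // Label a group of jobs and name a cached RDD so the Jobs and Storage
    // pages can be tied back to an application-level step.
    sc.setJobGroup("phase-1", "parse and aggregate click logs")
    val parsed = sc.textFile("hdfs:///tmp/clicks.txt")
      .map(line => (line.split(",")(0), 1))
    parsed.setName("parsed-clicks")     // name shown for this RDD once it is cached
    parsed.cache()
    parsed.reduceByKey(_ + _).collect() // this job is reported under the "phase-1" group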


Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
Hi Mark,

I might have said stage instead of step in my last statement: "UI just says
Collect failed but in fact it could be any stage in that lazy chain of
evaluation."

Anyway, even you agree that this visibility into the underlying steps won't
be available, which does pose difficulties for troubleshooting as well as
for optimization at the step level. I think users will have a hard time
without it. It's great that the Spark community is working on different
levels of internal optimization, but it's also important to give users
enough visibility to debug issues and resolve bottlenecks.

There is also no visibility into how Spark uses shuffle memory space vs.
user memory space vs. cache space. That's a separate topic, though. If
everything worked magically as a black box it would be fine, but when you
have a large number of people on this list complaining about OOM and shuffle
errors all the time, you need to start providing some transparency to
address that.
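
For reference, a rough sketch of the knobs that currently partition executor
memory between those spaces under the unified memory manager (Spark 1.6+);
the values here are purely illustrative, not recommendations:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("memory-settings-sketch")
      .set("spark.memory.fraction", "0.6")        // heap share for execution (shuffle) + storage (cache)
      .set("spark.memory.storageFraction", "0.5") // part of that share protected for cached blocks
    // Whatever lies outside spark.memory.fraction is left as "user memory"
    // for objects created inside your transformations.

These settings only control the split; they don't by themselves add runtime
visibility into how each space is actually being used.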

Thanks


Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
You appear to be misunderstanding the nature of a Stage.  Individual
transformation steps such as `map` do not define the boundaries of Stages.
Rather, a sequence of transformations in which there is only a
NarrowDependency between each of the transformations will be pipelined into
a single Stage.  It is only when there is a ShuffleDependency that a new
Stage will be defined -- i.e. shuffle boundaries define Stage boundaries.
With whole stage code gen in Spark 2.0, there will be even less opportunity
to treat individual transformations within a sequence of narrow
dependencies as though they were discrete, separable entities.  The Failed
Stages portion of the Web UI will tell you which Stage in a Job failed, and
the accompanying error log message will generally also give you some idea
of which Tasks failed and why.  Tracing the error back further and at a
different level of abstraction to lay blame on a particular transformation
wouldn't be particularly easy.
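
For illustration, a rough word-count sketch (the input path is made up): the
narrow transformations pipeline into one stage together with the shuffle
write, the ShuffleDependency introduced by reduceByKey starts the next
stage, and collect() is the action that finally triggers the job:

    val counts = sc.textFile("hdfs:///tmp/input.txt") // stage 0 begins here
      .flatMap(_.split("\\s+"))                       // narrow dependency: pipelined into stage 0
      .map(word => (word, 1))                         // narrow dependency: still stage 0
      .reduceByKey(_ + _)                             // shuffle boundary: stage 1 starts here
      .collect()                                      // action: nothing runs until this call

If a task throws inside flatMap or map, the UI reports the failure at the
stage level; it does not single out which of the pipelined steps was at
fault.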


Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
It's great that the Spark scheduler does optimized DAG processing and only
does lazy evaluation when some action is performed or a shuffle dependency
is encountered. Sometimes it goes even further past the shuffle dependency
before executing anything; e.g., if there are map steps after the shuffle,
it doesn't stop at the shuffle to execute anything but carries on to those
next map steps until it finds a reason (a Spark action) to execute. As a
result, the stage that Spark is running can internally be a series of
(map -> shuffle -> map -> map -> collect), and the Spark UI just shows that
it is currently running the 'collect' stage. So if the job fails at that
point, the Spark UI just says Collect failed, but in fact it could be any
stage in that lazy chain of evaluation. Looking at executor logs gives some
insight, but that's not always straightforward.
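
For illustration, a small assumed example of that behaviour: an exception
thrown inside the first map only surfaces when collect() forces evaluation,
so the failure is reported against the stage that runs collect:

    val rdd = sc.parallelize(1 to 10)
      .map(i => if (i == 7) throw new RuntimeException("bad record") else i) // lazy, nothing runs yet
      .map(_ * 2)                                                            // still lazy
    try {
      rdd.collect() // the whole pipelined stage executes here and fails
    } catch {
      case e: org.apache.spark.SparkException =>
        println(s"Job failed: ${e.getMessage}") // points at the collect stage, not the map that threw
    }
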
Correct me if I am wrong here, but I think we need more visibility into
what's happening underneath so we can easily troubleshoot as well as
optimize our DAG.

Thanks
