Hi Sean,
Thanks for replying, but my question was about multiple stages running
the same line of code, not about multiple stages in general. Yes, a
single job can have multiple stages, but as far as I know they should
not be repeated if you're caching/persisting your intermediate outputs.

My question is: why am I seeing multiple stages running the same line
of code? As I understand it, a stage is a group of operations that can
be executed without shuffling data or invoking a new action; stages are
divided into tasks, and tasks are what run in parallel and can execute
the same line of code on different executors. Or is this assumption
wrong?
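
To be concrete about my mental model, here's a toy sketch (plain Scala,
not Spark APIs; the op names are just illustrative) of how I believe
stage boundaries get counted along a single lineage: every wide (shuffle)
dependency closes one stage and opens the next, so even one action can
fan out into shuffles + 1 stages, and the UI labels all of them with the
action's call site.

```scala
// Toy model of Spark stage boundaries, not real Spark code.
object StageModel {
  sealed trait Op { def name: String }
  case class Narrow(name: String) extends Op // e.g. map, filter
  case class Wide(name: String) extends Op   // e.g. groupByKey, join

  // Each wide dependency ends the current stage and starts a new one,
  // so along one lineage: stages = wide ops + 1.
  def countStages(lineage: Seq[Op]): Int =
    lineage.count(_.isInstanceOf[Wide]) + 1
}

// A single count() over this lineage would run as three stages, all
// reported against the same call site in the UI.
val lineage = Seq(
  StageModel.Narrow("map"),
  StageModel.Wide("groupByKey"),
  StageModel.Narrow("filter"),
  StageModel.Wide("join"),
  StageModel.Narrow("map")
)
println(StageModel.countStages(lineage)) // 3
```

That is the model I'm working from; what I can't square with it is
seeing over 100 such stages for one action.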
Thanks,

Joe


On Thu, 2022-04-21 at 09:14 -0500, Sean Owen wrote:
> A job can have multiple stages for sure. One action triggers a job.
> This seems normal. 
> 
> On Thu, Apr 21, 2022, 9:10 AM Joe <j...@net2020.org> wrote:
> > Hi,
> > When looking at the application UI (in Amazon EMR) I'm seeing one
> > job for my particular line of code, for example:
> > 64 Running count at MySparkJob.scala:540
> > 
> > When I click into the job and go to stages, I can see over 100
> > stages running the same line of code (stages are active, pending
> > or completed):
> > 190 Pending count at MySparkJob.scala:540
> > ...
> > 162 Active count at MySparkJob.scala:540
> > ...
> > 108 Completed count at MySparkJob.scala:540
> > ...
> > 
> > I'm not sure what that means. I thought that a stage was a logical
> > operation boundary and you could have only one stage in the job
> > (unless you executed the same dataset+action many times on
> > purpose), and tasks were what got replicated across partitions.
> > But here I'm seeing many stages running, each with the same line
> > of code?
> > 
> > I don't have a situation where my code is re-processing the same
> > set of data many times; all intermediate sets are persisted.
> > I'm not sure if the EMR UI display is wrong or if Spark stages are
> > not what I thought they were?
> > Thanks,
> > 
> > Joe
> > 
> > 
> > 



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
