The line of code triggers a job, and the job in turn runs as a set of stages. If you open the individual stages, you should see that they are different operations, all supporting execution of the action on that line.
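As a rough illustration of one action fanning out into several stages, here is a minimal sketch. Everything in it (the object name StageDemo, the local[2] master, the toy data) is made up for the example and has nothing to do with the original MySparkJob.scala. It assumes Spark is on the classpath. The single count() at the end should show up as one job, and because the lineage contains two shuffles, that job should be split into three stages.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names and data are illustrative only.
object StageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-demo")
      .master("local[2]") // assumption: local run, not EMR
      .getOrCreate()
    val sc = spark.sparkContext

    val histogram = sc.parallelize(1 to 1000, 4)
      .map(n => (n % 10, 1))
      .reduceByKey(_ + _)                    // first shuffle: closes one stage
      .map { case (_, total) => (total, 1) } // narrow, stays in the next stage
      .reduceByKey(_ + _)                    // second shuffle: closes another stage

    // Prints the lineage; each indentation step marks a shuffle boundary.
    println(histogram.toDebugString)

    // The only action: one job, run as several stages.
    println(histogram.count())

    spark.stop()
  }
}
```

While this runs, the Jobs tab of the Spark UI should list a single job for the count() call, and clicking into it should list the stages behind it. The indentation in the toDebugString output gives the same picture without the UI.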
On Thu, Apr 21, 2022 at 9:24 AM Joe <j...@net2020.org> wrote:

> Hi Sean,
> Thanks for replying but my question was about multiple stages running
> the same line of code, not about multiple stages in general. Yes single
> job can have multiple stages, but they should not be repeated, as far
> as I know, if you're caching/persisting your intermediate outputs.
>
> My question is why am I seeing multiple stages running the same line of
> code? As I understand it stage is a grouping of operations that can be
> executed without shuffling data or invoking a new action and they are
> divided into tasks, and tasks are the ones that are executed in
> parallel and can have the same line of code running on different
> executors. Or is this assumption wrong?
> Thanks,
>
> Joe
>
>
> On Thu, 2022-04-21 at 09:14 -0500, Sean Owen wrote:
> > A job can have multiple stages for sure. One action triggers a job.
> > This seems normal.
> >
> > On Thu, Apr 21, 2022, 9:10 AM Joe <j...@net2020.org> wrote:
> > > Hi,
> > > When looking at application UI (in Amazon EMR) I'm seeing one job for
> > > my particular line of code, for example:
> > > 64 Running count at MySparkJob.scala:540
> > >
> > > When I click into the job and go to stages I can see over a 100 stages
> > > running the same line of code (stages are active, pending or
> > > completed):
> > > 190 Pending count at MySparkJob.scala:540
> > > ...
> > > 162 Active count at MySparkJob.scala:540
> > > ...
> > > 108 Completed count at MySparkJob.scala:540
> > > ...
> > >
> > > I'm not sure what that means, I thought that stage was a logical
> > > operation boundary and you could have only one stage in the job (unless
> > > you executed the same dataset+action many times on purpose) and tasks
> > > were the ones that were replicated across partitions. But here I'm
> > > seeing many stages running, each with the same line of code?
> > >
> > > I don't have a situation where my code is re-processing the same set of
> > > data many times, all intermediate sets are persisted.
> > > I'm not sure if EMR UI display is wrong or if spark stages are not what
> > > I thought they were?
> > > Thanks,
> > >
> > > Joe