Hi, for some tasks such as repartitionByRange, it is indeed quite annoying sometimes to wait for all the map tasks to complete before the reduce starts.
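To make the cost of that stage barrier concrete, here is a minimal toy simulation (not Spark code; the task durations and variable names are invented for illustration). It compares the current behavior, where the reduce stage starts only after the slowest map task finishes, with a hypothetical pipelined strategy where a reduce task consumes each map output as soon as it is produced:

```python
# Toy timeline comparing stage-barrier vs. pipelined scheduling.
# Purely illustrative; not how Spark's scheduler is implemented.

map_finish_times = [4, 7, 10]   # seconds at which each map task completes
reduce_work_per_input = 1       # seconds the reduce spends per map output

# Current Spark behavior: the reduce stage starts only after the stage
# barrier, i.e. after the slowest map task has finished.
barrier_start = max(map_finish_times)
barrier_done = barrier_start + reduce_work_per_input * len(map_finish_times)

# Hypothetical pipelined strategy: the reduce task pre-fetches and
# processes each map output as soon as it is available, overlapping
# with the still-running map tasks.
done = 0.0
for t in sorted(map_finish_times):
    done = max(done, t) + reduce_work_per_input
pipelined_done = done

print(barrier_done, pipelined_done)  # 13 11.0
```

The gap grows with map-task skew: the more the slowest map straggles, the more reduce work the pipelined strategy can hide behind it.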
@Sean Owen <sro...@gmail.com> do you have any comments?

Regards,
Gourav Sengupta

On Thu, Sep 8, 2022 at 12:10 AM Russell Jurney <russell.jur...@gmail.com> wrote:

> I could be wrong, but… just start it. It takes a lot of time on large
> datasets to reduce the entire dataset. If you have the resources, start
> combining and reducing on partial map results. As soon as you've got one
> record out of the map, it has a reduce key in the plan, so send it to
> that reducer. You can't finish the reduce until you're done with the map,
> but you can start it immediately. This depends on the reducers being
> algebraic, of course, and learning to think in MapReduce isn't even
> possible for a lot of people. Some people say it is impossible to do it
> well, but I disagree :)
>
> On Wed, Sep 7, 2022 at 3:51 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Wait, how do you start reduce tasks before the maps are finished? Is the
>> idea that some reduce tasks don't depend on all the maps, or at least
>> can get started? You can already execute unrelated DAGs in parallel, of
>> course.
>>
>> On Wed, Sep 7, 2022 at 5:49 PM Sungwoo Park <glap...@gmail.com> wrote:
>>
>>> You are right -- Spark can't do this with its current architecture. My
>>> question was: if there were a new implementation supporting pipelined
>>> execution, what kind of Spark jobs would benefit (a lot) from it?
>>>
>>> Thanks,
>>>
>>> --- Sungwoo
>>>
>>> On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney <russell.jur...@gmail.com>
>>> wrote:
>>>
>>>> I don't think Spark can do this with its current architecture. It has
>>>> to wait for the stage to be done; speculative execution isn't
>>>> possible. Others probably know more about why that is.
>>>>
>>>> Thanks,
>>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney>
>>>> FB <http://facebook.com/jurney> datasyndrome.com
>>>>
>>>> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park <glap...@gmail.com> wrote:
>>>>
>>>>> Hello Spark users,
>>>>>
>>>>> I have a question on the architecture of Spark (which could lead to a
>>>>> research problem). In its current implementation, Spark finishes
>>>>> executing all the tasks in a stage before proceeding to child stages.
>>>>> For example, given a two-stage map-reduce DAG, Spark finishes
>>>>> executing all the map tasks before scheduling reduce tasks.
>>>>>
>>>>> We can think of another 'pipelined execution' strategy in which tasks
>>>>> in child stages can be scheduled and executed concurrently with tasks
>>>>> in parent stages. For example, for the two-stage map-reduce DAG,
>>>>> while map tasks are being executed, we could schedule and execute
>>>>> reduce tasks in advance if the cluster has enough resources. These
>>>>> reduce tasks could also pre-fetch the output of map tasks.
>>>>>
>>>>> Has anyone seen Spark jobs for which this 'pipelined execution'
>>>>> strategy would be desirable while the current implementation is not
>>>>> quite adequate? Since Spark tasks usually run for a short period of
>>>>> time, I guess the new strategy would not bring a major performance
>>>>> improvement. However, there might be some category of Spark jobs for
>>>>> which this new strategy would clearly be a better choice.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --- Sungwoo
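Russell's suggestion above — start combining partial map results as soon as records arrive, provided the reduce operation is algebraic — can be sketched in a few lines of plain Python (this is a conceptual sketch, not Spark code; `incremental_reduce` is a hypothetical name):

```python
# Hedged sketch: incremental, key-routed reduction over map output that is
# still being produced. Correct only if `combine` is associative and
# commutative, since records reach the reducer in arbitrary order.
from collections import defaultdict
from operator import add

def incremental_reduce(map_output, combine=add):
    # Running totals start at 0, the identity of `add`; a different
    # `combine` would need its own identity element.
    running = defaultdict(int)
    for key, value in map_output:  # records can arrive before the map stage "finishes"
        running[key] = combine(running[key], value)
    return dict(running)

# Word-count-style map output, interleaved as if from unfinished map tasks:
stream = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]
print(incremental_reduce(stream))  # {'a': 3, 'b': 1, 'c': 1}
```

The algebraic requirement is exactly why this works for sums and counts but not for order-sensitive reductions: the final aggregate is the same no matter how the partial map outputs are interleaved.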