Oops, it has been a while since Russell labored on Hadoop: speculative execution isn't the right term; that is something else entirely. Cascading has a declarative interface, so you can plan more up front, whereas Spark is more imperative. The point remains :)
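To make the declarative-vs-imperative contrast concrete, here is a rough sketch using Spark's own two APIs as an analogy (this is not Cascading code; the input path and column names are invented):

import org.apache.spark.sql.SparkSession

object DeclarativeVsImperative {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("demo").getOrCreate()
    import spark.implicits._

    // Hypothetical input: events with "user" and "status" string columns.
    val df = spark.read.json("events.json")

    // Declarative: we state what we want; the Catalyst planner sees the
    // whole query and can reorder, prune, and optimize before running it.
    val declarative = df.filter($"status" === "ok").groupBy($"user").count()

    // Imperative: each RDD transformation is an opaque function; Spark
    // runs the steps as written and cannot plan across them.
    val imperative = df.rdd
      .filter(row => row.getAs[String]("status") == "ok")
      .map(row => (row.getAs[String]("user"), 1L))
      .reduceByKey(_ + _)

    declarative.show()
    imperative.take(10).foreach(println)
    spark.stop()
  }
}

Cascading's planner gets the same kind of whole-flow view that Catalyst gets here, which is what lets it plan more.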
On Wed, Sep 7, 2022 at 3:56 PM Russell Jurney <russell.jur...@gmail.com> wrote:

> You want to talk to Chris Wensel, creator of Cascading, a system that did
> speculative execution for a large volume of enterprise workloads. It was
> the first approachable way to scale workloads using Hadoop. He could write
> a book about this topic. Happy to introduce you if you'd like, or you
> could ask on the Cascading user group.
>
> https://cascading.wensel.net/
>
> On Wed, Sep 7, 2022 at 3:49 PM Sungwoo Park <glap...@gmail.com> wrote:
>
>> You are right -- Spark can't do this with its current architecture. My
>> question was: if there were a new implementation supporting pipelined
>> execution, what kind of Spark jobs would benefit (a lot) from it?
>>
>> Thanks,
>>
>> --- Sungwoo
>>
>> On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney <russell.jur...@gmail.com> wrote:
>>
>>> I don't think Spark can do this with its current architecture. It has
>>> to wait for the stage to be done; speculative execution isn't possible.
>>> Others probably know more about why that is.
>>>
>>> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park <glap...@gmail.com> wrote:
>>>
>>>> Hello Spark users,
>>>>
>>>> I have a question on the architecture of Spark (which could lead to a
>>>> research problem). In its current implementation, Spark finishes
>>>> executing all the tasks in a stage before proceeding to child stages.
>>>> For example, given a two-stage map-reduce DAG, Spark finishes
>>>> executing all the map tasks before scheduling reduce tasks.
>>>>
>>>> We can think of another 'pipelined execution' strategy in which tasks
>>>> in child stages can be scheduled and executed concurrently with tasks
>>>> in parent stages. For example, for the two-stage map-reduce DAG, while
>>>> map tasks are being executed, we could schedule and execute reduce
>>>> tasks in advance if the cluster has enough resources. These reduce
>>>> tasks could also pre-fetch the output of map tasks.
>>>>
>>>> Has anyone seen Spark jobs for which this 'pipelined execution'
>>>> strategy would be desirable while the current implementation is not
>>>> quite adequate? Since Spark tasks usually run for a short period of
>>>> time, I suspect the new strategy would not bring a major performance
>>>> improvement in general. However, there might be some category of
>>>> Spark jobs for which this new strategy would clearly be a better
>>>> choice.
>>>>
>>>> Thanks,
>>>>
>>>> --- Sungwoo

--
Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney>
datasyndrome.com
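P.S. For concreteness, a minimal sketch of the two-stage map-reduce job Sungwoo describes (the input path is invented). The shuffle that reduceByKey requires ends the first stage; the DAGScheduler submits the reduce stage only after every map task has finished writing its shuffle output, which is exactly the barrier a pipelined scheduler would relax.

import org.apache.spark.sql.SparkSession

object StageBarrierDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stage-barrier").getOrCreate()
    val sc = spark.sparkContext

    // Stage 1 (map side): each task tokenizes its own partition.
    val pairs = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))

    // reduceByKey forces a shuffle, ending stage 1. Stage 2 (reduce side)
    // starts only after all stage-1 tasks complete; a pipelined scheduler
    // could instead launch reduce tasks early and let them pre-fetch map
    // output as it becomes available.
    val counts = pairs.reduceByKey(_ + _)

    counts.take(20).foreach(println)
    spark.stop()
  }
}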