Oops, it has been a while since Russell labored on Hadoop: speculative execution isn't the right term; that is something else entirely. Cascading has a declarative interface, so you can plan more up front, whereas Spark is more imperative. The point remains :)
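To make the declarative-vs-imperative contrast concrete, here is a rough sketch using Spark's own two APIs as an analogy (this is not Cascading code; the input path and column names are invented):

import org.apache.spark.sql.SparkSession

object DeclarativeVsImperative {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("demo").getOrCreate()
    import spark.implicits._

    // Hypothetical input: events with "user" and "status" string columns.
    val df = spark.read.json("events.json")

    // Declarative: we state what we want; the Catalyst planner sees the
    // whole query and can reorder, prune, and optimize before running it.
    val declarative = df.filter($"status" === "ok").groupBy($"user").count()

    // Imperative: each RDD transformation is an opaque function; Spark
    // runs the steps as written and cannot plan across them.
    val imperative = df.rdd
      .filter(row => row.getAs[String]("status") == "ok")
      .map(row => (row.getAs[String]("user"), 1L))
      .reduceByKey(_ + _)

    declarative.show()
    imperative.take(10).foreach(println)
    spark.stop()
  }
}

Cascading's planner gets the same kind of whole-flow view that Catalyst gets here, which is what lets it plan more.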
On Wed, Sep 7, 2022 at 3:56 PM Russell Jurney <russell.jur...@gmail.com> wrote:

> You want to talk to Chris Wensel, creator of Cascading, a system that did
> speculative execution for a large volume of enterprise workloads. It was
> the first approachable way to scale workloads using Hadoop. He could write
> a book about this topic. Happy to introduce you if you'd like, or you
> could ask on the Cascading user group.
>
> https://cascading.wensel.net/
>
> On Wed, Sep 7, 2022 at 3:49 PM Sungwoo Park <glap...@gmail.com> wrote:
>
>> You are right -- Spark can't do this with its current architecture. My
>> question was: if there were a new implementation supporting pipelined
>> execution, what kind of Spark jobs would benefit (a lot) from it?
>>
>> Thanks,
>>
>> --- Sungwoo
>>
>> On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney <russell.jur...@gmail.com> wrote:
>>
>>> I don't think Spark can do this with its current architecture. It has
>>> to wait for the stage to be done; speculative execution isn't possible.
>>> Others probably know more about why that is.
>>>
>>> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park <glap...@gmail.com> wrote:
>>>
>>>> Hello Spark users,
>>>>
>>>> I have a question on the architecture of Spark (which could lead to a
>>>> research problem). In its current implementation, Spark finishes
>>>> executing all the tasks in a stage before proceeding to child stages.
>>>> For example, given a two-stage map-reduce DAG, Spark finishes
>>>> executing all the map tasks before scheduling reduce tasks.
>>>>
>>>> We can think of another 'pipelined execution' strategy in which tasks
>>>> in child stages can be scheduled and executed concurrently with tasks
>>>> in parent stages. For example, for the two-stage map-reduce DAG, while
>>>> map tasks are being executed, we could schedule and execute reduce
>>>> tasks in advance if the cluster has enough resources. These reduce
>>>> tasks could also pre-fetch the output of map tasks.
>>>>
>>>> Has anyone seen Spark jobs for which this 'pipelined execution'
>>>> strategy would be desirable while the current implementation is not
>>>> quite adequate? Since Spark tasks usually run for a short period of
>>>> time, I suspect the new strategy would not bring a major performance
>>>> improvement in general. However, there might be some category of
>>>> Spark jobs for which this new strategy would clearly be a better
>>>> choice.
>>>>
>>>> Thanks,
>>>>
>>>> --- Sungwoo

--
Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney>
datasyndrome.com
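P.S. For concreteness, a minimal sketch of the two-stage map-reduce job Sungwoo describes (the input path is invented). The shuffle that reduceByKey requires ends the first stage; the DAGScheduler submits the reduce stage only after every map task has finished writing its shuffle output, which is exactly the barrier a pipelined scheduler would relax.

import org.apache.spark.sql.SparkSession

object StageBarrierDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stage-barrier").getOrCreate()
    val sc = spark.sparkContext

    // Stage 1 (map side): each task tokenizes its own partition.
    val pairs = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))

    // reduceByKey forces a shuffle, ending stage 1. Stage 2 (reduce side)
    // starts only after all stage-1 tasks complete; a pipelined scheduler
    // could instead launch reduce tasks early and let them pre-fetch map
    // output as it becomes available.
    val counts = pairs.reduceByKey(_ + _)

    counts.take(20).foreach(println)
    spark.stop()
  }
}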