Hi,

For some operations such as repartitionByRange, it is indeed quite
annoying to have to wait for all the map tasks to complete before the
reduce side starts.
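
As a rough illustration (a minimal sketch; the dataset size, column name,
and output path below are made up), the range repartition forces a full
shuffle, so the post-shuffle stage cannot start until every map-side task
has finished:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("repartition-demo").getOrCreate()
    import spark.implicits._

    // Any large DataFrame behaves the same way; this one is synthetic.
    val df = spark.range(0, 100000000L).toDF("id")

    // repartitionByRange samples the column, computes range boundaries,
    // and then shuffles; downstream tasks wait on all of the map output.
    val ranged = df.repartitionByRange(200, $"id")
    ranged.write.mode("overwrite").parquet("/tmp/ranged")  // illustrative path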

@Sean Owen <sro...@gmail.com>, do you have any comments?

Regards,
Gourav Sengupta

On Thu, Sep 8, 2022 at 12:10 AM Russell Jurney <russell.jur...@gmail.com>
wrote:

> I could be wrong, but… just start it. Reducing the entire dataset takes a
> lot of time on large data, so if you have the resources, start combining
> and reducing on partial map results. As soon as you've got one record out
> of the map, it has a reduce key in the plan, so send it to that reducer.
> You can't finish the reduce until you're done with the map, but you can
> start it immediately. This depends on reducers being algebraic, of course,
> and learning to think in MapReduce isn't even possible for a lot of
> people. Some people say it is impossible to do well, but I disagree :)
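>
> A rough sketch of what "algebraic" means here (my example, not anything
> Spark does today): if the reduce function is associative and commutative,
> partial results can be folded in as map output arrives and merged later:
>
>     // Associative and commutative, so reduction order doesn't matter:
>     //   combine(combine(a, b), c) == combine(a, combine(b, c))
>     def combine(a: Long, b: Long): Long = a + b
>
>     // Fold in whatever map output has arrived so far...
>     val partial1 = Seq(1L, 2L, 3L).reduce(combine)  // early map tasks
>     val partial2 = Seq(4L, 5L).reduce(combine)      // late map tasks
>
>     // ...then merging the partials equals reducing everything at once:
>     assert(combine(partial1, partial2) == Seq(1L, 2L, 3L, 4L, 5L).reduce(combine))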
>
> On Wed, Sep 7, 2022 at 3:51 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Wait, how do you start reduce tasks before the maps are finished? Is the
>> idea that some reduce tasks don't depend on all of the maps, or at least
>> that they can get started early?
>> You can already execute unrelated DAGs in parallel of course.
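>>
>> (For the record, a minimal sketch of that, assuming a `spark` session is
>> in scope as in spark-shell: two actions with no dependency on each other,
>> submitted from separate threads, run their DAGs concurrently. The dataset
>> sizes are arbitrary.)
>>
>>     import scala.concurrent.{Await, Future}
>>     import scala.concurrent.ExecutionContext.Implicits.global
>>     import scala.concurrent.duration.Duration
>>
>>     // Independent actions submitted from different threads run in parallel.
>>     val jobA = Future { spark.range(1000000L).selectExpr("sum(id)").collect() }
>>     val jobB = Future { spark.range(2000000L).selectExpr("count(*)").collect() }
>>     Await.result(Future.sequence(Seq(jobA, jobB)), Duration.Inf)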
>>
>> On Wed, Sep 7, 2022 at 5:49 PM Sungwoo Park <glap...@gmail.com> wrote:
>>
>>> You are right -- Spark can't do this with its current architecture. My
>>> question was: if there was a new implementation supporting pipelined
>>> execution, what kind of Spark jobs would benefit (a lot) from it?
>>>
>>> Thanks,
>>>
>>> --- Sungwoo
>>>
>>> On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney <russell.jur...@gmail.com>
>>> wrote:
>>>
>>>> I don't think Spark can do this with its current architecture. It has
>>>> to wait for each stage to finish; that kind of speculative execution
>>>> isn't possible. Others probably know more about why that is.
>>>>
>>>> Thanks,
>>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>>> <http://facebook.com/jurney> datasyndrome.com
>>>>
>>>>
>>>> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park <glap...@gmail.com> wrote:
>>>>
>>>>> Hello Spark users,
>>>>>
>>>>> I have a question on the architecture of Spark (which could lead to a
>>>>> research problem). In its current implementation, Spark finishes executing
>>>>> all the tasks in a stage before proceeding to child stages. For example,
>>>>> given a two-stage map-reduce DAG, Spark finishes executing all the map
>>>>> tasks before scheduling reduce tasks.
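>>>>>
>>>>> (A minimal sketch of such a two-stage DAG, with made-up data and `sc`
>>>>> being the SparkContext: the reduceByKey below introduces a shuffle, so
>>>>> the reduce-side stage is not scheduled until every map-side task has
>>>>> finished.)
>>>>>
>>>>>     val lines = sc.parallelize(Seq("a b", "b c", "a c"))
>>>>>
>>>>>     // Stage 1 (map side): flatMap and map run pipelined in one stage.
>>>>>     val pairs = lines.flatMap(_.split(" ")).map(w => (w, 1))
>>>>>
>>>>>     // reduceByKey adds a shuffle dependency, creating Stage 2, which
>>>>>     // today waits for all of Stage 1 to complete.
>>>>>     pairs.reduceByKey(_ + _).collect()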
>>>>>
>>>>> We can think of another 'pipelined execution' strategy in which tasks
>>>>> in child stages can be scheduled and executed concurrently with tasks in
>>>>> parent stages. For example, for the two-stage map-reduce DAG, while map
>>>>> tasks are being executed, we could schedule and execute reduce tasks in
>>>>> advance if the cluster has enough resources. These reduce tasks can also
>>>>> pre-fetch the output of map tasks.
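>>>>>
>>>>> (To make the idea concrete, here is a toy model of the proposed
>>>>> strategy, entirely outside Spark: the "reduce task" starts consuming
>>>>> map output from a queue while the "map task" is still producing,
>>>>> instead of waiting at a stage barrier.)
>>>>>
>>>>>     import java.util.concurrent.LinkedBlockingQueue
>>>>>
>>>>>     val mapOutput = new LinkedBlockingQueue[Int]()
>>>>>
>>>>>     // "Map task": produces output incrementally, then a sentinel.
>>>>>     val mapper = new Thread(() => {
>>>>>       (1 to 100).foreach(mapOutput.put)
>>>>>       mapOutput.put(-1)  // sentinel: map side finished
>>>>>     })
>>>>>
>>>>>     // "Reduce task": starts immediately, folding in records as they
>>>>>     // arrive rather than after the whole map phase completes.
>>>>>     val reducer = new Thread(() => {
>>>>>       var sum = 0
>>>>>       var v = mapOutput.take()
>>>>>       while (v != -1) { sum += v; v = mapOutput.take() }
>>>>>       println(s"pipelined reduce result: $sum")
>>>>>     })
>>>>>
>>>>>     mapper.start(); reducer.start()
>>>>>     mapper.join(); reducer.join()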
>>>>>
>>>>> Has anyone seen Spark jobs for which this 'pipelined execution'
>>>>> strategy would be desirable and the current implementation is not
>>>>> quite adequate? Since Spark tasks usually run for a short period of
>>>>> time, I guess the new strategy would not yield a major performance
>>>>> improvement in general. However, there might be some category of Spark
>>>>> jobs for which the new strategy would clearly be a better choice.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --- Sungwoo
>>>>>
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
