Fixing the number of tasks ensures that state is preserved during a rebalance 
and that tuples get routed to the same task id with fields grouping. Users could 
be storing some state in a bolt (like maintaining an in-memory counter) without 
necessarily using a stateful bolt. If the number of tasks is changed during a 
rebalance, that state no longer lines up with the tuples routed to each task. 
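
To illustrate the routing point, here is a minimal sketch in Python, not 
Storm's actual code. It assumes fields grouping behaves like "hash of the 
grouping field, modulo the number of tasks"; the helper names (choose_task, 
moved_keys) are made up for this example and crc32 is just a stand-in for a 
stable hash.

```python
import zlib


def choose_task(key: str, num_tasks: int) -> int:
    """Pick a task index for a key: stable hash modulo the task count.

    Assumption: this mimics fields grouping only in spirit. crc32 is used
    because it gives the same value on every run.
    """
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def moved_keys(keys, old_tasks, new_tasks):
    """Return the keys whose target task changes when the task count changes."""
    return [k for k in keys
            if choose_task(k, old_tasks) != choose_task(k, new_tasks)]


keys = [f"user-{i}" for i in range(1000)]

# Same task count before and after a rebalance: every key keeps its task,
# so an in-memory counter kept per task still sees the same keys.
unchanged = moved_keys(keys, 10, 10)

# Different task count: many keys now land on a different task, so any
# per-task in-memory state no longer matches the incoming tuples.
changed = moved_keys(keys, 10, 15)

print(len(unchanged), "keys moved with a fixed task count")
print(len(changed), "keys moved after changing the task count")
```

With the task count held fixed, no key moves. Once it changes, a large share 
of the keys are routed elsewhere, which is why the state would have to be 
migrated along with them.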

 

If we want to increase the number of tasks during a rebalance, we should handle 
the state migration as well.

 

Right now, if you want bolts to execute with increased parallelism during a 
rebalance, you need to over-provision the number of tasks. 

 

E.g. you start with parallelism = 2 and tasks = 10. There will be 2 threads 
executing 5 tasks each. Later, maybe you add more workers and rebalance with 
parallelism = 5; then there will be 5 threads executing 2 tasks each, so you 
end up with 5 threads executing your code.
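
The arithmetic above can be sketched like this. It is a rough model only: the 
function name assign_tasks is hypothetical, and the contiguous, even split is 
an assumption for illustration rather than Storm's actual assignment logic. 
The cap reflects that a component cannot have more threads than tasks.

```python
def assign_tasks(num_tasks: int, parallelism: int):
    """Split task ids 0..num_tasks-1 across executor threads as evenly as possible.

    Assumption: tasks are dealt out in contiguous chunks. The number of
    threads is capped at the number of tasks, since a thread needs at
    least one task to run.
    """
    executors = min(parallelism, num_tasks)
    base, extra = divmod(num_tasks, executors)
    assignment = []
    start = 0
    for i in range(executors):
        # The first `extra` executors take one additional task each.
        size = base + (1 if i < extra else 0)
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment


before = assign_tasks(num_tasks=10, parallelism=2)      # 2 threads, 5 tasks each
after = assign_tasks(num_tasks=10, parallelism=5)       # 5 threads, 2 tasks each
too_many = assign_tasks(num_tasks=10, parallelism=20)   # capped at 10 threads

for label, plan in (("before", before), ("after", after)):
    print(label, [len(tasks) for tasks in plan])
```

The set of task ids is the same before and after; only the grouping of tasks 
into threads changes, which is why the routing stays stable across the 
rebalance.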

 

Thanks,

Arun

 

From: "Thomas Cooper (PGR)" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, April 3, 2017 at 8:57 PM
To: "[email protected]" <[email protected]>
Subject: Why Tasks?

 

Hi, 

 

I was hoping that someone on here would be able to help me with a conceptual 
issue.

 

I understand how Storm implements parallelism. I am researching how to model 
the performance of Storm topologies so I have dug around in the source code 
quite a bit. However, I still can't quite wrap my head around tasks.

 

I know they are linked to Fields Groupings, so that a tuple with the same field 
value will always go to the same Executor. If task state was preserved through 
a re-balance then this would make sense, as the state would follow the task and 
tuples would continue to be routed correctly. But, as I understand it, by 
default task state is not preserved through a re-balance. In this stateless 
case having tasks doesn't seem to make sense: couldn't you arbitrarily number 
the executors of each component and use those numbers for routing tuples? This 
would remove the upper scaling limit for each component of the topology. 

Of course, if you have a state saving system (statefulBolt etc.), tasks make 
sense, and having tasks also simplifies the hash functions that do the routing. 
So is this the reason they exist, and is it the case that in the stateless case 
they are not strictly required (other than to make routing simpler)? 

I am concerned that I am missing something fundamental?

 

Thanks in advance, 

 

Thomas Cooper

PhD Student

Newcastle University, School of Computer Science

Twitter: @tomncooper
