Thanks for replying. 

I understand the db approaches. I'm not sure how, or whether, they apply to a 
topology of spouts and bolts. 

In a multithreaded, and potentially distributed, environment, there is no 
guarantee that updates will be applied in the same order in which they arrived 
by the time they reach a terminal bolt. 

I was wondering if anyone has had to manage synchronization per uniquely 
identifiable unit of data, e.g. Record ID, considering that there may be 
millions of them passing through the topology. 

Does Storm itself or a plugin provide any mechanism (e.g. Rolling window) to do 
so?

----- Original Message -----
From: Ambud Sharma <[email protected]>
To: RAMIN FARAJOLLAH, [email protected]
At: 06-May-2017 12:44:45


1. If messages from 2 spouts can trigger updates to the same row, you will 
ideally need to process them using a single thread. If there is a possibility 
that updates can be triggered at the same time, you will need some sort of 
master epoch and will have to compare timestamps to determine the sequence in 
which to apply these updates. Additionally, if your end database or your schema 
supports versioning, that would probably be the most lock-free setup you can 
achieve. Generally speaking, this problem is out of scope for Storm. However, 
for processing events for a given row with a rowid, e.g. xyz, you can use 
FieldsGrouping, which will guarantee that all tuples for this row always go to 
the same instance of a Bolt.

2. As mentioned earlier, you need either locking or versioning to control this. 
For starters, reviewing the MVCC (multi-version concurrency control) concept 
might help.
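
To make the two points above a little more concrete, here is a minimal sketch 
in plain Java. It does not use the Storm API. It only mimics the idea: route 
by row ID so one task owns each row, then let that task reject any update 
whose version is not newer than the one it already holds. All names here 
(RowUpdateSketch, Update, RowState, taskFor) are hypothetical, and the sketch 
assumes each update already carries a monotonically increasing version 
assigned upstream (the "master epoch" mentioned above). In a real topology 
the routing would come from fieldsGrouping on the row ID field rather than 
from taskFor.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names come from Storm. */
class RowUpdateSketch {

    /** One update to one row. The version is assumed to be assigned upstream. */
    static final class Update {
        final String rowId;
        final long version;
        final String value;

        Update(String rowId, long version, String value) {
            this.rowId = rowId;
            this.version = version;
            this.value = value;
        }
    }

    /** Mimics the idea behind fields grouping: the same key maps to the same task index. */
    static int taskFor(String rowId, int numTasks) {
        return Math.floorMod(rowId.hashCode(), numTasks);
    }

    /** State owned by a single task. Only one thread touches it, so no locks are needed. */
    static final class RowState {
        private final Map<String, Update> latest = new HashMap<>();

        /** Applies the update only if it is newer than the stored one; returns true if applied. */
        boolean apply(Update u) {
            Update current = latest.get(u.rowId);
            if (current != null && current.version >= u.version) {
                return false; // stale or duplicate update, ignore it
            }
            latest.put(u.rowId, u);
            return true;
        }

        /** Returns the current value for a row, or null if the row is unknown. */
        String valueOf(String rowId) {
            Update u = latest.get(rowId);
            return u == null ? null : u.value;
        }
    }

    public static void main(String[] args) {
        int numTasks = 4;
        RowState[] tasks = new RowState[numTasks];
        for (int i = 0; i < numTasks; i++) {
            tasks[i] = new RowState();
        }

        // Version 1 of row "xyz" arrives after version 2.
        Update[] arrivals = {
            new Update("xyz", 2, "from spout B"),
            new Update("xyz", 1, "from spout A"),
            new Update("abc", 1, "from spout A")
        };
        for (Update u : arrivals) {
            boolean applied = tasks[taskFor(u.rowId, numTasks)].apply(u);
            System.out.println(u.rowId + " v" + u.version + " applied=" + applied);
        }
    }
}
```

Note that this is a "newest version wins" policy: a late update is dropped, 
not reordered. If every update must be applied in sequence, the task would 
have to buffer out-of-order updates until the gap is filled, which this 
sketch does not attempt.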


On Thu, May 4, 2017 at 8:39 AM, Ramin Farajollah (BLOOMBERG/ 731 LEX) 
<[email protected]> wrote:

Hi,

The questions are around sequencing and synchronization of certain tuples.

In my use case, I have a few spouts that act upon millions of cached rows 
before the updated rows successfully exit the topology (published to clients).

A new tuple (an update) from spout A may result in thousands of updated rows. 
The same with spout B, except that the updates may or may not overlap.

Also, performance is important.


The questions are:

1. How can I ensure the updates for each row are applied in the order of 
arrival? (A given row can be updated from multiple spouts/streams.)

2. How can I ensure a new update does not step over in-flight updates? 
(Probably the same as the last question)

Thank you


<< "A mind is like a parachute. It doesn't work if it is not open." Frank 
Zappa >>


