Which fields are you doing fieldsGrouping on? If you do fieldsGrouping on
X and Y, why would you have a race condition across separate bolt tasks? Each
X and Y combination should always go to the same bolt task with fieldsGrouping,
and the scenario you describe should work properly whether you have 1 task,
4 tasks, or 100 tasks.
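
To make that concrete, here is a rough sketch of the idea in plain Java. It is
a simplified model of how a fields grouping could pick a task (hash of the
grouping fields modulo the task count), not Storm's actual code. The class
name FieldsGroupingSketch, the helper taskFor, and the sample keys are made up
for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Simplified model of fields grouping. NOT Storm's real implementation;
// all names here are hypothetical and exist only for illustration.
class FieldsGroupingSketch {

    // Choose a task index using only the grouping fields (X and Y).
    static int taskFor(Object x, Object y, int numTasks) {
        int hash = Arrays.hashCode(new Object[] {x, y});
        // floorMod keeps the result in [0, numTasks) even for negative hashes.
        return Math.floorMod(hash, numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4; // the bolt parallelism mentioned in the thread
        String[][] tuples = {
            {"x1", "y1"}, {"x2", "y1"}, {"x1", "y1"}, {"x1", "y2"}, {"x1", "y1"}
        };
        Map<String, Integer> seen = new HashMap<>();
        for (String[] t : tuples) {
            int task = taskFor(t[0], t[1], numTasks);
            String key = t[0] + "/" + t[1];
            Integer previous = seen.put(key, task);
            // The same (X, Y) pair must never land on two different tasks.
            if (previous != null && previous != task) {
                throw new IllegalStateException(key + " moved between tasks");
            }
            System.out.println(key + " -> task " + task);
        }
    }
}
```

If the grouping fields are exactly X and Y, every read-modify-write for a
given X and Y pair is handled by one task, one tuple at a time. If the
grouping is on some other field, two tasks can still touch the same row.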

On Tue, Jan 20, 2015 at 4:11 PM, Kushan Maskey <
[email protected]> wrote:

> Not at the moment. We have been using KafkaSpout for all the other
> projects but have not looked into using Trident. How would it help resolve
> the issue we are facing at the moment? We also need to keep in mind the
> development time it would take to implement Trident, while KafkaSpout has
> been working fine with all the other projects.
>
> --
> Kushan Maskey
>
> On Tue, Jan 20, 2015 at 3:05 PM, Rajiv Onat <[email protected]> wrote:
>
>> Seems like stateful processing. Have you looked at using Trident?
>>
>> -Rajiv
>>
>> On Jan 20, 2015, at 12:26 PM, Kushan Maskey <
>> [email protected]> wrote:
>>
>> Thanks Keith and Itai,
>>
>> We are using fieldsGrouping. Initially we were using shuffleGrouping; we
>> saw this problem and then moved to fieldsGrouping, with better results, until
>> now. I am thinking the bolt's parallelism, which we have set to 4, is
>> the culprit here. My understanding is that parallelism means threading;
>> correct me if I am wrong.
>>
>> --
>> Kushan Maskey
>>
>> On Tue, Jan 20, 2015 at 1:03 PM, Itai Frenkel <[email protected]> wrote:
>>
>>>  Hello,
>>>
>>>
>>>  Are you familiar with fields grouping? The idea is that the same bolt
>>> instance always updates the value of a specific key (similar to web
>>> load balancer cookie stickiness).
>>>
>>> https://storm.apache.org/documentation/Concepts.html
>>>
>>> *"Fields grouping: The stream is partitioned by the fields specified
>>> in the grouping. For example, if the stream is grouped by the "user-id"
>>> field, tuples with the same "user-id" will always go to the same task, but
>>> tuples with different "user-id"'s may go to different tasks."*
>>>
>>>
>>>  Itai
>>>
>>>  ------------------------------
>>>
>>> *From:* Kushan Maskey <[email protected]>
>>> *Sent:* Tuesday, January 20, 2015 8:55 PM
>>> *To:* [email protected]
>>> *Subject:* URGENT!! Race condition
>>>
>>>  We are having a major issue trying to update a Cassandra database, where
>>> we see a race condition in a bolt.
>>>
>>>  Here is an example,
>>>
>>>  I have a column family with 2 partitioning columns, say X and
>>> Y. There is another column, Z, which is basically an aggregated number. We
>>> are supposed to update Z based on X and Y. Storm is reading a huge volume
>>> of data from Kafka. When the spout receives a message, the first bolt reads
>>> the database for that combination of X and Y and gets the value of Z. Then
>>> it updates the value of Z and stores it back into the database. Bolt
>>> parallelism is set to 4, which means 4 instances of the bolt are trying to
>>> update the database. So when the first bolt (B1) reads the value of Z as,
>>> say, 100, at the same time the second bolt (B2) also reads it as 100. Once
>>> B1 has completed execution and the value of Z is now 150, B2 still has 100,
>>> so the value of Z is out of sync.
>>>
>>>  How can we prevent a race condition like this? This is causing a
>>> major nuisance to us.
>>>
>>>  Any help is highly appreciated. Thanks.
>>>
>>>    --
>>> Kushan Maskey
>>>
>>>
>>
>
