Which fields are you doing fieldsGrouping on? If you are grouping on X and Y, why would you see a race condition with a separate bolt task? With fieldsGrouping, every tuple with a given X and Y combination always goes to the same bolt task, so the scenario you describe should work properly whether you have 1 task, 4 tasks, or 100 tasks.
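
To illustrate the point, here is a minimal sketch of why grouping on both fields pins a key to one task. It does not use Storm itself: the class name and the taskFor helper are made up for illustration, and Storm's real task selection differs in detail. The only property that matters is that the chosen task is a pure function of the grouping values. In a real topology this corresponds to declaring the bolt with fieldsGrouping(<upstream component id>, new Fields("X", "Y")).

```java
import java.util.Arrays;
import java.util.List;

// Illustrative simulation only; not Storm code.
class FieldsGroupingSketch {

    // Hypothetical helper: derive a task index from the grouping field values.
    // Equal value lists have equal hash codes, so they map to the same task.
    static int taskFor(List<?> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4; // bolt parallelism from the thread

        // Two separate tuples that carry the same (X, Y) combination.
        int first = taskFor(Arrays.asList("x1", "y1"), numTasks);
        int second = taskFor(Arrays.asList("x1", "y1"), numTasks);

        // Both tuples land on the same task, so the read-modify-write
        // for that key is handled by a single bolt task.
        System.out.println("same task: " + (first == second));
    }
}
```

If the grouping covers only X, or only some other field, two tuples with the same (X, Y) can still reach the same task, but the guarantee you actually rely on should come from grouping on exactly the columns that identify the row being updated.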

On Tue, Jan 20, 2015 at 4:11 PM, Kushan Maskey <[email protected]> wrote:

> Not at the moment. We have been using KafkaSpout for all the other
> projects but have not looked into using Trident. How would it help resolve
> the issue we are facing at the moment? We also need to keep in mind the
> development time it would take to implement Trident, while KafkaSpout has
> been working fine with all the other projects.
>
> --
> Kushan Maskey
>
> On Tue, Jan 20, 2015 at 3:05 PM, Rajiv Onat <[email protected]> wrote:
>
>> Seems like stateful processing. Have you looked at using Trident?
>>
>> -Rajiv
>>
>> On Jan 20, 2015, at 12:26 PM, Kushan Maskey <[email protected]> wrote:
>>
>> Thanks Keith and Itai,
>>
>> We are using fieldsGrouping. Initially we were using shuffleGrouping; we
>> saw this problem and then moved to fieldsGrouping, with better results,
>> until now. I am thinking the bolt's parallelism, which we have set to 4,
>> is the culprit here. My understanding of parallelism is threading;
>> correct me if I am wrong.
>>
>> --
>> Kushan Maskey
>>
>> On Tue, Jan 20, 2015 at 1:03 PM, Itai Frenkel <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> Are you familiar with fields grouping? The idea is that the same bolt
>>> instance would always update the value of a specific key (similar to
>>> web load balancer cookie stickiness).
>>>
>>> https://storm.apache.org/documentation/Concepts.html
>>>
>>> *"Fields grouping: The stream is partitioned by the fields specified
>>> in the grouping. For example, if the stream is grouped by the "user-id"
>>> field, tuples with the same "user-id" will always go to the same task,
>>> but tuples with different "user-id"'s may go to different tasks."*
>>>
>>> Itai
>>>
>>> ------------------------------
>>>
>>> *From:* Kushan Maskey <[email protected]>
>>> *Sent:* Tuesday, January 20, 2015 8:55 PM
>>> *To:* [email protected]
>>> *Subject:* URGENT!! Race condition
>>>
>>> We are having a major issue trying to update a Cassandra database,
>>> where we see a race condition in a bolt.
>>>
>>> Here is an example.
>>>
>>> I have a column family with 2 partitioning columns, say X and Y. There
>>> is another column Z, which is basically an aggregated number. We are
>>> supposed to update Z based on X and Y. Storm is reading a huge volume
>>> of data from Kafka. When the spout receives a message, the first bolt
>>> reads the database for that combination of X and Y and gets the value
>>> of Z. Then it updates the value of Z and stores it back into the
>>> database. Bolt parallelism is set to 4, which means 4 instances of the
>>> bolt are trying to update the database. So when the first bolt (B1)
>>> reads the value of Z to be, say, 100, at the same time the second bolt
>>> (B2) also reads it to be 100. Once B1 has completed execution and the
>>> value of Z is now 150, B2 still has 100, so the value of Z is out of
>>> sync.
>>>
>>> How can we prevent a race condition like this? This is causing a major
>>> nuisance to us.
>>>
>>> Any help is highly appreciated. Thanks.
>>>
>>> --
>>> Kushan Maskey
