Hi, I'm building a topology that connects Twitter with another application (e.g., appA).
My question is mainly about how to store the results of a streaming topology after computation.

appA is built around the following graph model (similar to Facebook/Twitter), where a user can have followers and follow other users. For example, UserA follows UserB, UserC, and UserD, and UserB/C/D can each have any number of followers. This information is currently stored in an Oracle table. I retrieve the corresponding Twitter ids for users B, C, and D and fetch the latest n tweets posted by them.

1) I have a Kafka Spout that streams the tweets for a specific set of userIds.
2) In another Bolt, I join the Kafka Spout output with the records in the Oracle table, so that each tweet is joined with all the users who follow the user that posted it.
3) After the join, I'll use a SlidingWindowBolt to capture, for a given user, the latest n tweets posted by the users they follow.

What is the best way to store the results of a stream, in this case the output of the SlidingWindowBolt? I could use a Redis instance to capture this information, but if userA is followed by 100 users, a tweet posted by userA would be duplicated 100 times. To avoid that duplication I could emit only tweetIds in the output fields of the SlidingWindowBolt and store the tweets themselves in a separate table. But since tweets keep streaming in, each stored record needs to be associated with an expiration period (similar to a cache).

How are such scenarios usually handled in a streaming application? Suggestions would be helpful.
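For concreteness, here is a rough sketch of the kind of storage I have in mind, assuming Jedis as the Redis client. The key names (tweet:<tweetId>, timeline:<followerId>), the TTL, and the timeline size below are just placeholder assumptions, not something the topology dictates.

import redis.clients.jedis.Jedis;
import java.util.ArrayList;
import java.util.List;

public class TweetTimelineStore {

    // Placeholder values: the actual retention window and timeline size depend on the use case.
    private static final int TWEET_TTL_SECONDS = 24 * 60 * 60;
    private static final int MAX_TIMELINE_SIZE = 100;

    private final Jedis jedis;

    public TweetTimelineStore(Jedis jedis) {
        this.jedis = jedis;
    }

    // Store the tweet body once, keyed by tweetId, with an expiration so old
    // tweets age out automatically instead of needing manual cleanup.
    public void saveTweet(String tweetId, String tweetJson) {
        jedis.setex("tweet:" + tweetId, TWEET_TTL_SECONDS, tweetJson);
    }

    // Fan out only the tweetId to each follower's timeline (a sorted set scored
    // by timestamp), so the tweet text itself is never duplicated per follower.
    public void addToTimeline(String followerId, String tweetId, long timestampMillis) {
        String key = "timeline:" + followerId;
        jedis.zadd(key, timestampMillis, tweetId);
        // Keep only the newest MAX_TIMELINE_SIZE ids per user.
        jedis.zremrangeByRank(key, 0, -(MAX_TIMELINE_SIZE + 1));
    }

    // Read the latest n tweetIds for a user; a null lookup on tweet:<id>
    // simply means that tweet has already expired.
    public List<String> latestTweetIds(String followerId, int n) {
        return new ArrayList<>(jedis.zrevrange("timeline:" + followerId, 0, n - 1));
    }
}

The idea would be that the bolt calls saveTweet once per tweet and addToTimeline once per (follower, tweet) pair, so only ids are fanned out and Redis's TTL handles the expiration.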
Thanks,
Kanagha