Hi,

I'm building a topology that connects Twitter with another application
(call it appA).

My question is mainly about how to store the results of a streaming
topology after computation.

appA consists of the following graph model (similar to Facebook/Twitter),
where a user can have followers and can follow other users.

Ex: UserA follows UserB, UserC, UserD.

And UserB/C/D can each have any number of followers.
This information is currently stored in an Oracle table.

I am retrieving the corresponding Twitter IDs for users B, C, and D, and
fetching the latest n tweets posted by them.

1) I have a Kafka spout that streams the tweets for a specific set of
userIds.
2) In another bolt, I join the Kafka spout's stream with the records in
the Oracle table, so that each tweet is joined with all the users who
follow the user that posted that tweet.
3) After the join, I'll use a SlidingWindowBolt to capture the latest n
tweets posted by all the users that a given user follows (a rough wiring
sketch is below).

What is the best way to store the results of a stream, in this case the
output of the SlidingWindowBolt?

I could use a Redis instance to capture this information.
But if a user (say UserA) is followed by 100 users, a tweet posted by
UserA would be duplicated 100 times.

To avoid the duplication, I could store only tweetIds in the output fields
of the SlidingWindowBolt and keep the tweets themselves in a separate
table. But since tweets keep streaming in, each record would need to be
stored with an expiration period (similar to a cache).
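
To make that idea (and the duplication concern above) concrete, a minimal
sketch of what I have in mind is below, using the Jedis client: each
per-follower timeline holds only tweetIds, and each tweet body is stored
exactly once under its own key with a TTL so it expires like a cache
entry. The key names, TTL, and n are placeholders.

import redis.clients.jedis.Jedis;

public class TweetStore {
    private static final int TWEET_TTL_SECONDS = 3600;   // placeholder expiration
    private static final int MAX_TWEETS_PER_USER = 50;   // the "latest n"

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Called for each (followerId, tweetId, tweetJson) emitted by the window bolt.
    public void store(String followerId, String tweetId, String tweetJson) {
        // Store the tweet body exactly once, with an expiration so it ages out.
        jedis.setex("tweet:" + tweetId, TWEET_TTL_SECONDS, tweetJson);

        // The per-follower timeline holds only tweetIds, trimmed to the latest n,
        // so a tweet fanned out to 100 followers is stored once and referenced
        // 100 times instead of being duplicated.
        String timelineKey = "timeline:" + followerId;
        jedis.lpush(timelineKey, tweetId);
        jedis.ltrim(timelineKey, 0, MAX_TWEETS_PER_USER - 1);
        jedis.expire(timelineKey, TWEET_TTL_SECONDS);
    }
}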

How are such scenarios usually dealt with in a streaming application?
Any suggestions would be helpful.

Thanks
Kanagha

