In brief: Take all of your search terms and put them into a HashTable
that maps from keyword to the subscribers tracking that keyword.
Tokenize each tweet's text field and look each token up in the
HashTable, sending the tweet on to every subscriber found. Each
subscriber can do a generational deduplication to avoid getting the
same tweet twice when several of its keywords match -- by storing the
last delivered status id in the subscriber object.
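
A rough Python sketch of the lookup and deduplication described above.
The names Subscriber, Demultiplexer and tokenize are made up for
illustration, and the regex tokenizer is only an assumption; it does
not reproduce the track matching rules of the Streaming API. Tweets are
assumed to be dicts with "id" and "text" keys.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits, _, # and @.
    return re.findall(r"[a-z0-9_#@]+", text.lower())


class Subscriber:
    def __init__(self, name, terms):
        self.name = name
        self.terms = set(terms)     # the subscriber keeps a copy of its terms
        self.last_status_id = None  # id of the most recent tweet delivered
        self.received = []

    def deliver(self, tweet):
        # Deduplication: a tweet matching several of this subscriber's
        # keywords arrives several times in a row, so comparing against
        # the last delivered status id is enough to drop the repeats.
        if self.last_status_id == tweet["id"]:
            return
        self.last_status_id = tweet["id"]
        self.received.append(tweet)


class Demultiplexer:
    def __init__(self):
        self.table = defaultdict(set)  # keyword -> set of subscribers

    def subscribe(self, sub):
        for term in sub.terms:
            self.table[term.lower()].add(sub)

    def dispatch(self, tweet):
        # Apply every token of the tweet to the table.
        for token in tokenize(tweet["text"]):
            for sub in self.table.get(token, ()):
                sub.deliver(tweet)


demux = Demultiplexer()
alice = Subscriber("alice", ["python", "java"])
bob = Subscriber("bob", ["java"])
demux.subscribe(alice)
demux.subscribe(bob)
demux.dispatch({"id": 1, "text": "Python and Java together"})
demux.dispatch({"id": 2, "text": "nothing relevant here"})
```

The last-id check only works because each tweet is fully dispatched
before the next one starts, so all matches for one tweet are adjacent.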

If each subscriber keeps a copy of their search terms, you can even do
subscriber removal from the HashTable when the subscriber stops their

You can tokenize multi-threaded, but do the hash table apply and hash
table set operations in a single thread. This is plenty of concurrency
and leads to a simple programming model -- and the easy generational
deduplication scheme above.
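
One way this split could look in Python, as a sketch only: a thread
pool tokenizes, and the calling thread alone reads the table and
updates subscriber state. The function names, the dict-based
subscribers and the regex tokenizer are all assumptions made for the
example, and tweets is assumed to be a list of dicts with "id" and
"text" keys.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    # Hypothetical tokenizer; returns the distinct lowercase tokens.
    return set(re.findall(r"[a-z0-9_#@]+", text.lower()))


def run(tweets, table, workers=4):
    """Tokenize in worker threads, apply to the table in this thread.

    table maps keyword -> list of subscriber dicts, each with the keys
    "last_status_id" and "received".
    """
    texts = [tweet["text"] for tweet in tweets]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so tweets are
        # still applied one at a time, in arrival order.
        for tweet, tokens in zip(tweets, pool.map(tokenize, texts)):
            for token in tokens:
                for sub in table.get(token, ()):
                    # Only this thread touches subscriber state, so the
                    # last-id deduplication needs no locking.
                    if sub["last_status_id"] != tweet["id"]:
                        sub["last_status_id"] = tweet["id"]
                        sub["received"].append(tweet["id"])


s1 = {"last_status_id": None, "received": []}
s2 = {"last_status_id": None, "received": []}
table = {"python": [s1], "java": [s1, s2]}
tweets = [
    {"id": 10, "text": "Java and Python"},
    {"id": 11, "text": "just java"},
    {"id": 12, "text": "rust"},
]
run(tweets, table)
```

Adding or removing keywords would also happen on the applying thread,
between tweets, so the table never changes while it is being read.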

-John Kalucki
Infrastructure, Twitter Inc.

On Mon, Apr 19, 2010 at 11:28 AM, Jeffrey Greenberg wrote:
> I was unable to attend Chirp in person, so I could not hear John
> Kalucki's comments on this... Anyone have any notes on this... John?
> j
> On Apr 16, 3:36 pm, Jeffrey Greenberg wrote:
>> So I'm looking at the streaming api (track), and I've got thousands of
>> searches. (I mainly need it to deal with
>> terms that are very high volume, and to deal with search API rate limiting.)
>> The main difficulty I'm thinking about is the best way to de-multiplex
>> the stream back into the individual searches I'm trying to accomplish.
>> 1. How do you handle if the searches are more complex than single
>> terms, but a boolean expression... Do you convert the boolean into
>> something like regex, and then run that regex on every tweet... So if
>> I have several thousand regexs and thousands of tweets, that's a huge
>> amount of processing just to demultiplex... But is that the way to go?
>> 2. And if the search is just a simple expression, do folks simply
>> demultiplex by doing a string search for each word in the search for
>> every received tweet... like above?
>> I'm looking for recommended ways to demultiplex the search stream...
>> Thanks,
>> jeffrey greenberg