[twitter-dev] Re: Recommended ways to demultiplex the search stream with thousands of searches

Jeffrey Greenberg Mon, 19 Apr 2010 14:41:44 -0700

Just to clarify:
If I have thousands of boolean searches that map to the current search
capability, and I want to move all or some of them to the Twitter
Streaming API, I have to deal with the fact that streams don't support
boolean expressions, just direct single-term matches.  So I must
either build a homebrew scheme for evaluating the boolean expressions
(e.g. the regex idea I mentioned at the start) or use a heavier-weight
free-text search capability (e.g. Lucene).


Is that right?

jeffrey greenberg
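
For illustration, here is one possible shape for the "homebrew" option
above. It is only a sketch under assumptions that are not from the
thread: each boolean search is written as an OR of AND-groups with
optional excluded terms, tweets are tokenized with a simple \w+ regex,
and the names Query, build_index and demultiplex are made up for this
example. One term per AND-group is indexed, and the full expression is
checked only for queries that the index turns up.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in a tweet's text."""
    return set(re.findall(r"\w+", text.lower()))


class Query:
    """Hypothetical boolean search: OR of AND-groups, plus NOT terms."""

    def __init__(self, name, any_of, none_of=()):
        self.name = name
        self.groups = [frozenset(t.lower() for t in g) for g in any_of]
        self.excluded = frozenset(t.lower() for t in none_of)

    def index_terms(self):
        # One term per AND-group is enough to find the query: if a group
        # matches, every one of its terms is present in the tweet.
        # min() is an arbitrary choice; the rarest term would be better.
        return {min(g) for g in self.groups}

    def matches(self, toks):
        if toks & self.excluded:
            return False
        return any(g <= toks for g in self.groups)


def build_index(queries):
    """Map each indexed term to the queries that depend on it."""
    index = {}
    for q in queries:
        for term in q.index_terms():
            index.setdefault(term, []).append(q)
    return index


def demultiplex(index, text):
    """Return the names of all queries that the tweet text satisfies."""
    toks = tokens(text)
    hits = set()
    for tok in toks:
        for q in index.get(tok, ()):
            if q.name not in hits and q.matches(toks):
                hits.add(q.name)
    return sorted(hits)


q1 = Query("q1", [["apple", "pie"], ["cider"]], none_of=["recipe"])
q2 = Query("q2", [["apple"]])
index = build_index([q1, q2])
print(demultiplex(index, "Apple pie for lunch"))
```

The point of the index is that each tweet costs a handful of dictionary
lookups plus a few set comparisons, rather than one regex run per
search.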



On Apr 19, 1:52 pm, John Kalucki <j...@twitter.com> wrote:
> In brief: Take all of your search terms and put them into a HashTable
> that maps from keyword to subscriber. Tokenize each tweet's text field
> and apply each token to the HashTable, sending the Tweet on to all
> subscribers. Each subscriber can do a generational deduplication to
> avoid getting each tweet twice -- by storing the status id in the
> subscriber object.
>
> If each subscriber keeps a copy of their search terms, you can even do
> subscriber removal from the HashTable when the subscriber stops their
> query.
>
> You can tokenize multi-threaded, but do the hash table apply and hash
> table set operations in a single thread. This is plenty of concurrency
> and leads to a simple programming model -- and the easy generational
> deduplication scheme above.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
> On Mon, Apr 19, 2010 at 11:28 AM, Jeffrey Greenberg
> <jeffreygreenb...@gmail.com> wrote:
> > I was unable to attend Chirp in person, so I could not hear John
> > Kalucki's comments on this... Anyone have any notes on this... John?
>
> > j
>
> > On Apr 16, 3:36 pm, Jeffrey Greenberg <jeffreygreenb...@gmail.com>
> > wrote:
> >> So I'm looking at the streaming api (track), and I've got thousands of
> >> searches.  (http://tweettronics.com) I mainly need it to deal with
> >> terms that are very high volume, and to deal with search API rate limiting.
>
> >> The main difficulty I'm thinking about is the best way to de-multiplex
> >> the stream back into the individual searches I'm trying to accomplish.
>
> >> 1. How do you handle searches that are more complex than single
> >> terms, such as a boolean expression? Do you convert the boolean into
> >> something like a regex, and then run that regex on every tweet? If
> >> I have several thousand regexes and thousands of tweets, that's a huge
> >> amount of processing just to demultiplex. But is that the way to go?
> >> 2. And if the search is just a simple expression, do folks simply
> >> demultiplex by doing a string search for each word in the search for
> >> every received tweet, like above?
>
> >> I'm looking for recommended ways to demultiplex the search stream...
>
> >> Thanks,
> >> jeffrey greenberg
>
> >> --
> >> Subscription settings: http://groups.google.com/group/twitter-development-talk/subscribe?hl=en
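
For reference, the keyword-to-subscriber scheme John Kalucki describes
earlier in the thread could be sketched roughly as follows. This is an
illustration, not his code: the class names Subscriber and
Demultiplexer, the \w+ tokenizer and the in-memory "received" list are
all assumptions made for the example. It relies on dispatch() running in
a single thread, so that all tokens of one tweet are applied before the
next tweet starts, which is what makes the one-status-id deduplication
work.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the tweet text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class Subscriber:
    def __init__(self, name, terms):
        self.name = name
        # Each subscriber keeps its own terms so it can be removed later.
        self.terms = {t.lower() for t in terms}
        self.last_status_id = None  # used for deduplication
        self.received = []          # stand-in for real delivery

    def deliver(self, status_id, text):
        # A tweet that matches several of this subscriber's terms arrives
        # several times in a row; only the first arrival is kept.
        if status_id == self.last_status_id:
            return
        self.last_status_id = status_id
        self.received.append((status_id, text))


class Demultiplexer:
    def __init__(self):
        self.table = defaultdict(set)  # keyword -> set of subscribers

    def add(self, sub):
        for term in sub.terms:
            self.table[term].add(sub)

    def remove(self, sub):
        for term in sub.terms:
            subs = self.table.get(term)
            if subs is None:
                continue
            subs.discard(sub)
            if not subs:
                del self.table[term]  # drop keywords nobody tracks

    def dispatch(self, status_id, text):
        # Apply every token of the tweet to the table (single-threaded).
        for token in tokenize(text):
            for sub in self.table.get(token, ()):
                sub.deliver(status_id, text)


demux = Demultiplexer()
alice = Subscriber("alice", ["python", "coffee"])
bob = Subscriber("bob", ["tea"])
demux.add(alice)
demux.add(bob)
demux.dispatch(1, "Python and coffee, python again")
demux.dispatch(2, "Tea time")
print(len(alice.received), len(bob.received))
```

Tokenizing could be moved to worker threads as suggested in the thread,
as long as the table lookups and updates stay on one thread.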
