May I suggest a potentially much more efficient algorithm? Place all
keywords in a HashMap that maps keywords to a list of subscribed
users. Tokenize the status text, and look up each token in the hash
table to deliver the status to each subscribed user. Within the user,
apply a generational filter to prevent duplicate deliveries of the
same status. The statusid works just fine as an opaque marker, assuming
single-threaded operation or an appropriately scoped critical section
that atomically completes status delivery to all users. You cannot
assume strictly increasing statusids, so any generational index that
relies on arithmetic comparison other than equality is doomed.
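
The steps above can be sketched roughly as follows. This is only an
illustration in Python, not the Streaming API's actual code: the names
User, KeywordRouter, and tokenize are hypothetical, and the tokenizer
(lowercased runs of word characters) is an assumption that will not
match track's real tokenization rules exactly.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())


class User:
    def __init__(self, name):
        self.name = name
        self.last_status_id = None  # opaque generational marker
        self.inbox = []

    def deliver(self, status_id, text):
        # Compare by equality only; statusids are not assumed to increase.
        if self.last_status_id == status_id:
            return False  # already delivered while routing this status
        self.last_status_id = status_id
        self.inbox.append((status_id, text))
        return True


class KeywordRouter:
    def __init__(self):
        # keyword -> list of subscribed users
        self.subscribers = defaultdict(list)

    def subscribe(self, user, keyword):
        self.subscribers[keyword.lower()].append(user)

    def route(self, status_id, text):
        # One hash lookup per token, independent of the keyword count.
        # Assumes single-threaded use, or a lock held around this call.
        for token in tokenize(text):
            for user in self.subscribers.get(token, ()):
                user.deliver(status_id, text)
```

A user subscribed to both "ruby" and "python" receives a status
containing both words only once, because the second lookup sees the
same statusid already recorded on that user.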

This is how the Streaming API implements track (among other things).
Your client performs the same operation to demultiplex its stream into
per-user streams as the Streaming API performs on the Firehose to
create your stream. The cost per status is nearly fixed, as there are
only so many tokens per status, no matter how many keywords are
tracked. You are limited entirely by memory, as you can quickly
forward statuses to a large number of clients following a nearly
limitless set of keywords.

-John Kalucki
http://twitter.com/jkalucki
Services, Twitter Inc.


On Nov 3, 9:59 am, Fabien Penso <fabienpe...@gmail.com> wrote:
> I agree, however it would help a lot because instead of doing:
>
> for keyword in all_keywords
>  if tweet.match(keyword)
>   //matched, notify users
>  end
> end
>
> we could do
>
> for keyword in keywords_matched
>  // same as above
> end
>
> for matching 5,000 keywords, it would bring the first loop from 5,000
> to probably 1 or 2.
> You know what you matched, so it's quite easy for you just to include
> raw data of matched keywords, I don't need anything fancy. Just space
> separated keywords would help _so much_.
>
> On Tue, Nov 3, 2009 at 3:15 PM, John Kalucki <jkalu...@gmail.com> wrote:
>
> > The assumption is that client services will, in any case, have to
> > parse and route statuses to potentially multiple end-users. Providing
> > this sort of hint wouldn't eliminate the need to parse the status and
> > would likely result in duplicate effort. We're aware that we are, in
> > some use cases, externalizing development effort, but the use cases
> > for the Streaming API are so many that it's hard to define exactly
> > how much this feature would help and therefore how much we're
> > externalizing.
>
> > -John Kalucki
> > http://twitter.com/jkalucki
> > Services, Twitter Inc.
>
> > On Nov 3, 1:53 am, Fabien Penso <fabienpe...@gmail.com> wrote:
> >> Hi.
>
> >> Would it be possible to include the matched keywords in another field
> >> within the result from the streaming/keyword API?
>
> >> It would prevent matching those myself when matching for multiple
> >> internal users, to spread the tweets to the legitimate users, which
> >> can be time consuming and tough to do on lots of users/keywords.
>
> >> Thanks.
