I kind of disagree with you here... not because it's hard to match the
users (the algo you offered is what we use) but because you assume
that queries will just match a single keyword.

I think this is not doable if you start introducing things like + or &
or || or "", because you need to compare against a finite list of tokens
plus an infinite (or almost!) list of combined tokens...

Julien



On Nov 3, 11:41 pm, John Kalucki <jkalu...@gmail.com> wrote:
> May I suggest a potentially much more efficient algorithm? Place all
> keywords in a HashMap that maps keywords to a list of subscribed
> users. Tokenize the status text, and look up each token in the hash
> table to deliver the status to each subscribed user. Within the user,
> apply a generational filter to prevent duplicate deliveries of the
> same status. The statusid as an opaque marker works just fine assuming
> single-threaded operation or an appropriately scoped critical section
> that atomically completes status delivery to all users. You cannot
> assume strictly increasing statusids, so arithmetic comparison other
> than equality is a doomed generational index.
>
> This is how the Streaming API implements track (among other things).
> Your client is performing the same streaming operations to demultiplex
> the stream into your client streams as the Streaming API does to the
> Firehose to create your stream. The cost is nearly fixed, as there are
> only so many tokens per status. You are limited entirely by memory, as
> you can quickly forward statuses to a large number of clients
> following a nearly limitless set of keywords.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Services, Twitter Inc.
>
> On Nov 3, 9:59 am, Fabien Penso <fabienpe...@gmail.com> wrote:
>
>
>
> > I agree; however, it would help a lot because instead of doing:
>
> > for keyword in all_keywords
> >  if tweet.match(keyword)
> >   # matched, notify users
> >  end
> > end
>
> > we could do
>
> > for keyword in keywords_matched
> >  # same as above
> > end
>
> > For matching 5,000 keywords, it would bring the first loop from 5,000
> > iterations to probably 1 or 2.
> > You know what you matched, so it's quite easy for you just to include
> > the raw data of matched keywords. I don't need anything fancy. Just
> > space-separated keywords would help _so much_.
>
> > On Tue, Nov 3, 2009 at 3:15 PM, John Kalucki <jkalu...@gmail.com> wrote:
>
> > > The assumption is that client services will, in any case, have to
> > > parse and route statuses to potentially multiple end-users. Providing
> > > this sort of hint wouldn't eliminate the need to parse the status and
> > > would likely result in duplicate effort. We're aware that we are, in
> > > some use cases, externalizing development effort, but the use cases
> > > for the Streaming API are so many that it's hard to define exactly
> > > how much this feature would help and therefore how much we're
> > > externalizing.
>
> > > -John Kalucki
> > > > http://twitter.com/jkalucki
> > > Services, Twitter Inc.
>
> > > > On Nov 3, 1:53 am, Fabien Penso <fabienpe...@gmail.com> wrote:
> > >> Hi.
>
> > >> Would it be possible to include the matched keywords in another field
> > >> within the result from the streaming/keyword API?
>
> > >> It would prevent matching those myself when matching for multiple
> > >> internal users, to spread the tweets to the legitimate users, which
> > >> can be time consuming and tough to do on lots of users/keywords.
>
> > >> Thanks.
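
For readers following the thread, the keyword-to-subscribers lookup that John Kalucki describes can be sketched roughly as below. This is only an illustration under stated assumptions, not the Streaming API's actual code: the class name `KeywordRouter`, its methods, and the naive tokenizer are hypothetical, operation is assumed to be single-threaded, and only single-keyword subscriptions are handled (compound queries such as the ones Julien mentions are out of scope). The duplicate filter compares status ids by equality only, since ids are not assumed to be increasing.

```ruby
require "set"

# Hypothetical router: maps each tracked keyword to its subscribed users.
class KeywordRouter
  def initialize
    # keyword => set of users tracking that keyword
    @subscribers = Hash.new { |hash, keyword| hash[keyword] = Set.new }
    # user => id of the last status delivered to that user (opaque marker)
    @last_delivered = {}
  end

  def subscribe(user, keyword)
    @subscribers[keyword.downcase] << user
  end

  # Looks up every token of the status text and returns the users the
  # status was delivered to, each user at most once per status.
  def route(status_id, text)
    delivered = []
    tokenize(text).each do |token|
      next unless @subscribers.key?(token)
      @subscribers[token].each do |user|
        # Equality check only: ids are treated as opaque, not ordered.
        next if @last_delivered[user] == status_id
        @last_delivered[user] = status_id
        delivered << user
      end
    end
    delivered
  end

  private

  # Naive tokenizer (an assumption): lowercase, split on non-word characters.
  def tokenize(text)
    text.downcase.scan(/\w+/)
  end
end

router = KeywordRouter.new
router.subscribe("alice", "ruby")
router.subscribe("alice", "rails")
router.subscribe("bob", "rails")
recipients = router.route(42, "Ruby on Rails is fun")
puts recipients.inspect
```

The work per status depends on the number of tokens in the text rather than on the number of tracked keywords, which is the point made in the thread about the cost being nearly fixed.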