I kind of disagree with you here... not because it's hard to match the
users (the algo you offered is what we use) but because you assume
that queries will just match a single keyword.
I think this is not doable if you start introducing things like + or
or || or , because you need to compare a
Have you researched Vector Space Model (VSM) and cosine theta calculations
or approximations?
You could calculate one of the approximations on the incoming stream
yourself.
Check out this paper http://www.cse.ust.hk/~dlee/Papers/ir/ieee-sw-rank.pdf
Regards,
Bryan
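
For anyone unfamiliar with the cosine measure mentioned above, here is a minimal Python sketch. It assumes plain term-frequency vectors with no TF-IDF weighting, which may differ from the weighting the linked paper uses; `term_vector` and `cosine_similarity` are hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector from lowercased word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


# Hypothetical query and status text, for illustration only.
query = term_vector("twitter streaming api")
status = term_vector("The Twitter Streaming API returns one tweet per match")
score = cosine_similarity(query, status)
```

A multi-term query then becomes a single vector compared against each status, and a score threshold stands in for an exact keyword match.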
A little late to this convo, but I disagree with the need for this feature.
It adds extra complexity to Twitter that really should be on the application
level, and, since the streaming API only returns one tweet, even if it
matched two or more keywords that you are watching, it'd add extra load on
I agree with the idea, since I too have this need, but I think that
you'll still need to check the existence of matches in filtered stream
results. The algorithm used by this API doesn't always return what
you'd expect or need, such as making sure the matches are separate
words, or they are used
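
A client-side re-check for whole-word matches might look roughly like the Python sketch below. It assumes a "separate word" means the keyword is not directly adjacent to another word character; `matched_keywords` is a hypothetical helper, not part of any Twitter API.

```python
import re


def matched_keywords(status_text, keywords):
    """Return the keywords that occur in status_text as whole words."""
    found = []
    for keyword in keywords:
        # Lookarounds rather than \b, so keywords such as "c++" also work.
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, status_text, re.IGNORECASE):
            found.append(keyword)
    return found


# "java" matches as a word; "script" only appears inside "javascript".
hits = matched_keywords("I love Java and javascript", ["java", "script"])
```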
The assumption is that client services will, in any case, have to
parse and route statuses to potentially multiple end-users. Providing
this sort of hint wouldn't eliminate the need to parse the status and
would likely result in duplicate effort. We're aware that we are, in
some use cases,
The Streaming API and the Search indexer both tee off the same point
in the new status event pipeline. New statuses are born in the web
containers and queued for a cluster of processes that begin the
offline processing pipeline. This first process does many things,
including routing statuses to
I agree; however, it would help a lot, because instead of doing:

    for keyword in all_keywords
      if tweet.match(keyword)
        // matched, notify users
      end
    end

we could do:

    for keyword in keywords_matched
      // same as above
    end

For matching 5,000 keywords, it would bring the first loop from 5,000
to
May I suggest a potentially much more efficient algorithm? Place all
keywords in a HashMap that maps keywords to a list of subscribed
users. Tokenize the status text, and look up each token in the hash
table to deliver the status to each subscribed user. Within the user,
apply a generational
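
The hash-table lookup described above can be sketched in Python as follows. This covers only the tokenize-and-look-up step, not any per-user processing; the tokenizer (lowercased `\w+` runs) and the names `subscriptions`, `subscribe`, and `route` are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Maps each lowercased keyword to the set of users subscribed to it.
subscriptions = defaultdict(set)


def subscribe(user, keyword):
    """Register a user's interest in a single keyword."""
    subscriptions[keyword.lower()].add(user)


def route(status_text):
    """Return {user: matched keywords} for one status.

    Cost grows with the number of tokens in the status,
    not with the total number of tracked keywords.
    """
    tokens = set(re.findall(r"\w+", status_text.lower()))
    matches = defaultdict(set)
    for token in tokens:
        # .get avoids creating empty entries for unknown tokens.
        for user in subscriptions.get(token, ()):
            matches[user].add(token)
    return matches


# Hypothetical subscribers, for illustration only.
subscribe("alice", "python")
subscribe("bob", "python")
subscribe("bob", "ruby")
deliveries = route("Python and Ruby are fun")
```

With 5,000 tracked keywords, each status still costs only one dictionary lookup per distinct token, and whole-word matching comes for free from the tokenizer.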