This question is directed at John, but I'm happy to hear how other
folks do it as well.
I've got a couple of questions about the tokenizing process on the
streaming API. An example from Twitter of their tokenizing
process/regexp would clear this up pretty easily, as I'm slightly
confused about which keywords would match. It would also be useful to
know what tokens a specific status update will generate, so I can do
an efficient hash lookup. For example, if I do a simple split on the
regexp [^\w]+, is that going to generate the correct set of tokens?
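
To make the question concrete, here is a minimal Python sketch of the
split I have in mind. It assumes tokens are lowercased runs of word
characters, which is my guess and not Twitter's documented behavior;
tokenize and matched_keywords are hypothetical names of my own.

```python
import re

# Assumption: a token is a run of word characters. This is the split
# proposed in the post, not a confirmed description of Twitter's tokenizer.
TOKEN_SPLIT = re.compile(r"[^\w]+")


def tokenize(status):
    """Lowercase the status and split it on runs of non-word characters."""
    return {tok for tok in TOKEN_SPLIT.split(status.lower()) if tok}


def matched_keywords(status, keywords):
    """Return the tracked keywords that appear as whole tokens in the status."""
    # Set intersection gives the hash lookup: one probe per token.
    return tokenize(status) & {k.lower() for k in keywords}


if __name__ == "__main__":
    print(matched_keywords("Check out http://icio.us/ now!", ["icio", "twitter"]))
```

If Twitter tokenizes differently, only TOKEN_SPLIT would need to change;
the lookup stays the same.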
In addition to the specifics of tokenizing, the docs state that the
keyword "Twitter" will not match "twitter.com". In a quick bit of
testing (with Delicious URLs, which see less traffic), it seems that the
keyword "icio" will match http://icio.us/. I also currently have open
streams matching just portions of a domain, and they appear to be
working. The docs make it seem as if punctuation is matched, but what
is defined as punctuation (by Twitter)? And if the docs are correct,
how would one match "twitter.com"?
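
Here is a rough Python sketch of the two behaviors I'm trying to tell
apart. Both tokenizers are hypothetical stand-ins of mine: one splits on
punctuation (which would explain "icio" matching http://icio.us/), the
other splits on whitespace only (which is how I read the docs). Neither
is confirmed to be what Twitter actually does.

```python
import re


def split_on_punctuation(text):
    """Hypothetical tokenizer A: punctuation separates tokens."""
    return {t for t in re.split(r"[^\w]+", text.lower()) if t}


def split_on_whitespace(text):
    """Hypothetical tokenizer B: punctuation stays inside the token."""
    return set(text.lower().split())


status = "Bookmarking http://icio.us/ and twitter.com today"

for name, tokenizer in [("punctuation", split_on_punctuation),
                        ("whitespace", split_on_whitespace)]:
    tokens = tokenizer(status)
    # Report which of the keywords in question would match as whole tokens.
    hits = [kw for kw in ("icio", "twitter", "twitter.com") if kw in tokens]
    print(name, "->", hits)
```

Under A, "icio" and "twitter" match but "twitter.com" can never be a
single token; under B it is the reverse. My testing looks like A, the
docs read like B.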
Confused in Seattle,