This question is directed towards John, but happy to hear how other folks do it as well.
I've got a couple of questions regarding the tokenizing process on the streaming API. An example from Twitter of their tokenizing process or regexp would clear this up pretty easily, as I'm slightly confused about which keywords would match. It would also be useful so I could know which tokens a specific status update will generate, for an efficient hash lookup. For example, if I do a simple split on the [^\w]+ regexp, is that going to generate the correct set of tokens?

In addition to the specifics of tokenizing, the docs state that the keyword "Twitter" will not match "twitter.com". In a quick bit of testing (with delicious URLs, which get less traffic), it seems that the keyword "icio" will match http://icio.us/. I also currently have open streams matching just portions of a domain, and that appears to be working.

The docs make it seem as if punctuation is matched, but what is defined as punctuation (by Twitter)? And if the docs are correct, how would one match "twitter.com"?

Confused in Seattle,
Damon/@dacort
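
P.S. To make the "simple split" concrete, here is a minimal sketch of what I mean on the client side. This is only my guess at a tokenizer, not Twitter's actual process; the function name naive_tokens, the sample status text, and the lowercasing step are all my own assumptions.

```python
import re

def naive_tokens(status):
    """Split a status on runs of non-word characters.

    Assumption: lowercasing so that lookups are case-insensitive.
    This is a guess at the behavior, not Twitter's real tokenizer.
    """
    return {t.lower() for t in re.split(r"[^\w]+", status) if t}

# Hypothetical status text, for illustration only.
tokens = naive_tokens("Bookmarking http://icio.us/ from twitter.com")

# Hash lookup: is a tracked keyword among the generated tokens?
print("icio" in tokens)         # the URL is broken apart at the punctuation
print("twitter" in tokens)      # "twitter.com" is split into two tokens
print("twitter.com" in tokens)  # the full domain never survives as one token
```

With this split, "twitter" would be a token of "twitter.com", which is the opposite of what the docs describe, so I assume Twitter is doing something different with punctuation.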