----- "John Kalucki" <j...@twitter.com> wrote:

> We break the status text into tokens by whitespace and punctuation,
> then apply the tokens to a hashmap of tracked terms. If the language
> doesn't have whitespace, the only thing that will match is the entire
> Tweet.
> I know that Search has struggled with this as well. I take it that the
> solutions aren't easy. At some point we'll have to figure something
> similar out for Streaming. I've filed a story to add support for these
> languages in Track.
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure Twitter Inc.

Thanks! I was just about to add CJK (Chinese - Japanese - Korean) regular 
expressions to my list of research topics! ;-) There must be something in the 
open source world we can (to use the tired old cliché) "leverage off of." ;-) 
Oniguruma?? Namazu?

I suppose we need to look at Cyrillic and right-to-left (Arabic and Hebrew) too?
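
For anyone following along, here is a rough sketch of the matching John describes above, as I understand it. It is only an illustration in Python, not Twitter's actual code: the names tokenize and match_tracked are made up, the lowercasing is my assumption, and I use a plain set where John says hashmap. It shows why a tracked term buried in unsegmented Japanese text never matches.

```python
import re

# Split on runs of non-word characters (whitespace and punctuation).
# In Python 3, \w covers Unicode letters, so CJK characters count as
# word characters and unsegmented CJK text stays in one piece.
TOKEN_SPLIT = re.compile(r"[^\w]+")


def tokenize(status_text):
    """Break a status into lowercase tokens (hypothetical helper)."""
    return [tok.lower() for tok in TOKEN_SPLIT.split(status_text) if tok]


def match_tracked(status_text, tracked_terms):
    """Return the tracked terms that appear as whole tokens."""
    return {tok for tok in tokenize(status_text) if tok in tracked_terms}


tracked = {"twitter", "東京"}

# Whitespace-delimited text: the term is its own token, so it matches.
print(match_tracked("I love Twitter, really!", tracked))

# No whitespace: the whole sentence is one token, so nothing matches.
print(match_tracked("東京に行きます", tracked))

# The term only matches when it is the entire text.
print(match_tracked("東京", tracked))
```

Fixing this would presumably mean putting a real word segmenter in front of the lookup for these languages, which is the hard part.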

M. Edward (Ed) Borasky

"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős
