----- "John Kalucki" <j...@twitter.com> wrote:

> We break the status text into tokens by whitespace and punctuation,
> then apply the tokens to a hashmap of tracked terms. If the language
> doesn't have whitespace, the only thing that will match is the entire
> Tweet.
>
> I know that Search has struggled with this as well. I take it that the
> solutions aren't easy. At some point we'll have to figure something
> similar out for Streaming. I've filed a story to add support for these
> languages in Track.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure Twitter Inc.
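
For illustration only, here is a minimal Python sketch of the matching described in the quoted message: split the status text on whitespace and punctuation, then look each token up in a hashmap of tracked terms. The function names, the regular expression, the lowercasing, and the subscriber IDs are all assumptions made for the example; this is not Twitter's actual Streaming code.

```python
import re

# Assumed tokenizer: treat any run of non-word characters (whitespace,
# punctuation, underscore) as a separator.
_SEPARATORS = re.compile(r"[\W_]+")


def tokenize(text):
    """Split status text into lowercase tokens on whitespace and punctuation."""
    return [tok.lower() for tok in _SEPARATORS.split(text) if tok]


def match_tracked(text, tracked):
    """Return the tracked terms that appear as whole tokens in the text.

    `tracked` is a hypothetical hashmap: term -> list of subscriber IDs.
    """
    hits = {}
    for token in tokenize(text):
        if token in tracked:
            hits[token] = tracked[token]
    return hits


if __name__ == "__main__":
    tracked = {"twitter": ["client-1"], "東京": ["client-2"]}

    # Whitespace-delimited text: the tracked term is found as a token.
    print(match_tracked("I love Twitter!", tracked))

    # No whitespace or punctuation: the whole Tweet is a single token,
    # so the tracked term "東京" inside it does not match.
    print(tokenize("東京は今日も暑いですね"))
    print(match_tracked("東京は今日も暑いですね", tracked))
```

In this sketch the Japanese status contains no separators, so it comes out as one token and the tracked term only matches when it is the entire Tweet, which is the limitation described above. Handling such text would need a word-segmentation step in place of the simple split.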

Thanks! I was just about to add CJK (Chinese - Japanese - Korean) regular expressions to my list of research topics! ;-)

There must be something in the open source world we can (to use the tired old cliché) "leverage off of." ;-) Oniguruma?? Namazu?

I suppose we need to look at Cyrillic and right-to-left (Arabic and Hebrew) too?

--
M. Edward (Ed) Borasky
http://borasky-research.net/smart-at-znmeb

"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős