We break the status text into tokens on whitespace and punctuation, then look each token up in a hash map of tracked terms. If the language doesn't separate words with whitespace, the only thing that will match is the entire Tweet.
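To illustrate the matching described above, here is a minimal Python sketch. It is not Twitter's implementation: `TRACKED_TERMS`, `tokenize`, and `matching_terms` are hypothetical names, and the delimiter rule (split on any run of non-word characters) is an assumption standing in for "whitespace and punctuation".

```python
import re

# Hypothetical set of tracked terms (a set stands in for the hash map).
TRACKED_TERMS = {"twitter", "日本"}

def tokenize(text):
    # Assumption: any run of characters that are not Unicode word
    # characters (whitespace, punctuation) acts as a delimiter.
    return [tok for tok in re.split(r"\W+", text) if tok]

def matching_terms(text):
    # A tracked term matches only if it equals a whole token.
    return sorted({tok for tok in tokenize(text) if tok in TRACKED_TERMS})

# Space-delimited text yields one token per word, so the lookup succeeds.
print(matching_terms("I love twitter, really!"))

# Unsegmented Japanese yields a single token for the whole run,
# so the tracked term "日本" inside it is never looked up on its own.
print(tokenize("日本に行きたい"))
print(matching_terms("日本に行きたい"))
```

Under these assumptions, a tracked term such as 日本 would match only when it appears as its own delimited token, which is consistent with the behavior reported in the quoted message below.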
I know that Search has struggled with this as well. I take it that the solutions aren't easy. At some point we'll have to figure something similar out for Streaming. I've filed a story to add support for these languages in Track.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure Twitter Inc.

2010/4/7 Toby Phipps <tphi...@gmail.com>

> Hi,
>
> Has anyone managed to get Japanese or Chinese language track
> predicates working with the Streaming API? No matter what I try, I
> fail to get any matches using "track" and any Japanese character or
> word.
>
> I note from the doc that "Some UTF-8 keywords will not match
> correctly- this is a known temporary defect", however this sounds more
> like an edge case, maybe with certain denormalized Unicode forms.
> Does this really extend to pretty much any searching in
> Chinese/Japanese?
>
> Some of the predicates I've tried, all of which result in no statuses
> arriving:
>
> 日本 ("Japan" - shows up as being very frequent via the search API)
> よ (A Japanese form of exclamation - again very popular in tweets)
> ツイッター (Japanese for Twitter - literally "tsu-i-tta")
>
> Given the talk about a hash map being used for status matching, I'm
> thinking that this could be because no wordbreaking
> (n-gram/morphology) is performed against Chinese/Japanese tweets
> before they get added to the hash map, and since most words aren't
> space-delimited in these languages, if I don't manage to match an
> entire sentence, I won't get a hit. However, all these searches work
> just fine via the search API (which I understand is still on a
> different platform).
>
> Any ideas?
>
> Thanks,
> Toby.