We break the status text into tokens on whitespace and punctuation, then
look each token up in a hashmap of tracked terms. If the language doesn't
separate words with whitespace, a whole run of text (often the entire Tweet)
becomes a single token, so that run is the only thing that can match.
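
To make the failure mode concrete, here is a minimal sketch of that kind of
matching. It is not the Streaming API code: the names tokenize and matches,
the use of Unicode "P*" categories as the punctuation test, and the sample
tracked terms are all assumptions for illustration.

```python
import unicodedata


def tokenize(text):
    """Split text on whitespace and punctuation (hypothetical tokenizer).

    Assumption: any character whose Unicode category starts with "P"
    counts as punctuation. The real tokenizer's rules may differ.
    """
    tokens, current = [], []
    for ch in text:
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            # A delimiter ends the token collected so far, if any.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def matches(text, tracked):
    """Return True if any whole token of text is a tracked term."""
    return any(token in tracked for token in tokenize(text))


# Illustrative tracked terms, stored in a set (hash-based lookup).
tracked = {"twitter", "日本"}

space_delimited = matches("i love twitter!", tracked)
unsegmented = matches("日本に行きたい", tracked)
exact_only = matches("日本", tracked)
```

With these inputs, space_delimited is True because "twitter" is a token of
its own. unsegmented is False: the sentence contains no spaces or
punctuation, so it stays one long token that is not equal to "日本".
exact_only is True only because the status text is exactly the tracked term.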

I know that Search has struggled with this as well. I take it that the
solutions aren't easy. At some point we'll have to figure something similar
out for Streaming. I've filed a story to add support for these languages in
Track.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.


2010/4/7 Toby Phipps <tphi...@gmail.com>

> Hi,
>
> Has anyone managed to get Japanese or Chinese language track
> predicates working with the Streaming API? No matter what I try, I
> fail to get any matches using "track" and any Japanese character, or
> word.
>
> I note from the doc that "Some UTF-8 keywords will not match
> correctly- this is a known temporary defect", however this sounds more
> like an edge case, maybe with certain denormalized Unicode forms.
> Does this really extend to pretty much any searching in Chinese/
> Japanese?
>
> Some of the predicates I've tried, all of which result in no statuses
> arriving:
>
> 日本 ("Japan" - shows up as being very frequent via the search API)
> よ (A Japanese form of exclamation - again very popular in tweets)
> ツイッター (Japanese for Twitter - literally "tsu-i-tta")
>
> Given the talk about a hash map being used for status matching, I'm
> thinking that this could be because no wordbreaking (n-gram/
> morphology) is performed against Chinese/Japanese tweets before they
> get added to the hash map, and since most words aren't space-delimited
> in these languages, if I don't manage to match an entire sentence, I
> won't get a hit. However, all these searches work just fine via the
> search API (which I understand is still on a different platform).
>
> Any ideas?
>
> Thanks,
> Toby.
>
