Quoting John Kalucki <j...@twitter.com>:

We don't have current plans to fix this issue. The problem isn't around
utf-8, but rather around non-space separated languages. Our language
processing experts described the effort required, and it's a pretty large
project, and may be computationally impractical in the current streaming
architecture. There is a workaround, albeit a generally impractical one:
take the firehose and perform the language parsing on your end.

-John Kalucki
http://twitter.com/jkalucki
Twitter Inc.
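
For anyone wanting to try the workaround John mentions, here is a minimal sketch of doing the keyword matching on your own end. It assumes track matching works on whitespace-delimited tokens (my reading of his comment, not a confirmed description of Twitter's implementation) and that each status arrives as a dict with a "text" field. The function names are hypothetical, and plain substring matching stands in for real language parsing.

```python
import unicodedata


def normalize(text):
    # NFKC folds full-width/half-width variants; casefold handles Latin case.
    return unicodedata.normalize("NFKC", text).casefold()


def whitespace_match(text, keyword):
    # Token-based matching: the keyword must equal a whitespace-delimited token.
    # This is the assumed behaviour that breaks down for unsegmented text.
    return normalize(keyword) in normalize(text).split()


def substring_match(text, keyword):
    # Naive client-side alternative: the keyword may appear anywhere in the text.
    return normalize(keyword) in normalize(text)


def filter_statuses(statuses, keywords):
    # `statuses` is any iterable of dicts with a "text" field (an assumption
    # about how you have already decoded the stream).
    for status in statuses:
        text = status.get("text", "")
        if any(substring_match(text, k) for k in keywords):
            yield status


if __name__ == "__main__":
    sample = [
        {"text": "今日は東京でラーメンを食べた"},
        {"text": "트위터를 사용합니다"},
        {"text": "nothing relevant here"},
    ]
    for status in filter_statuses(sample, ["東京", "트위터"]):
        print(status["text"])
```

Substring matching is cheap but over-matches: a search for 京都 (Kyoto) also hits 東京都 (Tokyo Metropolis). Avoiding that needs real word segmentation, which is presumably the large project John is describing.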

Is there a "canonical list" of non-space-separated languages? I'm just starting to look into this myself. There's quite a bit of research available for Chinese, but what are the others? And while we're on the subject, how about right-to-left languages?

Yes, it's a large project, but CJK and Arabic represent large *markets* too. I can understand Twitter needing to prioritize engineering resources, but the marketer in me says such problems could be solved with the application of money and a Twitter lab somewhere in east Asia. ;-)


On Mon, Jun 28, 2010 at 1:34 AM, sjoonk <sjo...@gmail.com> wrote:

I know the current Twitter Streaming API does not support UTF-8 track
keywords.
As a CJK engineer, I hope this feature will be implemented soon.
Does anybody know when this feature will be supported in the Twitter
Streaming API?




