Hi Flip,

Thanks for remembering that, Doug. That covers how we determine the languages, but I wanted to address the selection of languages. We currently recognize only the languages in the drop-down box on our site. We used to have a few more, but with the low number of tweets in those languages we found that the classifier was wrong about half the time (or 100% of the time in the case of Esperanto). Because of that I did some analysis and kept only the languages that had a reasonable success rate, to make sure the feature was useful. It still misclassifies some things, but overall I'm pretty happy with it.

Having said all of that, I never did try out the five languages you mention. I just did a search for Hindi "I" (per Google Translate) with &lang=all and it found 0 results. It seems unlikely that there are people using Twitter in Hindi but none of them using the word "I". The more likely explanation is that Twitter users in India are using English.

If you're a language geek and interested in what else went through my mind while writing this email, continue on to the footer. If you're the normal API developer then there is no need.

Thanks,
  — Matt Sanford / @mzsanford


Heavily Foot-noted and Parenthetical Geekery:

I'm a language geek so I thought about this a bit more after my initial search for "I" in Hindi. In some languages where there are no spaces between words (Chinese), or where there are complex suffix systems (Arabic), searches work poorly since we break things up on spaces. Hindi isn't in either of those categories as far as I know, but my reading on Devanagari is limited to only one book [1] and Omniglot.com. Since both of those checked out I started thinking about Volapuk (not the language [2], the encoding [3]), where Cyrillic is written with Latin characters, making it more SMS friendly (because, well, SMS as a protocol sucks). I don't know of a Devanagari-to-Latin transliteration but I'm pretty sure there is one, as there are for most languages (damn euro-centric linguists).

Given all of that I stopped thinking like a language geek and went back to thinking as a normal geek. I searched for near:Mumbai [4]. I do see some results that look like Latin-transliterated non-English [5], but I would need a huge chunk of data in order to train the language detector to pick things like that out. Since that was the first I found in 100 results that was long enough to be sure, I'm forced to stick with my initial statement. It looks like users in Mumbai are using Twitter in English.
    Now how's *that* for overly detailed answers?
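
The point above about breaking things up on spaces can be sketched in a few lines of Python. This is only an illustration, assuming a search that matches whole whitespace-separated tokens; the function names and sample sentences are made up for the example and say nothing about how the real search index is built.

```python
def whitespace_tokens(text):
    """Split text into tokens on whitespace only (assumed tokenizer)."""
    return text.split()


def matches(query, text):
    """Return True if the query equals one of the whitespace tokens."""
    return query in whitespace_tokens(text)


# Both sentences mean "I like potatoes".
english = "I like potatoes"
chinese = "我喜欢土豆"

# English words are separated by spaces, so each word is its own token.
print(whitespace_tokens(english))

# Chinese has no spaces between words, so the whole sentence is one token
# and a search for the word "土豆" (potatoes) never matches it.
print(whitespace_tokens(chinese))
print(matches("potatoes", english), matches("土豆", chinese))
```

A language like this needs a word segmenter (or character n-grams) before matching; plain splitting on spaces is not enough.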

[1] - http://www.amazon.com/Writing-Systems-World-Akira-Nakanishi/dp/0804816549
[2] - http://en.wikipedia.org/wiki/Volapük
[3] - http://en.wikipedia.org/wiki/Volapuk_encoding
[4] - http://search.twitter.com/search?q=near%3AMumbai
[5] - http://twitter.com/dhavalhirdhav/statuses/1115064841 - incorrectly classified as German :(

On Jan 20, 2009, at 06:29 AM, dougw wrote:


flip,
Matt Sanford has a great answer for this already posted:

http://groups.google.com/group/twitter-development-talk/msg/565313d7b36e8d65

@dougw

On Jan 20, 4:46 am, "Philip (flip) Kromer" <[email protected]>
wrote:
I'm unable to get search results for a bunch of languages from India using the lang= query parameter. As it turns out, I don't speak any of them so it's possible I'm doing something dumb. But I've tried a variety of phrases from the most excellent Hovercraft<http://www.omniglot.com/language/phrases/hovercraft.htm> list, and words I'm sure must match, and have gotten nowhere.

These all work:
  Chinese  http://search.twitter.com/search.atom?q=tinyurl&lang=zh
  Russian  http://search.twitter.com/search.atom?q=tinyurl&lang=ru
  Urdu     http://search.twitter.com/search.atom?q=tinyurl&lang=ur
  Farsi    http://search.twitter.com/search.atom?q=tinyurl&lang=fa

These all time out:
  Hindi    http://search.twitter.com/search.atom?q=tinyurl&lang=hi
  Bengali  http://search.twitter.com/search.atom?q=tinyurl&lang=bn
  Telugu   http://search.twitter.com/search.atom?q=tinyurl&lang=te
  Marathi  http://search.twitter.com/search.atom?q=tinyurl&lang=mr
  Tamil    http://search.twitter.com/search.atom?q=tinyurl&lang=ta

A search for the word 'alu' ('potato' in Bengali) returns no results:

http://search.twitter.com/search.atom?q=%E0%A6%86%E0%A6%B2%E0%A7%81&l...

A search for the word 'it' in Hindi<http://translate.google.com/translate_t?q=यह> returns no results:

http://search.twitter.com/search.atom?q=%E0%A4%AF%E0%A4%B9&lang=hi

A search for the word 'Hindi' in Hindi<http://translate.google.com/translate_t?hl=en&q=हिंदी> returns no results:

http://search.twitter.com/search.atom?q=%E0%A4%B9%E0%A4%BF%E0%A4%82%E...
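
For anyone reproducing these queries, here is a minimal Python sketch of how such URLs can be assembled with the query percent-encoded as UTF-8. The endpoint and the q and lang parameters are taken from the URLs in this thread; the helper name build_search_url is hypothetical, and the sketch only builds the URL string without sending a request.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the URLs quoted in this thread.
SEARCH_ENDPOINT = "http://search.twitter.com/search.atom"


def build_search_url(query, lang=None):
    """Build a search URL; build_search_url is a hypothetical helper name."""
    params = {"q": query}
    if lang is not None:
        # ISO 639-1 code, e.g. "hi" for Hindi or "bn" for Bengali.
        params["lang"] = lang
    # urlencode percent-encodes non-ASCII text as UTF-8 by default.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# The Hindi word for 'it' used in the search above.
print(build_search_url("यह", lang="hi"))

# The plain ASCII query used in the per-language tests.
print(build_search_url("tinyurl", lang="ta"))
```

Building the URL this way avoids stray spaces or broken escapes in hand-edited query strings, which rules out one possible cause of empty results.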

The language codes I'm using are from
 http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
The searches were performed at 3pm Mumbai time / 0900 UTC, which is probably max demand for Indian users but, I believe, near the global demand minimum.

In the words of Apu Nahasapeemapetilon, "There are over 700 million of us!"
:)

Any advice appreciated,
flip
