Hi Flip,
Thanks for remembering that, Doug. That covers how we determine
the languages but I wanted to address the selection of languages. We
only currently recognize the languages in the drop-down box on our
site. We used to have a few more but with the low number of tweets in
those languages we found the detection was wrong about half the time (or,
100% of the time in the case of Esperanto). Because of that I did some
analysis and only kept the languages that had a reasonable success
rate to make sure the feature was useful. It still mis-classifies some
things but overall I'm pretty happy with it.
Having said all of that, I never did try out the five languages
you mention. I just did a search for Hindi "I" (per Google translate)
with &lang=all and it found 0 results. It seems unlikely there are
people using Twitter in Hindi but none of them using the word "I". The
more likely explanation is that Twitter users in India are using
English. If you're a language geek and interested in what else went
through my mind while writing this email, continue on to the footer.
If you're the normal API developer then there is no need.
Thanks;
— Matt Sanford / @mzsanford
Heavily Foot-noted and Parenthetical Geekery:
I'm a language geek so I thought about this a bit more after my
initial search for "I" in Hindi. In some languages where there are no
spaces between words (Chinese), or where there are complex suffix
systems (Arabic), searches work poorly or fail since we break things up
on spaces. Hindi isn't in either of those categories as far as I know,
but my reading on Devanagari is limited to only one book [1] and
Omniglot.com. Since both of those checked out I started thinking about
Volapuk (not the language [2], the encoding [3]), where Cyrillic is
written with Latin characters, making it more SMS friendly (because,
well, SMS as a protocol sucks). I don't know of a Latin
transliteration of Devanagari but I'm pretty sure there is one, as there are for
most languages (damn euro-centric linguists).
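
To make the space-splitting problem concrete, here is a minimal Python sketch. It assumes a plain whitespace tokenizer with exact token matching, which is a simplification and not necessarily what the search backend does; the function names and sample strings are made up for illustration.

```python
def tokenize_on_spaces(text):
    """Split text into tokens on whitespace only (assumed tokenizer)."""
    return text.split()


def matches(query, text):
    """Return True if the query appears as a whole token in the text."""
    return query in tokenize_on_spaces(text)


english = "twitter search works"
chinese = "我喜欢用推特"  # no spaces between words

# English splits into separate words, so a single-word query can match.
print(tokenize_on_spaces(english), matches("search", english))

# The Chinese sentence stays one long token, so a query for a word
# inside it never matches under this scheme.
print(tokenize_on_spaces(chinese), matches("推特", chinese))
```

Under these assumptions the whole Chinese sentence is indexed as a single token, which is why a query for one word inside it comes back empty.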
Given all of that I stopped thinking like a language geek and
went back to thinking as a normal geek. I searched for near:Mumbai
[4]. I do see some results that look like Latin-transliterated
non-English [5] but I would need a huge chunk of data in order to train
the language detector to pick things like that out. Since that was the
first I found in 100 results that was long enough to be sure, I'm
forced to stick with my initial statement. It looks like users in
Mumbai are using Twitter in English.
Now how's *that* for overly detailed answers?
[1] - http://www.amazon.com/Writing-Systems-World-Akira-Nakanishi/dp/0804816549
[2] - http://en.wikipedia.org/wiki/Volapük
[3] - http://en.wikipedia.org/wiki/Volapuk_encoding
[4] - http://search.twitter.com/search?q=near%3AMumbai
[5] - http://twitter.com/dhavalhirdhav/statuses/1115064841 -
incorrectly classified as German :(
On Jan 20, 2009, at 06:29 AM, dougw wrote:
flip,
Matt Sanford has a great answer for this already posted:
http://groups.google.com/group/twitter-development-talk/msg/565313d7b36e8d65
@dougw
On Jan 20, 4:46 am, "Philip (flip) Kromer" <[email protected]>
wrote:
I'm unable to get search results for a bunch of languages from India using
the lang= query parameter. As it turns out, I don't speak any of them so
it's possible I'm doing something dumb. But I've tried a variety of phrases
from the most excellent Hovercraft
<http://www.omniglot.com/language/phrases/hovercraft.htm> list,
and words I'm sure must match, and have gotten nowhere.
These all work:
Chinese http://search.twitter.com/search.atom?q=tinyurl&lang=zh
Russian http://search.twitter.com/search.atom?q=tinyurl&lang=ru
Urdu http://search.twitter.com/search.atom?q=tinyurl&lang=ur
Farsi http://search.twitter.com/search.atom?q=tinyurl&lang=fa
These all time out:
Hindi http://search.twitter.com/search.atom?q=tinyurl&lang=hi
Bengali http://search.twitter.com/search.atom?q=tinyurl&lang=bn
Telugu http://search.twitter.com/search.atom?q=tinyurl&lang=te
Marathi http://search.twitter.com/search.atom?q=tinyurl&lang=mr
Tamil http://search.twitter.com/search.atom?q=tinyurl&lang=ta
A search for the word 'alu' ('potato' in Bengali) returns no results:
http://search.twitter.com/search.atom?q=%E0%A6%86%E0%A6%B2%E0%A7%81&l...
A search for the word 'it' in Hindi
<http://translate.google.com/translate_t?q=यह> returns no results:
http://search.twitter.com/search.atom?q=%E0%A4%AF%E0%A4%B9&lang=hi
A search for the word 'Hindi' in Hindi
<http://translate.google.com/translate_t?hl=en&q=हिंदी> returns no results:
http://search.twitter.com/search.atom?q=%E0%A4%B9%E0%A4%BF%E0%A4%82%E...
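
For anyone reproducing these queries, here is a small Python sketch of how the search URLs above could be assembled so that the non-Latin query text is percent-encoded as UTF-8 and the lang parameter stays intact. The endpoint and the q and lang parameter names are taken from the URLs in this message; the helper name search_url is made up for illustration, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Endpoint exactly as it appears in the URLs in this thread.
BASE = "http://search.twitter.com/search.atom"


def search_url(query, lang=None):
    """Build a search URL; urlencode percent-encodes the query as UTF-8."""
    params = {"q": query}
    if lang is not None:
        params["lang"] = lang
    return BASE + "?" + urlencode(params)


# Plain ASCII query restricted to Hindi.
print(search_url("tinyurl", "hi"))

# The Hindi word for 'it'; the two characters become %E0%A4%AF%E0%A4%B9.
print(search_url("यह", "hi"))
```

The second call reproduces the Hindi 'it' URL listed above, so the encoding in that query looks right and the empty result is more likely a matter of what the search indexes.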
The language codes I'm using are from
http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
The searches were performed at 3pm Mumbai time / 0900 UTC, which is probably
max demand for Indian users but I believe is near the global demand minimum.
In the words of Apu Nahasapeemapetilon, "There are over 700 million of us!"
:)
Any advice appreciated,
flip