Thanks a lot for your suggestion, Diego! I will look into it. 
All best,
Toma

On 29.11.2009, at 6:10, dbasch wrote:

> Toma,
> 
> There are tools you could use to do language detection on your side
> and filter out non-Serbian tweets. I assume what slows you down is the
> call to Google's Language Detection API:
> 
> http://code.google.com/apis/ajaxlanguage/documentation/#Detect
> 
> You should try the n-gram based language identifier that comes with
> the Nutch search engine. You can build a language model for Serbian
> relatively quickly (just feed it a file with a fair amount of text in
> Serbian) and see how well it works:
> 
> http://wiki.apache.org/nutch/LanguageIdentifier
> http://lucene.apache.org/nutch/
> 
> Diego
> 
> On Nov 29, 1:30 am, Тома Тасовац <transpoet...@gmail.com> wrote:
>> Could somebody from the Twitter team please address my question about 
>> language recognition in the search API?
>> 
>> Many thanks in advance.
>> T.
>> 
>> On 25.11.2009, at 10:00, Toma wrote:
>> 
>>> Hi there.
>> 
>>> I am working on a WordNet-based Serbian-English dictionary (part of
>>> Transpoetika Project at the Belgrade Center for Digital Humanities,
>>> http://humanistika.org)
>> 
>>> I've implemented a "LiveQuote" system with Twitter, where we get most
>>> recent tweets exemplifying the use of a given dictionary entry. We
>>> also have several other ideas on how to integrate Twitter in our
>>> dictionary application, both on the production and reception ends.
>> 
>>> But we're facing a serious performance issue: Twitter's language
>>> parameter (lang) does not recognize Serbian (sr). My workaround has
>>> been to use Google Translate's API to check that tweets are really
>>> Serbian. It works, and Google is pretty good at this (not 100%, but
>>> close enough), but it has slowed the process down considerably:
>>> every tweet we get for a given word has to be checked with Google
>>> before being displayed.
>> 
>>> Without a language check, however, we run into cases where Russian,
>>> Bulgarian, Macedonian etc. tweets sneak into our results thanks to
>>> interlingual homographs. For example, живот means "life" in
>>> Serbian, while in Russian it means "stomach".
>> 
>>> I am curious how you guys check for language identity on your backend,
>>> and whether there is any chance you could include Serbian in the
>>> list?
>> 
>>> All best,
>>> Toma
>> 
>> 
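
For anyone following this thread, here is a rough sketch of the idea behind
Diego's suggestion: identify the language locally from character n-grams
instead of calling a remote service for every tweet. This is not Nutch's
LanguageIdentifier and does not use its API. The sample sentences, the
function names (ngrams, build_profile, score, detect, keep_language) and
the scoring rule (the share of a text's n-grams that also occur in a
language's training text) are all assumptions made for illustration. A
usable model would need much more training text per language.

```python
import re
from collections import Counter

# Tiny illustrative training samples (assumption: a real model needs far more text).
SAMPLES = {
    "sr": "Живот је леп. Данас је леп дан у Београду. Волим да читам књиге и пијем кафу.",
    "ru": "У меня болит живот. Сегодня хороший день в Москве. Я люблю читать книги и пить кофе.",
}


def ngrams(text, n_min=1, n_max=3):
    """Yield character n-grams of each word, padded with '_' at word boundaries."""
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{word}_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def build_profile(text):
    """Count every n-gram seen in the training text for one language."""
    return Counter(ngrams(text))


PROFILES = {lang: build_profile(text) for lang, text in SAMPLES.items()}


def score(text, profile):
    """Fraction of the text's n-grams that also occur in the profile."""
    grams = list(ngrams(text))
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)


def detect(text, profiles=PROFILES):
    """Return the language code whose profile covers the text best."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))


def keep_language(tweets, lang, profiles=PROFILES):
    """Keep only the tweets detected as the requested language."""
    return [t for t in tweets if detect(t, profiles) == lang]


if __name__ == "__main__":
    tweets = ["Живот је леп", "У меня болит живот"]
    print(keep_language(tweets, "sr"))
```

Both sample tweets contain the homograph from the original question. They
are told apart by the surrounding letters: "ј" never occurs in the Russian
sample and "я" never occurs in the Serbian one. Ties and empty strings fall
back to the first language in the dictionary, so a real filter would want a
minimum score threshold and frequency-ranked profiles.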
