[twitter-dev] Re: Adding more languages to lang parameter in Search API

dbasch Sat, 28 Nov 2009 21:10:48 -0800

Toma,

There are tools you could use to do language detection on your side
and filter out non-Serbian tweets. I assume what slows you down is the
call to Google's Language Detection API:


http://code.google.com/apis/ajaxlanguage/documentation/#Detect

You should try the n-gram based language identifier that comes with
the Nutch search engine. You can build a language model for Serbian
relatively quickly (just feed it a file with a fair amount of text in
Serbian) and see how well it works:

http://wiki.apache.org/nutch/LanguageIdentifier
http://lucene.apache.org/nutch/
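
To give a feel for how an n-gram identifier works, here is a minimal Python sketch of the general rank-based character n-gram technique (an "out-of-place" distance between ranked n-gram profiles). It is not Nutch's code or API. The names ngram_profile, out_of_place, identify and TOP_K, and the two toy training sentences, are assumptions made up for illustration. A real model would need far more text per language.

```python
from collections import Counter

TOP_K = 300  # profile size; an arbitrary choice for this sketch


def ngram_profile(text, n_max=3, top_k=TOP_K):
    """Map the top_k most frequent character n-grams (n = 1..n_max) to their rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            if gram.strip():  # ignore n-grams made only of spaces
                counts[gram] += 1
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty=TOP_K):
    """Sum of rank differences; n-grams unknown to the model cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, models):
    """Return the language code whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))


# Toy training text (hypothetical, far too small for real use).
SAMPLES = {
    "sr": "Живот је леп. Данас је леп дан у Београду и идемо у шетњу поред реке.",
    "ru": "Жизнь прекрасна. Сегодня хороший день в Москве, и мы идём гулять вдоль реки.",
}
MODELS = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

With models built this way, identify(tweet_text, MODELS) returns the code of the closest language, so a tweet could be dropped whenever the result is not "sr". Since everything runs locally, there is no per-tweet network call.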

Diego

On Nov 29, 1:30 am, Тома Тасовац <transpoet...@gmail.com> wrote:
> Could somebody from the Twitter team please address my question about 
> language recognition in the search API?
>
> Many thanks in advance.
> T.
>
> On 25.11.2009 at 10:00, Toma wrote:
>
> > Hi there.
>
> > I am working on a WordNet-based Serbian-English dictionary (part of
> > the Transpoetika Project at the Belgrade Center for Digital Humanities,
> > http://humanistika.org).
>
> > I've implemented a "LiveQuote" system with Twitter, where we get most
> > recent tweets exemplifying the use of a given dictionary entry. We
> > also have several other ideas on how to integrate Twitter in our
> > dictionary application, both on the production and reception ends.
>
> > But we're facing a serious performance issue: Twitter's language
> > parameter (lang) does not recognize Serbian (sr). My workaround has
> > been to use Google Translate's API to check tweets to make sure they
> > are really Serbian. It works, and Google is pretty good at this (not
> > 100%, but close enough), but it has considerably slowed down the
> > process -- every tweet we get for a certain word has to be checked
> > with Google before being displayed.
>
> > Without a language check, however, we run into cases where certain
> > Russian, Bulgarian, Macedonian etc. tweets will sometimes sneak into
> > our results thanks to interlingual homographs. For example, живот in
> > Serbian means "life", while in Russian it means "stomach".
>
> > I am curious how you guys check for language identity on your backend,
> > and whether there is any chance you could include Serbian in the
> > list?
>
> > All best,
> > Toma
>
>
