Toma,

There are tools you could use to do language detection on your side and filter out non-Serbian tweets. I assume what slows you down is the call to Google's Language Detection API:
http://code.google.com/apis/ajaxlanguage/documentation/#Detect

You should try the n-gram based language identifier that comes with the Nutch search engine. You can build a language model for Serbian relatively quickly (just feed it a file with a fair amount of text in Serbian) and see how well it works:

http://wiki.apache.org/nutch/LanguageIdentifier
http://lucene.apache.org/nutch/

Diego

On Nov 29, 1:30 am, Тома Тасовац <transpoet...@gmail.com> wrote:
> Could somebody from the Twitter team please address my question about
> language recognition in the search API?
>
> Many thanks in advance.
> T.
>
> On 25.11.2009, at 10:00, Toma wrote:
>
> > Hi there.
> >
> > I am working on a WordNet-based Serbian-English dictionary (part of
> > the Transpoetika Project at the Belgrade Center for Digital Humanities,
> > http://humanistika.org)
> >
> > I've implemented a "LiveQuote" system with Twitter, where we get the
> > most recent tweets exemplifying the use of a given dictionary entry. We
> > also have several other ideas on how to integrate Twitter in our
> > dictionary application, both on the production and reception ends.
> >
> > But we're facing a serious performance issue: Twitter's language
> > parameter (lang) does not recognize Serbian (sr). My workaround has
> > been to use Google Translate's API to check tweets to make sure they
> > are really Serbian. It works, Google is pretty good about this (not
> > 101%, but close enough), but this has considerably slowed down the
> > process -- every tweet we get for a certain word has to be checked
> > with Google before being displayed.
> >
> > Without a language check, however, we run into cases where certain
> > Russian, Bulgarian, Macedonian etc. tweets will sometimes sneak into
> > our results thanks to interlingual homographs. For example, живот in
> > Serbian means "life", while in Russian it means "stomach".
> >
> > I am curious how you guys check for language identity on your backend,
> > and whether there is any chance you could include Serbian in the
> > list?
> >
> > All best,
> > Toma
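
For anyone following the thread: below is a rough Python sketch of the general n-gram idea behind identifiers like the one suggested in the reply. It is not Nutch's code or its model format. It ranks character n-grams by frequency and compares profiles with an "out-of-place" rank distance, which is one common way to do this. The function names, the top_k and n_max values, and the two tiny training samples are all made up for illustration; a usable model needs a fair amount of text per language, as noted above.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    # Lowercase, collapse whitespace, and pad both ends so that the first
    # and last word boundaries are counted like the ones inside the text.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, models):
    """Return the language code whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc_profile, models[lang]))


# Toy training samples (assumptions for illustration; far too small for real use).
SERBIAN_SAMPLE = (
    "Живот је леп. Данас је леп дан у Београду, "
    "људи шетају поред реке и читају књиге."
)
RUSSIAN_SAMPLE = (
    "У меня болит живот. Сегодня хороший день в Москве, "
    "люди гуляют у реки и читают книги."
)

models = {
    "sr": ngram_profile(SERBIAN_SAMPLE),
    "ru": ngram_profile(RUSSIAN_SAMPLE),
}

# Classify a short text against the toy models.
print(identify("Живот је леп", models))
```

Because the check runs locally, each tweet can be filtered without a network round trip. Very short texts such as a single homograph carry little signal, so results on individual words should be treated with caution.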