Thanks a lot for your suggestion, Diego! I will look into it.

All best,
Toma

On 29.11.2009, at 6:10, dbasch wrote:
> Toma,
>
> There are tools you could use to do language detection on your side
> and filter out non-Serbian tweets. I assume what slows you down is the
> call to Google's Language Detection API:
>
> http://code.google.com/apis/ajaxlanguage/documentation/#Detect
>
> You should try the n-gram based language identifier that comes with
> the Nutch search engine. You can build a language model for Serbian
> relatively quickly (just feed it a file with a fair amount of text in
> Serbian) and see how well it works:
>
> http://wiki.apache.org/nutch/LanguageIdentifier
> http://lucene.apache.org/nutch/
>
> Diego
>
> On Nov 29, 1:30 am, Тома Тасовац <transpoet...@gmail.com> wrote:
>> Could somebody from the Twitter team please address my question about
>> language recognition in the search API?
>>
>> Many thanks in advance.
>> T.
>>
>> On 25.11.2009, at 10:00, Toma wrote:
>>
>>> Hi there.
>>>
>>> I am working on a WordNet-based Serbian-English dictionary (part of
>>> the Transpoetika Project at the Belgrade Center for Digital Humanities,
>>> http://humanistika.org).
>>>
>>> I've implemented a "LiveQuote" system with Twitter, where we fetch the
>>> most recent tweets exemplifying the use of a given dictionary entry. We
>>> also have several other ideas on how to integrate Twitter into our
>>> dictionary application, both on the production and reception ends.
>>>
>>> But we're facing a serious performance issue: Twitter's language
>>> parameter (lang) does not recognize Serbian (sr). My workaround has
>>> been to use Google Translate's API to check that tweets are really
>>> Serbian. It works, and Google is pretty good at this (not 100%, but
>>> close enough), but it has slowed the process down considerably --
>>> every tweet we get for a given word has to be checked with Google
>>> before being displayed.
>>>
>>> Without a language check, however, we run into cases where Russian,
>>> Bulgarian, Macedonian etc. tweets sometimes sneak into our results
>>> thanks to interlingual homographs. E.g., живот in Serbian means
>>> "life", while in Russian it means "stomach".
>>>
>>> I am curious how you guys check for language identity on your backend,
>>> and whether there is any chance you could include Serbian in the
>>> list?
>>>
>>> All best,
>>> Toma
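
For anyone finding this thread later: the n-gram approach Diego describes can be sketched in a few lines. This is only a minimal illustration of the general idea (rank character n-grams by frequency, then compare rank orders), not Nutch's actual implementation, which is a Java plugin. The names `SAMPLES`, `ngram_profile`, `out_of_place`, `identify` and `keep_serbian` are made up for this sketch, and the two sample sentences are placeholders; a usable model would need "a fair amount of text" per language, as noted above.

```python
from collections import Counter

# Placeholder training text (assumption): one sentence per language is far
# too little for real use; feed in a large file of text per language instead.
SAMPLES = {
    "sr": "Живот је леп и ја волим свој језик и своју земљу.",
    "ru": "У меня болит живот, потому что я съел слишком много.",
}

def ngram_profile(text, n_max=3, top=300):
    """Return character n-grams (lengths 1..n_max), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    get the maximum penalty (the profile length)."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))

def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

def keep_serbian(tweets, profiles):
    """Filter a list of tweet texts down to those identified as Serbian."""
    return [t for t in tweets if identify(t, profiles) == "sr"]

# Build one profile per language from the sample text.
profiles = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

Since the check runs locally, calling something like `keep_serbian(tweets, profiles)` on the search results would avoid one remote request per tweet. Short tweets and homographs such as живот will still be hard cases, so accuracy would have to be measured on real data.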