John, yes, thanks a lot for the design proposal - that is what inspired my own system. I am not primarily filtering by language, however, but by country, so I'm using time zone and location data together with a list of cities from http://www.geonames.org/
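To make the idea concrete, here is a minimal sketch of that kind of country filter. All names are hypothetical; in practice the city set would be loaded from a GeoNames dump (e.g. cities1000.txt), and Iran is used only as an illustrative country:

```python
# Hypothetical sketch of a country filter combining a GeoNames city list
# with the account's time-zone (UTC offset) field.
IRAN_CITIES = {"tehran", "mashhad", "isfahan", "shiraz", "tabriz"}  # from GeoNames
IRAN_UTC_OFFSETS = {3.5, 4.5}  # IRST and IRDT, in hours

def likely_in_country(user_location, utc_offset_seconds):
    """Heuristic: the free-text location field mentions a known city,
    or the account's UTC offset matches the country's time zones."""
    if user_location:
        loc = user_location.lower()
        if any(city in loc for city in IRAN_CITIES):
            return True
    if utc_offset_seconds is not None:
        if utc_offset_seconds / 3600 in IRAN_UTC_OFFSETS:
            return True
    return False
```

Combining both signals is what pushes specificity up: the city match is precise but sparse, while the time-zone match catches users with an empty location field.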
The manual cross-check in my thesis shows that this gets you close to 1 in specificity and above .7 in sensitivity. From my experience, the key is to develop efficient language-specific tests with as low an error rate as possible (this, sadly, largely rules out conventional SVM and HMM models, because tweets are so short and full of weird punctuation).

Pascal

On Jul 3, 2010, at 15:26, John Kalucki wrote:

> It's great to hear that someone implemented all this. There's a similar
> technique documented here:
> http://dev.twitter.com/pages/streaming_api_concepts, under By Language and
> Country. My suggestion was to start with a list of stop words to build your
> user corpus -- but I don't know how well Farsi works with track, so random
> sample method might indeed be better.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
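(As an aside, one cheap language-specific test of the kind Pascal mentions is a script check. The sketch below is my own illustration, not Pascal's actual test, and it detects Arabic-script text generally rather than Farsi specifically; the threshold is an assumed parameter:)

```python
import re

# Matches any character in the Arabic Unicode block (U+0600-U+06FF),
# which Persian script also uses.
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF]")

def mostly_arabic_script(text, threshold=0.5):
    """True if more than `threshold` of the alphabetic characters
    fall in the Arabic Unicode block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    hits = sum(1 for c in letters if ARABIC_SCRIPT.match(c))
    return hits / len(letters) > threshold
```

A test like this runs in microseconds per tweet, which is what makes it viable at streaming rates where heavier classifiers are not.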