John,

Yes, thanks a lot for the design proposal - that is what inspired my own 
system. I am not primarily filtering by language, however, but by country, so 
I'm using time zone and location data together with a list of cities from 
http://www.geonames.org/
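To make the idea concrete, here is a minimal sketch of that kind of lookup (my illustration, not the actual thesis code): the free-text "location" field is matched against a city-to-country table built from the GeoNames dump, with the account's time-zone setting as a fallback. The table contents and the time-zone mapping below are invented sample data.

```python
# Hedged sketch: infer a user's country from the free-text location
# field and the Twitter time-zone setting, using a city -> country
# map derived from the GeoNames dump. Sample entries only.
CITY_TO_COUNTRY = {
    "tehran": "IR",
    "berlin": "DE",
    "paris": "FR",
}

# Hypothetical mapping from Twitter time-zone names to countries.
TIMEZONE_TO_COUNTRY = {
    "Tehran": "IR",
    "Berlin": "DE",
}

def infer_country(location, time_zone):
    """Return an ISO country code, or None if nothing matches."""
    if location:
        # Tokenize the free-text location and look each token up.
        for token in location.lower().replace(",", " ").split():
            if token in CITY_TO_COUNTRY:
                return CITY_TO_COUNTRY[token]
    # Fall back to the account's time-zone setting.
    return TIMEZONE_TO_COUNTRY.get(time_zone)
```

So `infer_country("Tehran, Iran", None)` resolves via the city list, while a user with an empty location but the "Berlin" time zone still gets a country guess.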

The manual cross-check in my thesis shows that this approach achieves 
specificity close to 1 and sensitivity above 0.7.

From my experience, the key is to develop efficient language-specific tests 
with as low an error rate as possible (this, sadly, largely rules out 
conventional SVM and HMM models, because tweets are so short and full of 
unusual punctuation).
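As one example of what such a test can look like (my illustration, not a method from the thesis), a likely-Farsi check can exploit the fact that Persian uses a few letters that standard Arabic script does not:

```python
# Hedged example of a cheap language-specific test: flag a tweet as
# likely Farsi if it is in Arabic script AND contains letters that
# exist in Persian but not in standard Arabic.
PERSIAN_ONLY = set("پچژگ")            # Persian-specific letters
ARABIC_BLOCK = range(0x0600, 0x0700)  # Unicode Arabic block

def looks_farsi(text):
    has_script = any(ord(ch) in ARABIC_BLOCK for ch in text)
    has_persian_letter = any(ch in PERSIAN_ONLY for ch in text)
    return has_script and has_persian_letter
```

A test like this runs in a single pass over the tweet and needs no trained model, which is the point: short, noisy text defeats statistical classifiers but not script-level heuristics.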

Pascal

On Jul 3, 2010, at 15:26 , John Kalucki wrote:

> It's great to hear that someone implemented all this. There's a similar 
> technique documented here: 
> http://dev.twitter.com/pages/streaming_api_concepts, under By Language and 
> Country. My suggestion was to start with a list of stop words to build your 
> user corpus -- but I don't know how well Farsi works with track, so a 
> random-sample method might indeed be better.
> 
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.