Hi Matt,

I have tried the language parameter of Twitter search and found the results very unreliable. For example, http://search.twitter.com/search?lang=all&q=tweetjobsearch returns 10 results (all in English), but http://search.twitter.com/search?lang=en&q=tweetjobsearch returns only 3.
I googled this list and it seems you are using an n-gram based algorithm (http://groups.google.com/group/twitter-development-talk/msg/565313d7b36e8d65). I have found that n-gram algorithms work very well for language detection, but the quality of the training data can make a big difference.

I recently wrote a language detector (in Ruby) myself: http://github.com/feedbackmine/language_detector/tree/master

It uses Wikipedia's data for training, and in my limited experience it works well. Using Wikipedia's data was not my idea; all credit goes to Kevin Burton (http://feedblog.org/2005/08/19/ngram-language-categorization-source/).

Just thought you might be interested.

@feedbackmine
http://twitter.com/feedbackmine

On Mar 31, 11:22 am, Matt Sanford <m...@twitter.com> wrote:
> Hi there,
>
> Can you provide an example URL where since_id isn't working so I
> can try and reproduce the issue? As for language, the language
> identifier is not 100% accurate and sometimes makes mistakes. Hopefully
> not too many mistakes, but it definitely does.
>
> Thanks;
> -- Matt Sanford / @mzsanford
>
> On Mar 31, 2009, at 08:14 AM, codepuke wrote:
>
> > Hi all;
> >
> > I see a few people complaining about since_id not working. I too
> > have the same issue - I am currently storing the last executed id and
> > having to check new tweets to make sure their id is greater than my
> > last processed id as a temporary workaround.
> >
> > I have also noticed that the filter by language param also doesn't
> > seem to be working 100% - I notice a few Chinese tweets, as well as
> > tweets having a null value for language...
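
P.S. For anyone following the thread who has not seen n-gram language detection before, here is a minimal sketch in Python of the general idea. It is not the code behind Twitter's language filter, and it is not a port of language_detector; it simply builds character trigram counts per language and picks the language whose profile has the highest cosine similarity to the input. The two training strings, the `TRAINING_TEXT` table, and the function names are all made up for illustration. A real detector would train on a much larger corpus (such as the Wikipedia data mentioned above), which is exactly where the quality of the training data starts to matter.

```python
import math
import re
from collections import Counter

# Hypothetical toy training text, only for illustration. A usable detector
# needs far more text per language than this.
TRAINING_TEXT = {
    "en": "the quick brown fox jumps over the lazy dog and this is a simple "
          "english sentence about searching for jobs",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esta es una "
          "frase sencilla en español sobre buscar trabajo",
}


def trigram_profile(text):
    """Count character trigrams over the lowercased words of `text`.

    Words are joined with single spaces and the string is padded with one
    space on each side, so word boundaries show up in the trigrams.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    padded = " " + " ".join(words) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count tables."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# One profile per language, built once from the toy training text.
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING_TEXT.items()}


def detect(text):
    """Return the language code whose profile is most similar to `text`."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


if __name__ == "__main__":
    print(detect("this is a simple sentence about the dog"))
    print(detect("esta es una frase sobre el perro"))
```

Note that this sketch always returns one of the trained languages, even for empty input or for text in a language it has never seen. A short tweet gives very few trigrams to work with, so a real detector would probably also want a minimum similarity threshold and an "unknown" result.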