Hi Adam,

Did you see this? I haven't tested it. I was just curious to look around after your post.
http://stackoverflow.com/questions/1550950/detect-chinese-multibyte-character-in-the-string

Matt Terenzio

On Thu, Mar 24, 2011 at 10:50 AM, Adam Green <140...@gmail.com> wrote:
> This has been a problem with collecting tweets from the API since I
> started working with it. My users only want English tweets and they
> view any non-English tweets I deliver as a bug in my software. The
> lang=en argument in the search API only filters a small percentage of
> these, and I know of no way to do any filtering in the streaming API. I
> started working with the PHP library called LanguageDetect a few days
> ago, and it is doing a great job.
>
> http://pear.php.net/package/Text_LanguageDetect/
>
> I tested it by filtering 40,000 recent tweets about @barackobama from
> my 2012twit.com site, and it found almost 20% of the tweets to be
> non-English. I screened the ones it found as non-English by hand, and
> found less than a 1% false positive rate. That means I lost 0.2% of
> the total flow to false positives to eliminate a 20% non-English rate.
> Pretty good for a solution that is small, about 2,500 lines of code,
> fast, open source, and free. I use it in my tweet parse phase of tweet
> collection. First I gather tweets into a MySQL cache with Phirehose,
> and then I parse the cached tweets into a normalized schema. During
> this parsing phase I screen each tweet with LanguageDetect. The
> additional processing time of language detection is unnoticeable.
>
> The only limitation I found is that it doesn't detect Chinese or
> Japanese, but I think I can find other solutions for this. If anyone
> knows of a simple PHP detection algorithm for these languages, please
> let me know.
>
> - Adam Green
> Twitter API Developer
> http://2012twit.com
> http://140dev.com
> @140dev
>
> --
> Twitter developer documentation and resources: http://dev.twitter.com/doc
> API updates via Twitter: http://twitter.com/twitterapi
> Issues/Enhancements Tracker:
> http://code.google.com/p/twitter-api/issues/list
> Change your membership to this group:
> http://groups.google.com/group/twitter-development-talk
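
For the Chinese/Japanese gap mentioned in the quoted message, one simple approach is to count characters that fall in the CJK Unicode blocks. The following is an untested-in-production sketch, written in Python for illustration rather than PHP; the same code point ranges should carry over to a PHP regular expression. The function name guess_cjk, the 0.2 threshold, and the choice of blocks are all assumptions, not part of LanguageDetect or the linked answer.

```python
import re

# Unicode blocks used as a rough script signal. Assumption: these cover
# the common cases; rarer CJK extension blocks are ignored.
KANA = re.compile(r"[\u3040-\u309F\u30A0-\u30FF]")  # Hiragana + Katakana
HAN = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]")   # CJK ideographs


def guess_cjk(text, min_ratio=0.2):
    """Return 'ja', 'zh', or None from script counts (hypothetical helper)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return None
    kana = sum(1 for c in chars if KANA.match(c))
    han = sum(1 for c in chars if HAN.match(c))
    # Ignore tweets where CJK characters are only a small fraction.
    if (kana + han) / len(chars) < min_ratio:
        return None
    # Kana appears only in Japanese. Han-only text is treated as Chinese,
    # which will misclassify Japanese written entirely in kanji.
    return "ja" if kana else "zh"
```

A tweet for which guess_cjk returns None could then be passed on to LanguageDetect as before, so this check would only act as a pre-filter for the scripts that library does not cover.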