[twitter-dev] I found a good solution for PHP language detection in tweets

2011-03-24 Thread Adam Green
This has been a problem with collecting tweets from the API since I
started working with it. My users only want English tweets, and they
view any non-English tweets that I deliver as a bug in my software. The
lang=en argument in the search API only filters out a small percentage
of these, and I know of no way to do any filtering in the streaming API.
I started working with the PHP library called Text_LanguageDetect a few
days ago, and it is doing a great job.

http://pear.php.net/package/Text_LanguageDetect/

I tested it by filtering 40,000 recent tweets about @barackobama from
my 2012twit.com site, and it flagged almost 20% of them as
non-English. I screened the flagged tweets by hand and found that fewer
than 1% were false positives, i.e. English tweets wrongly flagged. That
means I lost less than 0.2% of the total flow to false positives in
order to eliminate a non-English rate of almost 20%. Pretty good for a
solution that is small (about 2,500 lines of code), fast, open source,
and free. I use it in the parse phase of tweet collection. First I
gather tweets into a MySQL cache with Phirehose, and then I parse the
cached tweets into a normalized schema. During this parsing phase I
screen each tweet with LanguageDetect. The additional processing time
for language detection is unnoticeable.
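
For anyone who wants to check the arithmetic behind the 0.2% figure, here is a minimal sketch in Python. It uses the rounded upper bounds from the test above (20% flagged, 1% of the flagged tweets being false positives), so the result is a ceiling rather than a measured value. The function name share_lost is made up for this illustration.

```python
def share_lost(total_tweets, flagged_share, false_positive_rate):
    """Return (flagged, wrongly_dropped, share_of_total_lost).

    flagged_share: fraction of all tweets flagged as non-English.
    false_positive_rate: fraction of the flagged tweets that were
    actually English (checked by hand in the test described above).
    """
    flagged = total_tweets * flagged_share
    wrongly_dropped = flagged * false_positive_rate
    return flagged, wrongly_dropped, wrongly_dropped / total_tweets


# Rounded upper bounds: 40,000 tweets, 20% flagged, 1% false positives.
flagged, wrongly_dropped, lost = share_lost(40_000, 0.20, 0.01)
print(f"flagged: {flagged:.0f}, wrongly dropped: {wrongly_dropped:.0f}, "
      f"lost: {lost:.1%}")
# prints: flagged: 8000, wrongly dropped: 80, lost: 0.2%
```

In other words, at most about 80 English tweets out of 40,000 are dropped in exchange for removing roughly 8,000 non-English ones.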

The only limitation I found is that it doesn't detect Chinese or
Japanese, but I think I can find other solutions for this. If anyone
knows of a simple PHP detection algorithm for these languages, please
let me know.

- Adam Green
Twitter API Developer
http://2012twit.com
http://140dev.com
@140dev

-- 
Twitter developer documentation and resources: http://dev.twitter.com/doc
API updates via Twitter: http://twitter.com/twitterapi
Issues/Enhancements Tracker: http://code.google.com/p/twitter-api/issues/list
Change your membership to this group: 
http://groups.google.com/group/twitter-development-talk


Re: [twitter-dev] I found a good solution for PHP language detection in tweets

2011-03-24 Thread Matthew Terenzio
Hi Adam,

Did you see this? I haven't tested it. I was just curious and looked
around after your post.

http://stackoverflow.com/questions/1550950/detect-chinese-multibyte-character-in-the-string
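
The general idea of spotting Chinese or Japanese text by looking for characters in particular Unicode ranges can be sketched roughly as follows. This is an untested-in-production illustration in Python rather than PHP, and it is not taken from the linked answer. The function name guess_cjk and the choice of ranges (Hiragana and Katakana, U+3040 to U+30FF, and the main CJK Unified Ideographs block, U+4E00 to U+9FFF) are assumptions made for the example and are not exhaustive.

```python
import re

# Assumed ranges for a rough screen (not exhaustive):
# Hiragana U+3040-309F and Katakana U+30A0-30FF, which are Japanese-only.
KANA = re.compile(r"[\u3040-\u30ff]")
# CJK Unified Ideographs U+4E00-9FFF, shared by Chinese and Japanese.
HAN = re.compile(r"[\u4e00-\u9fff]")


def guess_cjk(text):
    """Return 'japanese', 'chinese', or None for a piece of text.

    Any kana means Japanese. Ideographs without kana are guessed to be
    Chinese, which will mislabel Japanese text written only in kanji.
    """
    if KANA.search(text):
        return "japanese"
    if HAN.search(text):
        return "chinese"
    return None


for sample in ["こんにちは", "你好", "hello @barackobama"]:
    print(sample, "->", guess_cjk(sample))
```

For an English-only filter the Chinese/Japanese distinction may not matter much, since either result means the tweet contains non-English text. A single quoted CJK word inside an otherwise English tweet would also trigger it, so counting matching characters against the tweet length might be a safer rule.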

Matt Terenzio
