On Thursday, May 13, 2010 02:25:27 pm Raffi Krikorian wrote:
> tweet text can potentially mention other users, lists, contain URLs, and
> contain hashtags -- in fact, something like 50% of tweets contain at least
> one of those. developers who want to understand the tweet text have to
> parse the text to try to extract those entities (which can get really hard
> and difficult when dealing with unicode characters) and then have to
> potentially make another REST call to resolve that data. matt sanford
> (@mzsanford) on our internationalization team released the twitter-text
> library (http://github.com/mzsanford/twitter-text-rb) to help making
> parsing easier and standardized (in fact, we use this library ourselves),
> but we on the Platform team wondered if we could make this even easier for
> our developers.
>
> as part of our JSON and XML payloads, we are going to start supporting an
> entities attribute that will contain this parsed and structured data.
> you'll see it like so:
>
> {
>   "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
>   ...
>   "entities" : {
>     "user_mentions" : [
>       {
>         "id" : 8285392,
>         "screen_name" : "raffi",
>         "indices" : [4, 9]
>       },
>       {
>         "id" : 3191321,
>         "screen_name" : "noradio",
>         "indices" : [16, 23]
>       }
>     ],
>     "urls" : [
>       {
>         "url" : "http://dev.twitter.com",
>         "indices" : [38, 64]
>       },
>     ],
>     "hashtags" : [
>       {
>         "text" : "#hot",
>         "indices" : [66, 69]
>         "url" : "http://search.twitter.com/search?q=%23hot"
>       }
>     ]
>   }
>   ...
> }
>
> or like so
>
> <status>
>   <text>hey @raffi tell @noradio to check out http://dev.twitter.com#hot</text>
>   ...
>   <entities>
>     <user_mentions>
>       <user_mention start="4" end="9">
>         <id>8285392</id>
>         <screen_name>raffi</screen_name>
>       </user_mention>
>       <user_mention start="16" end="23">
>         <id>3191321</id>
>         <screen_name>noradio</screen_name>
>       </user_mention>
>     </user_mentions>
>     <urls>
>       <url start="38" end="64">
>         <url>http://dev.twitter.com</url>
>       </url>
>     </urls>
>     <hashtags>
>       <hashtag start="66" end="69">
>         <text>#hot</text>
>         <url>http://search.twitter.com/search?q=%23hot</url>
>       </hashtag>
>     </hashtags>
>   </entities>
>   ...
> </status>
>
> as shown above, we'll be parsing out all mentioned users, all lists, all
> included URLs, and all hashtags. in the case of users, we'll provide you
> their user ID, and for hashtags we'll provide you the query you can run
> against the search API. and, for all of them, we'll also tell you at what
> character count the entity starts and stops -- that should really take the
> burden off you guys to parse the text properly.
>
> this entities block will probably be extended later, and these entities are
> just the start. have we missed anything? is there anything else you would
> like to see? as always - just drop us a note, and look for these entities
> to start slowly rolling out.
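
To illustrate how a client might consume the entities block proposed above, here is a minimal Python sketch. It is not Twitter's implementation: the payload is a hypothetical one modeled on the announcement, the helper name entity_spans is made up for illustration, and the indices are recomputed for this exact text and treated as inclusive [start, end]. That inclusive reading is an assumption based on the user_mentions example ([4, 9] for "@raffi"); the shipped API may differ.

```python
import json

# Hypothetical payload modeled on the quoted announcement. The indices are
# recomputed for this exact text and assumed to be inclusive [start, end].
PAYLOAD = """
{
  "text": "hey @raffi tell @noradio to check out http://dev.twitter.com #hot",
  "entities": {
    "user_mentions": [
      {"id": 8285392, "screen_name": "raffi", "indices": [4, 9]},
      {"id": 3191321, "screen_name": "noradio", "indices": [16, 23]}
    ],
    "urls": [
      {"url": "http://dev.twitter.com", "indices": [38, 59]}
    ],
    "hashtags": [
      {"text": "#hot", "indices": [61, 64]}
    ]
  }
}
"""


def entity_spans(tweet):
    """Return (kind, start, end, fragment) tuples sorted by start index.

    No regex parsing of the tweet text is needed: each entity's indices
    are used to slice the text directly.
    """
    text = tweet["text"]
    spans = []
    for kind, items in tweet.get("entities", {}).items():
        for item in items:
            start, end = item["indices"]
            # end is assumed inclusive, so the slice extends one past it
            spans.append((kind, start, end, text[start:end + 1]))
    return sorted(spans, key=lambda span: span[1])


if __name__ == "__main__":
    for kind, start, end, fragment in entity_spans(json.loads(PAYLOAD)):
        print(f"{kind:14} [{start:2}, {end:2}] {fragment}")
```

Python strings are indexed by code point, so the slices line up as long as the offsets count characters, as the announcement says. A client in a language that indexes strings by bytes or UTF-16 units would probably need to convert the offsets first.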

That's awesome! Saves me hours of run-time regex grief! I wrote a Perl script a while back to do some of this.

One question - do people often tweet email addresses? My Perl script was intended to be a pre-processor for input to PostgreSQL, and the PostgreSQL lexer recognizes email addresses. So I put in code to recognize those too. But I rarely see email addresses in tweets.

--
M. Edward (Ed) Borasky
http://borasky-research.net/m-edward-ed-borasky/ @znmeb

"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős