Re: [twitter-dev] parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

M. Edward (Ed) Borasky Thu, 13 May 2010 15:08:14 -0700

On Thursday, May 13, 2010 02:25:27 pm Raffi Krikorian wrote:
> tweet text can potentially mention other users, lists, contain URLs, and
> contain hashtags -- in fact, something like 50% of tweets contain at least
> one of those.  developers who want to understand the tweet text have to
> parse the text to try to extract those entities (which can get really hard
> and difficult when dealing with unicode characters) and then have to
> potentially make another REST call to resolve that data.  matt sanford
> (@mzsanford) on our internationalization team released the twitter-text
> library (http://github.com/mzsanford/twitter-text-rb) to help make
> parsing easier and standardized (in fact, we use this library ourselves),
> but we on the Platform team wondered if we could make this even easier for
> our developers.
> 
> as part of our JSON and XML payloads, we are going to start supporting an
> entities attribute that will contain this parsed and structured data.
>  you'll see it like so:
> 
> {
>  "text" : "hey @raffi tell @noradio to check out
> http://dev.twitter.com#hot", ...
>  "entities" : {
>   "user_mentions" : [
>     {
>       "id" : 8285392,
>       "screen_name" : "raffi",
>       "indices" : [4, 9]
>     },
>     {
>       "id" : 3191321,
>       "screen_name" : "noradio",
>       "indices" : [16, 23]
>     }
>   ],
>   "urls" : [
>     {
>       "url" : "http://dev.twitter.com",
>       "indices" : [38, 64]
>     }
>   ],
>   "hashtags" : [
>     {
>       "text" : "#hot",
>       "indices" : [66, 69],
>       "url" : "http://search.twitter.com/search?q=%23hot"
>     }
>   ]
>  }
>  ...
> }
> 
> or like so
> 
> <status>
>   <text>hey @raffi tell @noradio to check out
> http://dev.twitter.com#hot</text> ...
>   <entities>
>     <user_mentions>
>       <user_mention start="4" end="9">
>         <id>8285392</id>
>         <screen_name>raffi</screen_name>
>       </user_mention>
>       <user_mention start="16" end="23">
>         <id>3191321</id>
>         <screen_name>noradio</screen_name>
>       </user_mention>
>     </user_mentions>
>     <urls>
>       <url start="38" end="64">
>         <url>http://dev.twitter.com</url>
>       </url>
>     </urls>
>     <hashtags>
>       <hashtag start="66" end="69">
>         <text>#hot</text>
>         <url>http://search.twitter.com/search?q=%23hot</url>
>       </hashtag>
>     </hashtags>
>   </entities>
>   ...
> </status>
> 
> as shown above, we'll be parsing out all mentioned users, all lists, all
> included URLs, and all hashtags.  in the case of users, we'll provide you
> their user ID, and for hashtags we'll provide you the query you can run
> against the search API.  and, for all of them, we'll also tell you at what
> character count the entity starts and stops -- that should really take the
> burden off you guys to parse the text properly.
> 
> this entities block will probably be extended later, and these entities are
> just the start.  have we missed anything?  is there anything else you would
> like to see?  as always - just drop us a note, and look for these entities
> to start slowly rolling out.
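
For readers who want to see how the start/stop positions might be consumed, here is a minimal Python sketch. It assumes a payload shaped like the JSON example above and assumes the "indices" pairs are half-open [start, end) offsets counted in Unicode code points; the announcement does not spell out either detail, and the sample tweet, its offsets, and the helper name entity_spans are made up for illustration.

```python
import json

# Hypothetical payload shaped like the announced format. The offsets are
# assumed to be half-open [start, end) code-point positions into "text".
payload = json.loads("""
{
  "text": "hey @raffi check out http://dev.twitter.com #hot",
  "entities": {
    "user_mentions": [
      {"id": 8285392, "screen_name": "raffi", "indices": [4, 10]}
    ],
    "urls": [
      {"url": "http://dev.twitter.com", "indices": [21, 43]}
    ],
    "hashtags": [
      {"text": "#hot", "indices": [44, 48]}
    ]
  }
}
""")


def entity_spans(status):
    """Return (kind, substring) pairs ordered by position in the text."""
    text = status["text"]
    spans = []
    for kind, items in status.get("entities", {}).items():
        for item in items:
            start, end = item["indices"]
            # Python strings index by code point, so no regex or manual
            # Unicode handling is needed to cut the entity out.
            spans.append((start, kind, text[start:end]))
    return [(kind, piece) for _, kind, piece in sorted(spans)]


for kind, piece in entity_spans(payload):
    print(kind, piece)
```

If the offsets turn out to be inclusive at the end, or counted in bytes or UTF-16 units, the slice would need adjusting accordingly.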


That's awesome! Saves me hours of run-time regex grief! I wrote a Perl script 
a while back to do some of this. One question - do people often tweet email 
addresses? My Perl script was intended to be a pre-processor for input to 
PostgreSQL, and the PostgreSQL lexer recognizes email addresses. So I put in 
code to recognize those too. But I rarely see email addresses in tweets.
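
As a rough illustration of that kind of pre-processing step, here is a small Python sketch (not the Perl script mentioned above). The pattern is a deliberately loose assumption, not a full address validator, and the name find_emails is hypothetical. It requires at least one character before the "@" and a dotted domain after it, so plain @mentions are not picked up.

```python
import re

# Loose, illustrative pattern: local part, "@", then a dotted domain.
# This is an assumption for the sketch, not a complete address grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


print(find_emails("mail ed@example.com or ping @znmeb"))
```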
-- 
M. Edward (Ed) Borasky
http://borasky-research.net/m-edward-ed-borasky/ @znmeb

"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős
