Raffi:
On May 13, 2:25 pm, Raffi Krikorian <ra...@twitter.com> wrote:
> as shown above, we'll be parsing out all mentioned users, all lists, all
> included URLs, and all hashtags....

This is an interesting step forward. The internationalisation considerations can be sticky, though.

I did some entity parsing from tweets as part of my "Twanguages" project (a language census of Twitter). One discovery was that people are in fact using hashtags in non-Latin scripts. Another is that some people use the '#' character without intending to create a hashtag (e.g. "we are #2 in line"). How will your entity parsing handle non-Latin hashtags, Latin-character hashtags with accented characters, and strings starting with '#' that are not intended as hashtags?

Also note that URLs can now have non-Latin top-level domain names, as well as non-Latin second-level domain names and path parts. For instance, http://وزارة-الأتصالات.مصر is a valid URL in the .مصر top-level domain. Will your entity parsing code handle such URLs?

In any case, it would be very helpful if the platform team would document exactly which regular expressions govern the entities you recognise. I might not agree with your definition of hashtag syntax, but at least I want to know what it is. See, for example, the running questions on how to measure the length of a status message.

> .... matt sanford
> (@mzsanford) on our internationalization team released the twitter-text
> library (http://github.com/mzsanford/twitter-text-rb) to help making parsing
> easier and standardized (in fact, we use this library ourselves), but we on
> the Platform team wondered if we could make this even easier for our
> developers. ...

I wasn't aware of this, and I'll take a look. Thank you for the tip!

-- Jim DeLaHunt, Vancouver, Canada
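
To make the hashtag cases above concrete, here is a minimal Python sketch comparing two candidate definitions. Both patterns and the helper name `find_hashtags` are illustrative assumptions only; they are not the rules Twitter applies, nor what the twitter-text library does. The point is simply that the answer changes with the regular expression chosen, which is why the exact expressions need to be documented.

```python
import re

# Assumption 1: a naive, ASCII-only definition of a hashtag.
ASCII_HASHTAG = re.compile(r"#([A-Za-z0-9_]+)")

# Assumption 2: a Unicode-aware definition. In Python 3 str patterns,
# \w matches letters and digits from any script. The lookbehind rejects
# a '#' glued to a preceding word, and the lookahead requires at least
# one letter, so purely numeric strings such as "#2" are not hashtags.
UNICODE_HASHTAG = re.compile(r"(?<!\w)#(?=\w*[^\W\d_])(\w+)")


def find_hashtags(pattern, text):
    """Return the hashtag bodies (without '#') that `pattern` finds in `text`."""
    return pattern.findall(text)


samples = [
    "we are #2 in line",         # '#' not intended as a hashtag
    "loving #caf\u00e9 today",   # Latin script with an accented character
    "#\u65e5\u672c\u8a9e",       # Japanese (non-Latin) hashtag
    "#\u0645\u0635\u0631",       # Arabic (non-Latin) hashtag
]

for text in samples:
    print(repr(text))
    print("  ASCII-only:   ", find_hashtags(ASCII_HASHTAG, text))
    print("  Unicode-aware:", find_hashtags(UNICODE_HASHTAG, text))
```

Under these assumptions, the ASCII-only pattern treats "#2" as a hashtag, cuts the accented hashtag short at the "é", and misses the Japanese and Arabic hashtags entirely. The Unicode-aware pattern handles these four samples, but it is still incomplete: `\w` does not match combining marks, so hashtags in scripts that depend on them would be truncated.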