Some discussion about this thread popped up on Twitter yesterday: <http://groups.google.com/group/twitter-development-talk/browse_thread/thread/44be91d5ec5850fa>
Alex states that it's 140 bytes per tweet. So, of course, Loren Brichter and I tried to prove that, with the following results:

1) 140 characters, including ones that get stored as HTML entities: <http://twitter.com/gnitset/status/1286202252>

At the time of posting, this tweet showed up on the site and in feeds with all 140 characters. After a few hours, the "<" was converted to "&lt;", increasing the count per character from one to four bytes and decreasing the tweet length from 140 characters to 69. (You can see this truncation at the end of the tweet: the "&" is from "&lt;".) Presumably, this happens as tweets in the memcache are written through to the backing store.

I also see a lot of Twitter clients that don't realize how special the &lt; and &gt; entities are. It took me a LONG time to figure out what was going on here.

2) 140 Unicode _multi-byte_ characters: <http://twitter.com/atebits/status/1286199010>

What's curious is that Loren's example with 140 characters uses the Unicode U+27A1 glyph, which takes 3 bytes in UTF-8. That is 420 bytes in total, so why didn't it get truncated? This seems to contradict Alex's statement in the thread mentioned above.

As people start to use things like Emoji and tinyarro.ws, and generally figure out that Unicode (UTF-8) is valid data on Twitter, our clients should adapt and display more accurate "characters remaining" counts. I can count bytes instead of characters, but I'm not sure whether I should.

No one likes a truncated tweet: we need an explicit statement on how to count and submit multi-byte characters and entities.

-ch
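P.S. For anyone who wants to compare the candidate counts on their own tweets, here is a minimal Python sketch. The function name `tweet_lengths` is made up for illustration, and the escaping step is only my guess at what happens on the way to the backing store: it uses the standard library's `html.escape`, which also rewrites "&" as "&amp;", and I don't know whether Twitter does that. It says nothing about which count Twitter actually enforces.

```python
import html


def tweet_lengths(text: str) -> dict:
    """Return several candidate 'lengths' for a tweet (hypothetical helper)."""
    # Assumption: storage escapes &, < and > the way html.escape does.
    escaped = html.escape(text, quote=False)
    return {
        "characters": len(text),                    # Unicode code points
        "utf8_bytes": len(text.encode("utf-8")),    # bytes on the wire
        "escaped_characters": len(escaped),         # after entity escaping
        "escaped_utf8_bytes": len(escaped.encode("utf-8")),
    }


if __name__ == "__main__":
    samples = {
        "140 x U+27A1": "\u27a1" * 140,
        "140 x '<'": "<" * 140,
    }
    for label, text in samples.items():
        print(label, tweet_lengths(text))
```

For 140 copies of U+27A1 this gives 140 characters but 420 UTF-8 bytes, and for 140 copies of "<" it gives 140 characters but 560 once escaped. Those are the two gaps between "characters" and "bytes" described above.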