Some discussion about this thread popped up on Twitter yesterday:

<http://groups.google.com/group/twitter-development-talk/browse_thread/thread/44be91d5ec5850fa>

Alex states that the limit is 140 bytes per tweet. So, of course, Loren
Brichter and I tried to test that, with the following results:

1) 140 characters, including ones that get converted to HTML entities:
<http://twitter.com/gnitset/status/1286202252>

At the time of posting, this tweet showed up on the site and in feeds
with all 140 characters. After a few hours, the "<" was converted to
"&lt;", increasing the count per character from one to four bytes and
decreasing the tweet length from 140 characters to 69. (You can see
this truncation at the end of the tweet: the "&" is from "&lt;")
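
Here is a rough sketch of the behavior described above. It is only my
guess at what the server does, not Twitter's actual code: it assumes
that only "<" and ">" are escaped and that the result is cut at 140.
The function name and the limit constant are made up for illustration.

```python
LIMIT = 140  # assumed storage limit, taken from the thread


def escape_and_truncate(text, limit=LIMIT):
    """Hypothetical model: escape < and >, then cut to the limit."""
    # Assumption: only these two characters are turned into entities.
    escaped = text.replace("<", "&lt;").replace(">", "&gt;")
    # The cut can land in the middle of an entity, leaving a stray "&".
    return escaped[:limit]


# Three plain characters followed by 137 "<" is 140 characters as typed.
stored = escape_and_truncate("aaa" + "<" * 137)
print(len(stored), stored[-5:])
```

With this input the cut falls just after the "&" of the last entity,
which matches the kind of dangling "&" visible at the end of the tweet.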

Presumably, this happens as tweets in the memcache are written through
to the backing store.

I also see a lot of Twitter clients that don't realize how special the
&lt; and &gt; entities are. It took me a LONG time to figure out what
was going on here.

2) 140 Unicode _multi-byte_ characters:
<http://twitter.com/atebits/status/1286199010>

What's curious is that Loren's example with 140 characters uses the
Unicode 27A1 glyph. It uses 3 bytes in UTF-8. Why didn't it get
truncated? This seems to contradict Alex's statement in the thread
mentioned above.
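
To put numbers on that, a quick check of how large 140 copies of
U+27A1 are once encoded as UTF-8 (the variable names are mine):

```python
arrow = "\u27a1"      # the glyph from Loren's tweet
tweet = arrow * 140   # 140 characters as a user would count them

char_count = len(tweet)                  # code points
byte_count = len(tweet.encode("utf-8"))  # bytes on the wire

print(char_count, byte_count)
```

That is 420 bytes for 140 characters, so a strict 140-byte limit would
have cut the tweet to 46 whole characters.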

As people start to use things like Emoji, tinyarro.ws and generally
figure out that Unicode (UTF-8) is a valid type of data on Twitter,
our clients should adapt and display more accurate "characters
remaining" counts. I can count bytes instead of characters, but I'm
not sure if I should or not.
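
For what it's worth, the two candidate counters look something like
this. Both are sketches under assumptions: the function names are
hypothetical, the limit of 140 comes from the thread, and the byte
version assumes "<" and ">" are stored as entities. Which one matches
the server is exactly the open question.

```python
LIMIT = 140  # assumed limit, from the thread


def remaining_by_characters(text, limit=LIMIT):
    """Count Unicode code points, as most clients do today."""
    return limit - len(text)


def remaining_by_bytes(text, limit=LIMIT):
    """Count UTF-8 bytes after the assumed < and > escaping."""
    escaped = text.replace("<", "&lt;").replace(">", "&gt;")
    return limit - len(escaped.encode("utf-8"))


for sample in ("a < b", "\u27a1"):
    print(repr(sample),
          remaining_by_characters(sample),
          remaining_by_bytes(sample))
```

The two counts agree for plain ASCII text and drift apart as soon as
an entity or a multi-byte character shows up.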

No one likes a truncated tweet: we need an explicit statement on how
to count and submit multi-byte characters and entities.

-ch
