Hi There,

    I'm sorry this never got updated. Some changes have been made and
are waiting to go out now. When I moved from the Platform (formerly
API) team to focus on international, I took over this issue.
    Once the current fix is deployed (probably in a week or so, since
I'm traveling at the moment), the definition of a character will be
consistent throughout our API. The change will always compute length
from the Unicode NFC [1] form of the string. Using the NFC form bases
the 140-character limit on the length as displayed rather than on
some under-the-covers byte arithmetic.
    I more than agree with the above statement that a character is a
character and Twitter shouldn't care. Data should be data. The main
issue with that is that some clients compose characters and some
don't. My common example of this is é. Depending on your client,
Twitter could get:

é - 1 code point, U+00E9 (2 bytes in UTF-8)
   - URL Encoded UTF-8: %C3%A9
   - http://www.fileformat.info/info/unicode/char/00e9/index.htm

-- or --

é - 2 code points, U+0065 U+0301 (3 bytes in UTF-8)
   - URL Encoded UTF-8: %65%CC%81
   - http://www.fileformat.info/info/unicode/char/0065/index.htm
     + plus: http://www.fileformat.info/info/unicode/char/0301/index.htm
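
A quick way to check both forms is to count code points and UTF-8
bytes directly. This is only a sketch in Python; the helper name
percent_encode_all is made up for this example and is not part of any
Twitter API.

```python
# Compare the precomposed and decomposed spellings of "é".
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def percent_encode_all(data: bytes) -> str:
    """Percent-encode every byte (hypothetical helper for this example)."""
    return "".join(f"%{byte:02X}" for byte in data)


for label, text in (("composed", composed), ("decomposed", decomposed)):
    utf8 = text.encode("utf-8")
    print(label, len(text), "code point(s),", len(utf8), "byte(s),",
          percent_encode_all(utf8))
```

Both strings render the same way, but they differ in code point count
and in byte count, which is why a length check gives different answers
depending on what the client sent.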

    So, with my fix, whenever the user sees é it will count as a
single character, no matter which client sent it. I'll announce
something in the change log once my fix is deployed.
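
For client authors who want their counters to agree with this, the
idea can be sketched as follows. This is a minimal illustration in
Python using the standard unicodedata module, not the code that runs
on Twitter's servers; the function names nfc_length and fits_limit are
assumptions made for the example.

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Count code points after normalizing the text to Unicode NFC."""
    return len(unicodedata.normalize("NFC", text))


def fits_limit(text: str, limit: int = 140) -> bool:
    """Return True if the NFC form of text is within the limit."""
    return nfc_length(text) <= limit


composed = "\u00e9"      # é as one code point
decomposed = "e\u0301"   # e + combining acute accent

print(nfc_length(composed), nfc_length(decomposed))
print(fits_limit(decomposed * 140))
```

Note that NFC only composes sequences that have a precomposed form in
Unicode, so this counts code points after composition and can still
differ from what a user perceives as one character for other combining
sequences.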

Thanks,
  — Matt Sanford / @mzsanford

[1] - http://www.unicode.org/reports/tr15/

On Sep 9, 6:05 am, TjL <luo...@gmail.com> wrote:
> It's been nearly 6 months. Has this question been answered? If so I missed it.
>
>
>
> On Tue, Mar 24, 2009 at 9:36 PM, Alex Payne <a...@twitter.com> wrote:
>
> > Unfortunately, nothing definitive. We're still looking into this.
>
> > On Tue, Mar 24, 2009 at 07:56, Craig Hockenberry
> > <craig.hockenbe...@gmail.com> wrote:
>
> >> Any news from the Service Team? I'd really like to get the counters
> >> right in an upcoming release...
>
> >> -ch
>
> >> On Mar 6, 12:18 pm, Alex Payne <a...@twitter.com> wrote:
> >>> I'm taking this email to our Service Team, the folks who work on the
> >>> back-end of the service. The whole "message body changing as it moves
> >>> from cache to backing store" thing is totally unacceptable. Answers
> >>> soon.
>
> >>> On Fri, Mar 6, 2009 at 09:43, Craig Hockenberry
> >>> <craig.hockenbe...@gmail.com> wrote:
>
> >>> > Some discussion about this thread popped up on Twitter yesterday:
>
> >>> > <http://groups.google.com/group/twitter-development-talk/browse_thread/thread/44be91d5ec5850fa>
>
> >>> > Alex states that it's 140 bytes per tweet. So, of course, Loren
> >>> > Brichter and I tried to prove that. With the following results:
>
> >>> > 1) 140 characters, including ones that are encoded as HTML entities:
> >>> > <http://twitter.com/gnitset/status/1286202252>
>
> >>> > At the time of posting, this tweet showed up on the site and in feeds
> >>> > with all 140 characters. After a few hours, the "<" was converted to
> >>> > "&lt;", increasing the count per character from one to four bytes and
> >>> > decreasing the tweet length from 140 characters to 69. (You can see
> >>> > this truncation at the end of the tweet: the "&" is from "&lt;")
>
> >>> > Presumably, this happens as tweets in the memcache are written through
> >>> > to the backing store.
>
> >>> > I also see a lot of Twitter clients that don't realize how special the
> >>> > &lt; and &gt; entities are. It took me a LONG time to figure out what
> >>> > was going on here.
>
> >>> > 2) 140 Unicode _multi-byte_ characters:
> >>> > <http://twitter.com/atebits/status/1286199010>
>
> >>> > What's curious is that Loren's example with 140 characters uses the
> >>> > Unicode 27A1 glyph. It uses 3 bytes in UTF-8. Why didn't it get
> >>> > truncated? This seems to contradict Alex's statement in the thread
> >>> > mentioned above.
>
> >>> > As people start to use things like Emoji, tinyarro.ws and generally
> >>> > figure out that Unicode (UTF-8) is a valid type of data on Twitter,
> >>> > our clients should adapt and display more accurate "characters
> >>> > remaining" counts. I can count bytes instead of characters, but I'm
> >>> > not sure if I should or not.
>
> >>> > No one likes a truncated tweet: we need an explicit statement on how
> >>> > to count and submit multi-byte characters and entities.
>
> >>> > -ch
>
> >>> --
> >>> Alex Payne - API Lead, Twitter, Inc.
> >>> http://twitter.com/al3x
>
> > --
> > Alex Payne - API Lead, Twitter, Inc.
> > http://twitter.com/al3x
