Right now, we measure length the way Ruby 1.8.x's String#size does,
which counts bytes rather than characters. It'll be Unicode-safe in
the future.
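
To make the difference concrete, here is a minimal sketch in Python (not
Twitter's actual code) that counts the tweet quoted below both ways. The
names TWEET, count_bytes and count_chars are made up for illustration,
and it assumes the text is UTF-8 with precomposed (NFC) accented letters.

```python
import unicodedata

# The tweet from the quoted message, with its wrapped lines joined by spaces.
TWEET = unicodedata.normalize(
    "NFC",
    "@gabrielemcrise acho que é um misto de pioneirismo +hype+base de "
    "usuários. E também o API, que cercou o serviço de ferramentas "
    "interessantes",
)


def count_bytes(text: str) -> int:
    """Length as a byte-oriented size method would report it for UTF-8 data."""
    return len(text.encode("utf-8"))


def count_chars(text: str) -> int:
    """Length in Unicode code points, which is closer to what a user expects."""
    return len(text)


if __name__ == "__main__":
    print("code points:", count_chars(TWEET))
    print("UTF-8 bytes:", count_bytes(TWEET))
```

The four accented letters take two bytes each in UTF-8, so the message is
140 code points but 144 bytes, and a byte-based limit sees it as too long.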

On Wed, Jan 7, 2009 at 18:04, zbowling <[email protected]> wrote:
>
> Welcome to UTF-8.
>
> This is something I consult on all the time. The days when encoded
> length equaled character count, or even displayed length, are long
> gone. It's a habit you have to break (and it doesn't help that
> languages like C and C++ call a byte a "char").
>
> In UTF-8, 1 character can take anywhere from 1 to 4 bytes.
>
> Basically:
> U+000000 to U+00007F (basic Latin) = 1 byte - the graceful part of
> UTF-8 is that it is directly equivalent to ASCII in that range.
> U+000080 to U+0007FF - 2 bytes
> U+000800 to U+00FFFF - 3 bytes
> U+010000 to U+10FFFF - 4 bytes
> (U+10FFFF is the highest code point, so 4 bytes is the maximum.)
>
> See: http://en.wikipedia.org/wiki/UTF-8
>
> Zac Bowling
> http://zbowling.com/
>
>
> On Jan 7, 7:39 pm, benjackson <[email protected]> wrote:
>> Just sent out the following tweet through the API:
>>
>> @gabrielemcrise acho que é um misto de pioneirismo +hype+base de
>> usuários. E também o API, que cercou o serviço de ferramentas
>> interessantes
>>
>> The international characters are being counted more than once and the
>> tweet shows up as:
>>
>> @gabrielemcrise acho que é um misto de pioneirismo +hype+base de
>> usuários. E também o API, que cercou o serviço de ferramentas inter
>
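
The byte ranges in Zac's message can be written as a small lookup. This
is only a sketch in Python: utf8_length is a hypothetical helper name,
not a library function, and it assumes modern UTF-8 (RFC 3629), where
U+10FFFF is the last code point.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses to encode one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    if code_point <= 0x7F:      # U+0000 to U+007F, same as ASCII
        return 1
    if code_point <= 0x7FF:     # U+0080 to U+07FF
        return 2
    if code_point <= 0xFFFF:    # U+0800 to U+FFFF
        return 3
    return 4                    # U+10000 to U+10FFFF


if __name__ == "__main__":
    # Cross-check the table against Python's own encoder at each boundary.
    for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
        print("U+%06X -> %d byte(s)" % (cp, utf8_length(cp)))
```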



-- 
Alex Payne - API Lead, Twitter, Inc.
http://twitter.com/al3x
