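Right now, we count length the way Ruby 1.8.x's String#size does, which is in bytes rather than characters. That's why multibyte characters count more than once. It'll be Unicode-safe in the future.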
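
To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes Ruby 1.9 or later, where String#bytesize and String#length report different things; the helper names utf8_counts and utf8_bytes_for are made up for illustration and are not part of the Twitter API or its code.

```ruby
# Hypothetical helper: report both the byte count and the character count
# of a UTF-8 string. On Ruby 1.8.x, String#size gives the byte count.
def utf8_counts(text)
  { :bytes => text.bytesize, :chars => text.length }
end

# Hypothetical helper: how many bytes UTF-8 needs to encode one code point.
def utf8_bytes_for(codepoint)
  [codepoint].pack("U").bytesize
end

# "usuários" from the tweet, written with an escape so the source file's
# own encoding does not matter.
sample = "usu\u00E1rios"
counts = utf8_counts(sample)
puts "bytes=#{counts[:bytes]} chars=#{counts[:chars]}"

# Upper bound of each UTF-8 range and the bytes it takes to encode.
[0x7F, 0x7FF, 0xFFFF, 0x10FFFF].each do |cp|
  puts format("U+%06X -> %d byte(s)", cp, utf8_bytes_for(cp))
end
```

A limit enforced on the byte count cuts a message short by one unit for every extra byte in its accented characters, which matches the truncation reported below.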

On Wed, Jan 7, 2009 at 18:04, zbowling <[email protected]> wrote:
>
> Welcome to UTF-8.
>
> This is something I consult on all the time. The days when encoding
> length equaled character length, or even representation length, are
> long gone. It's something you have to break your mind of (and it
> doesn't help that languages like C and C++ call a byte a "char").
>
> One character can take anywhere from 1 to 4 bytes.
>
> Basically:
> U+000000 to U+00007F (basic Latin) - 1 byte - the graceful part of
> UTF-8 is that it is directly equivalent to ASCII in that range.
> U+000080 to U+0007FF - 2 bytes
> U+000800 to U+00FFFF - 3 bytes
> U+010000 to U+10FFFF - 4 bytes
>
> See: http://en.wikipedia.org/wiki/UTF-8
>
> Zac Bowling
> http://zbowling.com/
>
>
> On Jan 7, 7:39 pm, benjackson <[email protected]> wrote:
>> Just sent out the following tweet through the API:
>>
>> @gabrielemcrise acho que é um misto de pioneirismo +hype+base de
>> usuários. E também o API, que cercou o serviço de ferramentas
>> interessantes
>>
>> The international characters are being counted more than once and the
>> tweet shows up as:
>>
>> @gabrielemcrise acho que é um misto de pioneirismo +hype+base de
>> usuários. E também o API, que cercou o serviço de ferramentas inter
>

--
Alex Payne - API Lead, Twitter, Inc.
http://twitter.com/al3x
