One of my users mentioned that my client application was much more
conservative in counting non-English Unicode characters (specifically
Persian) than Twitter itself.

I've looked over the following thread, and all the other threads
referenced within it, without discovering a good answer:

I've noticed a couple of different behaviors.  With simple Unicode
characters (smiley face, arrow, etc.), the Twitter web interface
apparently counts each character as one when displaying the count.
Posting a long string of these, however, can lead to truncation.  It
seems the Twitter JavaScript may not be attempting to count Unicode
accurately.

With Persian Unicode, the Twitter web interface seems to allow a post
containing much more text than I would have expected.  My user
provided a sample here that I used to experiment with:

I'm using JavaScript code based on the method described here:

Does anyone have a more accurate counting method in JavaScript that
handles all types of Unicode better?
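For what it's worth, part of the discrepancy may be that JavaScript's
String.length counts UTF-16 code units rather than characters, so
anything outside the Basic Multilingual Plane counts as 2.  Here is a
minimal sketch of a counter that treats each surrogate pair as a
single character; note this only illustrates code-point counting and
makes no claim to match Twitter's actual server-side rules:

```javascript
// Count Unicode code points rather than UTF-16 code units.
// String.length treats a surrogate pair (one supplementary character)
// as two units; this counts it once.
function countCodePoints(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    // If this is a high surrogate followed by a low surrogate,
    // skip the low surrogate so the pair counts as one character.
    if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) i++;
    }
    count++;
  }
  return count;
}
```

Persian text is entirely within the BMP, so it counts the same either
way; the difference only shows up with supplementary characters such
as many emoji, where String.length reports 2 but this function
reports 1.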


- Scott
