Re: [twitter-dev] Re: UTF-8 and 140 characters still doesn't work?

2010-03-09 Thread Cameron Kaiser
> Raffi asked me about this but since I have a few moments over
> lunch I figured I would reply to the list. It's been so long but it
> feels good. Anyway, the issue is the last two bytes of your URL
> encoded values. From the Ruby irb console I can see:
> 
> >> CGI.unescape("%e3%83")
> => "###"
> >> CGI.unescape("%e3%83").unpack('U*')
> ArgumentError: malformed UTF-8 character (expected 3 bytes, given 2
> bytes)
> from (irb):13:in `unpack'
> from (irb):13
> 
> The issue is that %e3%83 is incomplete UTF-8. The lead byte %e3 starts a
> three-byte sequence, so it must be followed by two more bytes, like the
> "TE" character [1], which is %e3%83%86:
> 
> >> CGI.unescape("%e3%83%86")
> => "テ"
> >> CGI.unescape("%e3%83%86").unpack('U*')
> => [12486]
> 
> Since the exact length of the escape sequence is 140 I'm guessing
> there is still some code truncating the value based on byte counts.

Not sure how I missed that. Thanks for the find, Matt. If I still find some
weirdness after correcting that, I'll report back.

-- 
 personal: http://www.cameronkaiser.com/ --
  Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckai...@floodgap.com
-- Knowledge puffs up, but love builds up. -- 1 Corinthians 8:1 ---


[twitter-dev] Re: UTF-8 and 140 characters still doesn't work?

2010-03-09 Thread Matt Sanford
Hi Cameron,

Raffi asked me about this but since I have a few moments over
lunch I figured I would reply to the list. It's been so long but it
feels good. Anyway, the issue is the last two bytes of your URL
encoded values. From the Ruby irb console I can see:

>> CGI.unescape("%e3%83")
=> "###"
>> CGI.unescape("%e3%83").unpack('U*')
ArgumentError: malformed UTF-8 character (expected 3 bytes, given 2
bytes)
from (irb):13:in `unpack'
from (irb):13

The issue is that %e3%83 is incomplete UTF-8. The lead byte %e3 starts a
three-byte sequence, so it must be followed by two more bytes, like the
"TE" character [1], which is %e3%83%86:

>> CGI.unescape("%e3%83%86")
=> "テ"
>> CGI.unescape("%e3%83%86").unpack('U*')
=> [12486]
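
On a current Ruby the same check can be made without relying on the
exception from unpack. This is only a sketch: utf8_complete? is a
hypothetical helper name, and it assumes the escaped value is meant to be
UTF-8.

```ruby
require "cgi"

# Hypothetical helper (not part of any library): true when the
# percent-decoded bytes form well-formed UTF-8.
def utf8_complete?(escaped)
  decoded = CGI.unescape(escaped).dup
  decoded.force_encoding(Encoding::UTF_8).valid_encoding?
end

truncated = utf8_complete?("%e3%83")     # lead byte plus one continuation byte
complete  = utf8_complete?("%e3%83%86")  # all three bytes of U+30C6

puts "truncated sequence valid? #{truncated}"
puts "complete sequence valid?  #{complete}"
```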

Since the exact length of the escape sequence is 140 I'm guessing
there is still some code truncating the value based on byte counts.
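
To illustrate that guess (this is not Twitter's code, just a sketch of
how a byte-based cut could produce a trailing %e3%83), compare cutting a
string at 140 bytes with cutting it at 140 characters. The sample text
is made up for the example.

```ruby
# 47 copies of U+30C6: 47 characters, but 141 bytes in UTF-8.
text = "テ" * 47

by_bytes = text.byteslice(0, 140)  # cut at 140 bytes
by_chars = text[0, 140]            # cut at 140 characters

puts text.length                   # character count
puts text.bytesize                 # byte count
puts by_bytes.valid_encoding?      # the byte cut splits the last character
puts by_bytes.bytes.last(2).map { |b| format("%%%02x", b) }.join
puts by_chars == text              # the character cut leaves the text whole
```

The byte cut keeps 46 whole characters (138 bytes) plus the first two
bytes of the 47th, which is the same dangling pair seen above.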

Thanks;
  — Matt Sanford / Twitter Engineer

[1] - http://www.fileformat.info/info/unicode/char/30c6/index.htm

On Mar 9, 10:35 am, Cameron Kaiser  wrote:
> So I rewrote TTYtter to count in characters instead of bytes, because users
> have been asking for ages for full 140-character tweets, and I was under
> the impression that the API now supported them thanks to Raffi's confirmation.
> Unfortunately, there seems to be a bug as soon as the tweet gets over 140
> bytes (user credentials removed). The Japanese was picked to be exactly 10
> characters long (the "yo" hiragana lands on the 10th character). The return
> block is the response from the server, which is only edited for length. I
> attached the transcript. Notice that as soon as it gets overlength, it bombs.
>
> --
>  personal: http://www.cameronkaiser.com/ --
>   Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckai...@floodgap.com
> -- Shady business do not make for sunny life. -- Charlie Chan 
> -
>
>  utft.txt (5K)
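
For reference, the three lengths in play for a 10-character Japanese
string can be compared as below. The sample string and the length_report
helper are made up for illustration; the actual text from the transcript
is not reproduced here.

```ruby
require "cgi"

# Hypothetical helper: report the character count, the UTF-8 byte count,
# and the length of the percent-encoded form of a string.
def length_report(text)
  {
    characters: text.length,
    bytes: text.bytesize,
    escaped: CGI.escape(text).length
  }
end

sample = "あいうえおかきくけよ"  # made-up sample: 10 hiragana characters
report = length_report(sample)
puts report.inspect
```

Each hiragana character takes three bytes in UTF-8 and nine characters
once percent-encoded, so a limit applied to anything other than the
character count is reached much sooner.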