On Apr 4, 2014, at 2:01 PM, Hadriel Kaplan <[email protected]> wrote:
> For protocols which are actually truly UTF-8, I'm planning to just assume
> treating them as ASCII is ok, because as far as I know the atoi/strtol/etc.
> functions don't actually care: if they see the ASCII characters for digits
> (and +/-/etc.) they'll parse it, else not. So any non-ASCII UTF-8 character
> in the sequence is meaningless to them and they stop parsing at that
> character.
Yes, the only valid octets in a number in any "extended ASCII" would be:
	0x2b, 0x2d, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, and 0x37
	for any radix;

	0x38 and 0x39 if the radix is 10 or 16;

	0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x61, 0x62, 0x63, 0x64, 0x65, and
	0x66 if the radix is 16;

so anything with the 8th bit set is not valid, meaning that the same routine
can handle ASCII, ISO 8859-n, various Windows code pages, various Mac code
pages, and UTF-8 - the actual character encoding is irrelevant, as long as
ASCII characters are encoded as a single octet having the ASCII code point
value.
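
As a rough illustration of the point above, here is a minimal sketch in C. The helper name parse_ascii_number() is made up for this example and is not an existing Wireshark routine. It assumes the input is a NUL-terminated buffer and that the program is running in the "C" locale, so strtol() only skips ASCII white space. Note also that strtol() accepts an optional "0x" prefix when the radix is 16.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of Wireshark): parse a number from a
 * NUL-terminated byte string in any ASCII-compatible encoding.
 *
 * Returns the number of octets consumed, or 0 if no number could be
 * parsed or the value is out of range for a long. On success the
 * value is stored in *out.
 */
static size_t parse_ascii_number(const char *buf, int radix, long *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(buf, &end, radix);

    /* No digits found, or the value overflowed. */
    if (end == buf || errno == ERANGE)
        return 0;

    /*
     * strtol() stopped at the first octet that is not a valid digit
     * for this radix. Any octet with the 8th bit set (a UTF-8 lead or
     * continuation byte, an ISO 8859-n accented letter, etc.) ends
     * the number there.
     */
    *out = value;
    return (size_t)(end - buf);
}
```

For instance, calling it on the octets "123" followed by 0xC3 0xA9 (a UTF-8 e-acute) with radix 10 should consume three octets and yield 123, and the same holds if the trailing octet is 0xE9 from ISO 8859-1.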
___________________________________________________________________________
Sent via: Wireshark-dev mailing list <[email protected]>
Archives: http://www.wireshark.org/lists/wireshark-dev