Bug#525301: /usr/bin/isutf8: accepts UTF-8-encoded UTF-16 surrogates
* Lars Wirzenius l...@liw.fi, 2009-05-03, 19:36:
>> $ man utf-8 | grep -A 2 UTF-16 | sed -e 's/^ *//'
>> The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as 0xfffe
>> and 0xffff (UCS non-characters) should not appear in conforming UTF-8
>> streams.
>> $ s='\xed\xa0\x88\xed\xbd\x85' # 0xd808 + 0xdf45
>> $ printf $s | isutf8 && echo $?
>> 0
>
> Thanks for the bug report. You report very clear bugs!
>
> Attached is a patch that should fix the issue. Jakub, could you test it
> and verify that I've understood things correctly and that it really
> fixes the problem?

Looks fine to me.

-- 
Jakub Wilk
Bug#525301: /usr/bin/isutf8: accepts UTF-8-encoded UTF-16 surrogates
On Thu, 2009-04-23 at 16:52 +0200, Jakub Wilk wrote:
> Package: moreutils
> Version: 0.34
> Severity: normal
> File: /usr/bin/isutf8
>
> $ man utf-8 | grep -A 2 UTF-16 | sed -e 's/^ *//'
> The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as 0xfffe
> and 0xffff (UCS non-characters) should not appear in conforming UTF-8
> streams.
> $ s='\xed\xa0\x88\xed\xbd\x85' # 0xd808 + 0xdf45
> $ printf $s | isutf8 && echo $?
> 0

Thanks for the bug report. You report very clear bugs!

Attached is a patch that should fix the issue. Jakub, could you test it
and verify that I've understood things correctly and that it really
fixes the problem?

diff --git a/check-isutf8 b/check-isutf8
index 3abb315..83a4eed 100755
--- a/check-isutf8
+++ b/check-isutf8
@@ -39,5 +39,8 @@ check 1 '\xc2'
 check 1 '\xc2\x20'
 check 1 '\x20\xc2'
 check 1 '\300\200'
+check 1 '\xed\xa0\x88\xed\xbd\x85' # UTF-16 surrogates
+check 1 '\xef\xbf\xbe' # 0xFFFE
+check 1 '\xef\xbf\xbf' # 0xFFFF
 
 exit $failed
diff --git a/isutf8.c b/isutf8.c
index 4306c7d..c5f5eeb 100644
--- a/isutf8.c
+++ b/isutf8.c
@@ -127,6 +127,14 @@ static unsigned long decodeutf8(unsigned char *buf, int nbytes)
 			return INVALID_CHAR;
 		u = (u << 6) | (buf[j] & 0x3f);
 	}
+
+	/* Conforming UTF-8 cannot contain codes 0xd800–0xdfff (UTF-16
+	   surrogates) as well as 0xfffe and 0xffff. */
+	if (u >= 0xD800 && u <= 0xDFFF)
+		return INVALID_CHAR;
+	if (u == 0xFFFE || u == 0xFFFF)
+		return INVALID_CHAR;
+
 	return u;
 }
 
@@ -145,7 +153,7 @@ static int is_utf8_byte_stream(FILE *file, char *filename, int quiet) {
 	int nbytes, nbytes2;
 	int c;
 	unsigned long code;
-        unsigned long line, col, byteoff;
+	unsigned long line, col, byteoff;
 
 	nbytes = 0;
 	line = 1;
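
For readers who want to try the range test outside of isutf8.c, here is a minimal standalone sketch of the same two comparisons. The function name is_forbidden_code is hypothetical and does not appear in moreutils; the sketch only restates the ranges quoted from the utf-8 man page and makes no claim about how the rest of decodeutf8 behaves.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of isutf8.c): returns true when the
   decoded code value u is one that the utf-8 man page says should not
   appear in a conforming UTF-8 stream. */
bool is_forbidden_code(unsigned long u)
{
    /* UTF-16 surrogates, 0xD800 through 0xDFFF inclusive. */
    if (u >= 0xD800 && u <= 0xDFFF)
        return true;
    /* The two values the man page calls UCS non-characters. */
    if (u == 0xFFFE || u == 0xFFFF)
        return true;
    return false;
}
```

The values just outside the ranges (0xD7FF, 0xE000, 0xFFFD) are the useful boundary cases, since an off-by-one in either comparison would show up there.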
Bug#525301: /usr/bin/isutf8: accepts UTF-8-encoded UTF-16 surrogates
Package: moreutils
Version: 0.34
Severity: normal
File: /usr/bin/isutf8

$ man utf-8 | grep -A 2 UTF-16 | sed -e 's/^ *//'
The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as 0xfffe
and 0xffff (UCS non-characters) should not appear in conforming UTF-8
streams.
$ s='\xed\xa0\x88\xed\xbd\x85' # 0xd808 + 0xdf45
$ printf $s | isutf8 && echo $?
0

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (900, 'unstable'), (500, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.26-1-686 (SMP w/2 CPU cores)
Locale: LANG=C, LC_CTYPE=pl_PL.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages moreutils depends on:
ii  libc6                  2.9-7      GNU C Library: Shared libraries
ii  perl                   5.10.0-19  Larry Wall's Practical Extraction

moreutils recommends no packages.

Versions of packages moreutils suggests:
pn  libtime-duration-perl  <none>     (no description available)
ii  libtimedate-perl       1.1600-9   Time and date functions for Perl

-- no debconf information

-- 
Jakub Wilk
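
To see why the six bytes in the report correspond to 0xd808 and 0xdf45, the arithmetic of a three-byte UTF-8 sequence can be sketched as below. The helper decode3 is a hypothetical name used only for illustration; it is not the decoder in isutf8.c and it checks nothing beyond the bit patterns of the three bytes.

```c
/* Hypothetical helper (not from moreutils): decode one 3-byte UTF-8
   sequence of the form 1110xxxx 10xxxxxx 10xxxxxx into its code value.
   Returns 0xFFFFFFFF if the bytes do not have that shape. No range
   checks are done, so surrogates decode like any other value. */
unsigned long decode3(const unsigned char *buf)
{
    if ((buf[0] & 0xF0) != 0xE0 ||
        (buf[1] & 0xC0) != 0x80 ||
        (buf[2] & 0xC0) != 0x80)
        return 0xFFFFFFFFUL;

    /* 4 payload bits from the lead byte, 6 from each continuation byte. */
    return ((unsigned long)(buf[0] & 0x0F) << 12)
         | ((unsigned long)(buf[1] & 0x3F) << 6)
         |  (unsigned long)(buf[2] & 0x3F);
}
```

For ed a0 88 this gives (0xd << 12) | (0x20 << 6) | 0x08 = 0xd808, and for ed bd 85 it gives (0xd << 12) | (0x3d << 6) | 0x05 = 0xdf45, both inside the surrogate range 0xd800–0xdfff.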