Bug#525301: /usr/bin/isutf8: accepts UTF-8-encoded UTF-16 surrogates

2009-05-05 Thread Jakub Wilk

* Lars Wirzenius l...@liw.fi, 2009-05-03, 19:36:

>> $ man utf-8 | grep -A 2 UTF-16 | sed -e 's/^ *//'
>> The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as 0xfffe
>> and 0xffff (UCS non-characters) should not appear in  conforming  UTF-8
>> streams.
>>
>> $ s='\xed\xa0\x88\xed\xbd\x85' # 0xd808 + 0xdf45
>> $ printf $s | isutf8 && echo $?
>> 0
>
> Thanks for the bug report. You report very clear bugs!
>
> Attached is a patch that should fix the issue. Jakub, could you test it
> and verify that I've understood things correctly and that it really
> fixes the problem?

Looks fine to me.

--
Jakub Wilk



--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#525301: /usr/bin/isutf8: accepts UTF-8-encoded UTF-16 surrogates

2009-05-03 Thread Lars Wirzenius
On Thu, 2009-04-23 at 16:52 +0200, Jakub Wilk wrote:
> Package: moreutils
> Version: 0.34
> Severity: normal
> File: /usr/bin/isutf8
>
> $ man utf-8 | grep -A 2 UTF-16 | sed -e 's/^ *//'
> The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as 0xfffe
> and 0xffff (UCS non-characters) should not appear in  conforming  UTF-8
> streams.
>
> $ s='\xed\xa0\x88\xed\xbd\x85' # 0xd808 + 0xdf45
> $ printf $s | isutf8 && echo $?
> 0

Thanks for the bug report. You report very clear bugs!

Attached is a patch that should fix the issue. Jakub, could you test it
and verify that I've understood things correctly and that it really
fixes the problem?
diff --git a/check-isutf8 b/check-isutf8
index 3abb315..83a4eed 100755
--- a/check-isutf8
+++ b/check-isutf8
@@ -39,5 +39,8 @@ check 1 '\xc2'
 check 1 '\xc2\x20'
 check 1 '\x20\xc2'
 check 1 '\300\200'
+check 1 '\xed\xa0\x88\xed\xbd\x85' # UTF-16 surrogates
+check 1 '\xef\xbf\xbe' # 0xFFFE
+check 1 '\xef\xbf\xbf' # 0xFFFF
 
 exit $failed
diff --git a/isutf8.c b/isutf8.c
index 4306c7d..c5f5eeb 100644
--- a/isutf8.c
+++ b/isutf8.c
@@ -127,6 +127,14 @@ static unsigned long decodeutf8(unsigned char *buf, int nbytes)
 return INVALID_CHAR;
 u = (u << 6) | (buf[j] & 0x3f);
 }
+
+/* Conforming UTF-8 cannot contain codes 0xd800–0xdfff (UTF-16 
+   surrogates) as well as 0xfffe and 0xffff. */
+if (u >= 0xD800 && u <= 0xDFFF)
+return INVALID_CHAR;
+if (u == 0xFFFE || u == 0xFFFF)
+return INVALID_CHAR;
+
 return u;
 }
 
@@ -145,7 +153,7 @@ static int is_utf8_byte_stream(FILE *file, char *filename, int quiet) {
 int nbytes, nbytes2;
 int c;
 unsigned long code;
-	unsigned long line, col, byteoff;
+unsigned long line, col, byteoff;
 
 nbytes = 0;
 line = 1;
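
For readers who want to try the range check outside isutf8.c, here is a
minimal standalone sketch of the condition the patch adds. The function
name is_rejected_code_point is hypothetical and does not appear in
moreutils; the real patch returns INVALID_CHAR from decodeutf8 instead of
a boolean. It only mirrors the two comparisons shown in the diff.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of isutf8.c: returns true when the
   decoded code point u is one of the values the patch treats as
   invalid in a UTF-8 stream. */
bool is_rejected_code_point(unsigned long u)
{
    /* UTF-16 surrogates, 0xD800 through 0xDFFF inclusive. */
    if (u >= 0xD800 && u <= 0xDFFF)
        return true;
    /* The two values named in the man page excerpt. */
    if (u == 0xFFFE || u == 0xFFFF)
        return true;
    return false;
}
```

Under this sketch, the two code points from the report (0xd808 and
0xdf45) are rejected, while the neighbouring values 0xD7FF and 0xE000
are still accepted.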


Bug#525301: /usr/bin/isutf8: accepts UTF-8-encoded UTF-16 surrogates

2009-04-23 Thread Jakub Wilk

Package: moreutils
Version: 0.34
Severity: normal
File: /usr/bin/isutf8

$ man utf-8 | grep -A 2 UTF-16 | sed -e 's/^ *//'
The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as 0xfffe
and 0xffff (UCS non-characters) should not appear in  conforming  UTF-8
streams.

$ s='\xed\xa0\x88\xed\xbd\x85' # 0xd808 + 0xdf45
$ printf $s | isutf8 && echo $?
0
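
To see why the six bytes above correspond to 0xd808 and 0xdf45, the
three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) can be unpacked by
hand. The following is a rough sketch only; decode3 is a hypothetical
helper, not a function from isutf8.c, and it deliberately performs no
validity checks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: unpacks one three-byte UTF-8 sequence
   (1110xxxx 10xxxxxx 10xxxxxx) into a code point. It does not verify
   the lead or continuation bits and does not reject any value. */
unsigned long decode3(const unsigned char *buf)
{
    assert(buf != NULL);
    return ((unsigned long)(buf[0] & 0x0f) << 12)  /* low 4 bits of lead byte */
         | ((unsigned long)(buf[1] & 0x3f) << 6)   /* 6 bits of 2nd byte */
         |  (unsigned long)(buf[2] & 0x3f);        /* 6 bits of 3rd byte */
}
```

Feeding it the bytes ed a0 88 gives 0xD808 and ed bd 85 gives 0xDF45,
both inside the surrogate range 0xd800–0xdfff quoted from the man page.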


-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (900, 'unstable'), (500, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.26-1-686 (SMP w/2 CPU cores)
Locale: LANG=C, LC_CTYPE=pl_PL.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages moreutils depends on:
ii  libc6 2.9-7  GNU C Library: Shared libraries
ii  perl  5.10.0-19  Larry Wall's Practical Extraction 


moreutils recommends no packages.

Versions of packages moreutils suggests:
pn  libtime-duration-perl none (no description available)
ii  libtimedate-perl  1.1600-9   Time and date functions for Perl

-- no debconf information

--
Jakub Wilk


