utf8::valid and \x14_000 - \x1F_0000
It appears that utf8::valid() and Encode::encode('utf8', ...) do not agree for characters 0x14_ - 0x1F_. I suggest utf8::valid() is broken.

The following:

    use strict ;
    use Encode qw(FB_QUIET LEAVE_SRC) ;

    printf "Perl v%vd Encode %s\n", $^V, $Encode::VERSION ;

    my $c = 0x ;
    while ($c 0x8000_) {
      my $s = chr($c) ;
      my $v = utf8::valid($s) ? 1 : 0 ;
      my $o = Encode::encode('utf8', $s, FB_QUIET() | LEAVE_SRC()) ;
      my $r = $o ? 1 : 0 ;
      if ($v != $r) {
        printf "0x%04X_%04X: utf8::valid=%d but Encode::encode=%d ", ($c >> 16), $c 0x, $v, $r ;
        Encode::_utf8_off($s) ;
        print map { sprintf '\x%02X', ord($_) } split(//, $s) ;
        print "\n" ;
      } ;
      if ($c 0x) { $c += 1 ; } else { $c += 0x ; } ;
    } ;

Produces:

    Perl v5.8.8 Encode 2.23
    0x0014_: utf8::valid=0 but Encode::encode=1 \xF5\x80\x80\x80
    0x0014_: utf8::valid=0 but Encode::encode=1 \xF5\x8F\xBF\xBF
    0x0015_: utf8::valid=0 but Encode::encode=1 \xF5\x90\x80\x80
    0x0015_: utf8::valid=0 but Encode::encode=1 \xF5\x9F\xBF\xBF
    0x0016_: utf8::valid=0 but Encode::encode=1 \xF5\xA0\x80\x80
    0x0016_: utf8::valid=0 but Encode::encode=1 \xF5\xAF\xBF\xBF
    0x0017_: utf8::valid=0 but Encode::encode=1 \xF5\xB0\x80\x80
    0x0017_: utf8::valid=0 but Encode::encode=1 \xF5\xBF\xBF\xBF
    0x0018_: utf8::valid=0 but Encode::encode=1 \xF6\x80\x80\x80
    0x0018_: utf8::valid=0 but Encode::encode=1 \xF6\x8F\xBF\xBF
    0x0019_: utf8::valid=0 but Encode::encode=1 \xF6\x90\x80\x80
    0x0019_: utf8::valid=0 but Encode::encode=1 \xF6\x9F\xBF\xBF
    0x001A_: utf8::valid=0 but Encode::encode=1 \xF6\xA0\x80\x80
    0x001A_: utf8::valid=0 but Encode::encode=1 \xF6\xAF\xBF\xBF
    0x001B_: utf8::valid=0 but Encode::encode=1 \xF6\xB0\x80\x80
    0x001B_: utf8::valid=0 but Encode::encode=1 \xF6\xBF\xBF\xBF
    0x001C_: utf8::valid=0 but Encode::encode=1 \xF7\x80\x80\x80
    0x001C_: utf8::valid=0 but Encode::encode=1 \xF7\x8F\xBF\xBF
    0x001D_: utf8::valid=0 but Encode::encode=1 \xF7\x90\x80\x80
    0x001D_: utf8::valid=0 but Encode::encode=1 \xF7\x9F\xBF\xBF
    0x001E_: utf8::valid=0 but Encode::encode=1 \xF7\xA0\x80\x80
    0x001E_: utf8::valid=0 but Encode::encode=1 \xF7\xAF\xBF\xBF
    0x001F_: utf8::valid=0 but Encode::encode=1 \xF7\xB0\x80\x80
    0x001F_: utf8::valid=0 but Encode::encode=1 \xF7\xBF\xBF\xBF

And the same for:

    Perl v5.10.0 Encode 2.23

--
Chris Hall highwayman.com
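The byte sequences reported in the output above can be decoded by hand to see which code points they stand for. What follows is only a sketch, assuming the usual four-byte UTF-8 bit layout (a 3-bit payload in the lead byte, 6 bits in each continuation byte); decode4 is a hypothetical helper written for this illustration, not part of utf8:: or Encode.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a four-byte UTF-8-style sequence
# (lead byte 0xF0-0xF7 plus three continuation bytes) into an integer.
sub decode4 {
    my @b = @_;
    die "expected four bytes\n"        unless @b == 4;
    die "not a four-byte lead byte\n"  unless ($b[0] & 0xF8) == 0xF0;
    for my $i (1 .. 3) {
        die "bad continuation byte\n"  unless ($b[$i] & 0xC0) == 0x80;
    }
    return (($b[0] & 0x07) << 18)
         | (($b[1] & 0x3F) << 12)
         | (($b[2] & 0x3F) <<  6)
         |  ($b[3] & 0x3F);
}

# First and last sequences from the report.
for my $seq ([0xF5, 0x80, 0x80, 0x80], [0xF7, 0xBF, 0xBF, 0xBF]) {
    printf "%s -> 0x%06X\n",
        join('', map { sprintf '\\x%02X', $_ } @$seq),
        decode4(@$seq);
}
# \xF5\x80\x80\x80 -> 0x140000
# \xF7\xBF\xBF\xBF -> 0x1FFFFF
```

Both values lie above 0x10FFFF, the highest code point Unicode assigns, so the whole reported range sits outside Unicode proper while still fitting in four bytes.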
Re: utf8::valid and \x14_000 - \x1F_0000
On Tue, 11 Mar 2008 you wrote

Chris Hall wrote 2008-03-11 18:48 (+):

I'm comfortable with the notion that perl characters are unsigned integers that overlap UCS, and happen to be held internally as a superset of UTF-8. I wonder if perl is completely comfortable.

It isn't. There are some very unfortunate features.

chr(n) throws various runtime warnings where 'n' isn't kosher UCS, and \x{h...h} throws the same ones at compile time. (...) I'm not sure I see the point of picking on a few values to warn about.

I don't see the point, but Perl's warnings are arbitrary in several ways. Abigail has a lightning talk about the "interpreted as function" warning that illustrates this.

OK. In the meantime IMHO chr(n) should be handling utf8 and has no business worrying about things which UTF-8 or UCS think aren't characters.

Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode (UTF-8) are happy with.

Unicode defines 0xFFFE and 0x as non-characters, not just 0x (which Encode::en/decode do deem invalid).

In any case, is chr(n) supposed to be utf8 or UTF-8 ? AFAIKS, it's neither.

It's supposed to be neither on the outside. Internally, it's utf8.

One can turn off the warnings and then chr(n) will happily take any +ve integer and give you the equivalent character -- so the result is utf8, but the warnings are some (very) small subset of checking for UTF-8 :-(

I wonder what happens for n = 2^64. The encoding runs out at 2^72 !

If chr(-1) doesn't exist, then undef looks like a reasonable return value -- returning \x{FFFD} makes chr(-1) indistinguishable from chr(0xFFFD) -- where the first is nonsense and the second is entirely proper.

0xFFFD is the Unicode equivalent of undef. I think it makes sense in this case.

Well... Unicode says:

    REPLACEMENT CHARACTER: used to represent an incoming character whose value is unknown or unrepresentable in Unicode.

...so it has plenty to do without being used to represent a value which is completely beyond the range for characters, and for which perl has a perfectly good convention already.

...besides, if I want to see if chr(n) has worked I have to check that (a) the result is not \xFFFD and (b) that n is not 0xFFFD.

So we'll have to differ on this :-)

Chris

--
Chris Hall highwayman.com
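The ambiguity discussed above, where chr(-1) and chr(0xFFFD) yield the same string, can be sketched as follows. This is a minimal illustration, assuming chr() returns the replacement character for negative input as perlfunc documents; chr_or_undef is a hypothetical wrapper invented for this example, not an existing function.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not a core function): return undef instead of
# U+FFFD when n cannot be a character ordinal at all.
sub chr_or_undef {
    my ($n) = @_;
    return unless defined $n && $n >= 0;
    return chr($n);
}

# Plain chr() maps a negative number to U+FFFD, so the two results below
# compare equal and the caller cannot tell them apart.
my $from_negative = do { no warnings; chr(-1) };   # silence the negative-argument warning
my $from_fffd     = chr(0xFFFD);
print "indistinguishable\n" if $from_negative eq $from_fffd;

# The wrapper keeps the two cases distinct.
for my $n (-1, 0xFFFD) {
    my $c = chr_or_undef($n);
    printf "%d => %s\n", $n, defined $c ? sprintf('U+%04X', ord $c) : 'undef';
}
```

With the wrapper, a single defined() test replaces the two-part check on both the result and the argument.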
Re: utf8::valid and \x14_000 - \x1F_0000
Chris Hall wrote 2008-03-11 21:09 (+):

OK. In the meantime IMHO chr(n) should be handling utf8 and has no business worrying about things which UTF-8 or UCS think aren't characters.

It should do Unicode, not any specific byte encoding, like UTF-?8. Internally, a byte encoding is needed. As a programmer I don't want to be bothered with such implementation details.

Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode (UTF-8) are happy with. Unicode defines 0xFFFE and 0x as non-characters, not just 0x (which Encode::en/decode do deem invalid).

Personally, I think Perl should accept these characters without warning, except when the strict UTF-8 encoding is requested (which differs from the non-strict UTF8 encoding).

In any case, is chr(n) supposed to be utf8 or UTF-8 ? AFAIKS, it's neither. It's supposed to be neither on the outside. Internally, it's utf8. One can turn off the warnings and then chr(n) will happily take any +ve integer and give you the equivalent character -- so the result is utf8,

The result is Unicode. The difference between Unicode and UTF8 is not always clear, but in this case it is: the character is Unicode, a single codepoint; the internal implementation is UTF8.

    Unicode: U+20AC (one character: €)
    UTF-8:   E2 82 AC (three bytes)

I am under the impression that you know the difference and made an honest mistake. My detailed expansion is also for lurkers and archives.

[replacement character] So we'll have to differ on this :-)

Yes, although my opinion on this is not strong. undef or replacement character - both are good options. One argument in favor of the replacement character would be backwards compatibility.

--
Kind regards,
Juerd Waalboer: Perl hacker http://juerd.nl/sig
Convolution: ICT solutions and consultancy
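The distinction between the character U+20AC and its three-byte UTF-8 encoding can be checked directly. A short sketch, assuming only the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = chr(0x20AC);             # one character: the euro sign
my $bytes = encode('UTF-8', $char);  # the same character as UTF-8 bytes

printf "characters: %d\n", length $char;     # 1
printf "bytes:      %d\n", length $bytes;    # 3
printf "hex:        %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $bytes;   # E2 82 AC
```

length() counts characters in the first string and bytes in the second, which is the outside view of the distinction: the program sees one codepoint, and the byte encoding only appears once it is explicitly requested.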