Re: utf8::valid and \x14_000 - \x1F_0000
Chris Hall skribis 2008-03-12 13:20 (+):
> OK. In the meantime IMHO chr(n) should be handling utf8 and has no
> business worrying about things which UTF-8 or UCS think aren't
> characters.

It should do Unicode, not any specific byte encoding, like UTF-?8.

> IMHO chr(n) should do characters, which may be interpreted as per
> Unicode, but may not. When I said utf8 I was following the (sloppy)
> convention that utf8 means how Perl handles characters in strings...

I'm working hard to break this convention. I've changed a lot of Perl documentation, and the result was released with Perl 5.10. If in any place in Perl's official documentation it still reads UTF-8 or UTF8 for *characters in text strings*, it's wrong. Let me know and I will fix it :)

> b. in a Perl string, characters are held in a UTF-8 like form.

I'd say *inside* a Perl string. This is the C implementation, but a Perl programmer should not have to know the specific *internal* encoding of a Perl string. Likewise, in Perl you don't have to know whether your number is internally encoded as a long integer or a double.

> Where UTF-8 (upper case, with hyphen) means the RFC 3629 Unicode
> Consortium defined byte-wise encoding.

That's the theory, but it so often does not entirely follow the spec.

> This form is referred to as utf8 (lower case, no hyphen).

Yes, but note that encoding names in Perl are case insensitive. I tend to call it UTF8 sometimes.

> There is really no need to discuss this, except in the context of
> messing around in the guts of Perl.

Exactly.

> String literals are represented by UCS code points. Which reinforces
> the feeling that characters in Perl are Unicode.

Yes!

> 'C' uses 'wide' to refer to characters that may have values > 255.
> IMHO it's a shame that Perl did not follow this.

It does in some places, most notably warnings about wide characters.

> d. when exchanging character data with other systems one needs to
> deal with character set and encoding issues.

Not just other systems. All I/O is done in bytes, even with yourself, for example if you forked.

> Isolated surrogate code units have no interpretation on their own.
> (...) Clearly these are illegal in UTF-8.

They have no interpretation, but this also doesn't say it's illegal. Compare it with the undefined behaviour of multiple ++ in a single expression: there's no specification of what should happen, but it's not illegal to do it.

> Applications are free to use any of these noncharacter code points
> internally but should never attempt to exchange them.

I think it's not Perl's job to prevent exchange, simply because the exchange could be internal, but between processes of the same program.

> I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
> friends) in the same way as U+FFFF (and friends).

My gut says it's out of ignorance of the rules, and certainly not an intentional deviation.

> > The result is Unicode.
>
> IMHO the result of chr(n) should just be a character.

We call that a unicode character in Perl. It is true that Perl allows ordinal values outside the currently existing range, but it is still called unicode by Perl's documentation.

> OK, sure. I was using utf8 to mean any character value you like, and
> UTF-8 to imply a value which is recognised in UCS -- rather than the
> encoding.

Please use utf8 only for naming the byte encoding that allows any character value you like, not for the ordinal values themselves.

> FWIW I note that printf %vX is suggested as a means to render IPv6
> addresses. This implies the use of a string containing eight
> characters 0..0xFFFF as the packed form of IPv6. Building one of
> those using chr(n) will generate spurious warnings about 0xFFFE and
> 0xFFFF !

Interesting point.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker [EMAIL PROTECTED] http://juerd.nl/sig
Convolution: ICT solutions and consultancy [EMAIL PROTECTED]
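[Editor's note] The %vX point above is easy to demonstrate. A minimal sketch (the 2001:db8::1 address is illustrative only, and whether chr() actually warns for noncharacter values varies between Perl versions):

```perl
use strict;
use warnings;

# Pack an IPv6 address as eight "characters", one per 16-bit group --
# the representation that the printf %vX idiom implies.
my @groups = (0x2001, 0x0DB8, 0, 0, 0, 0, 0, 1);   # 2001:db8::1
my $addr   = join '', map { chr } @groups;

# %vX renders each character's ordinal as uppercase hex, dot-separated.
printf "%vX\n", $addr;                  # prints 2001.DB8.0.0.0.0.0.1

# A final group of 0xFFFE is a perfectly ordinary 16-bit value here,
# yet chr() may warn about a Unicode noncharacter (depending on the
# Perl version) -- the "spurious warnings" complaint in the thread.
my $odd = join '', map { chr } (0x2001, 0x0DB8, 0, 0, 0, 0, 0, 0xFFFE);
```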
Re: utf8::valid and \x14_000 - \x1F_0000
On Wed, 12 Mar 2008, Juerd Waalboer wrote:
> Chris Hall skribis 2008-03-12 13:20 (+):
> > String literals are represented by UCS code points. Which
> > reinforces the feeling that characters in Perl are Unicode.
>
> Yes!

OK. For the avoidance of doubt:

  a. are you saying that characters in Perl are Unicode ?

  b. or are you agreeing that characters in Perl take values
     0..0x7FFF_FFFF (or beyond), which are generally interpreted as
     UCS, where required and possible ?

If (a), then characters with ordinals beyond 0x10_FFFF should throw warnings (at least), since they clearly are not Unicode !

[in the context of U+D800..U+DFFF]
> > Isolated surrogate code units have no interpretation on their own.
> > (...) Clearly these are illegal in UTF-8.
>
> They have no interpretation, but this also doesn't say it's illegal.

The Unicode Standard defines the set of 'Unicode scalar values', which consists of U+0000..U+D7FF and U+E000..U+10_FFFF. All Unicode encodings, including UTF-8, encode only the 'Unicode scalar values'. The code points U+D800..U+DFFF exist, but do not contain any character assignments. Given that no Unicode encoding exists that allows these code points, it's unclear how one would ever end up with one of these things on one's hands !

[in the context of U+FFFE, U+FFFF etc.]
> > Applications are free to use any of these noncharacter code points
> > internally but should never attempt to exchange them.
>
> I think it's not Perl's job to prevent exchange. Simply because the
> exchange could be internal, but between processes of the same program.

Well, UTF-8 is jumping all over U+FFFF (at least). The warnings thrown by chr() and \x{h...h} suggest that Perl feels that exchanging these values ain't kosher.

> > I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
> > friends) in the same way as U+FFFF (and friends).
>
> My gut says it's out of ignorance of the rules, and certainly not an
> intentional deviation.

Well... I'm running some more tests on UTF-8 to see what it thinks is illegal.

> > > The result is Unicode.
> >
> > IMHO the result of chr(n) should just be a character.
>
> We call that a unicode character in Perl. It is true that Perl allows
> ordinal values outside the currently existing range, but it is still
> called unicode by Perl's documentation.

OK. This is the hair which I am splitting. IMHO the things in strings, and the things that chr() and ord() return or process, should be plain characters (ordinal U_INT) -- so that these are general purpose. Only when it's necessary to attach meaning to the characters in a string should Perl treat them as Unicode code points -- I accept that this is most of the time (but not *all* the time).

> > FWIW I note that printf %vX is suggested as a means to render IPv6
> > addresses. This implies the use of a string containing eight
> > characters 0..0xFFFF as the packed form of IPv6. Building one of
> > those using chr(n) will generate spurious warnings about 0xFFFE
> > and 0xFFFF !
>
> Interesting point.

What's more, the Unicode standard suggests various *internal* uses for U+FFFE and U+FFFF (and friends), including, but not limited to, terminators and separators. These will also generate spurious warnings from chr() or \x{...} !

Chris
--
Chris Hall highwayman.com
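[Editor's note] The "plain characters (ordinal U_INT)" view can be checked directly: chr() and ord() round-trip ordinals far beyond U+10FFFF. A sketch; which values provoke warnings (surrogates, noncharacters, super-Unicode ordinals) differs between Perl versions, so warnings are silenced here and only the round trip is shown:

```perl
use strict;
use warnings;

# Silence any "illegal" / noncharacter warnings; which values warn
# differs between Perl versions, and the point here is only that the
# ordinal value survives the round trip.
local $SIG{__WARN__} = sub { };

for my $n (0xD800, 0xFFFE, 0x10_FFFF, 0x11_0000, 0x7FFF_FFFF) {
    my $ch = chr $n;                    # one "character", Unicode or not
    die "round-trip failed for $n" unless ord($ch) == $n;
    printf "0x%X -> length %d\n", $n, length $ch;   # length is always 1
}
```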
UTF-8 (strict) appears borken
1. 'Ill-formed' UTF-8
=====================

The Unicode Standard specifies that any UTF-8 sequence that does not correspond to this table is 'ill-formed':

  Code Points          | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
  ---------------------+----------+----------+----------+----------+
  U+0000....U+007F     |  00..7F  |    --    |    --    |    --    |
  U+0080....U+07FF     |  C2..DF  |  80..BF  |    --    |    --    |
  U+0800....U+0FFF     |    E0    |  A0..BF  |  80..BF  |    --    |
  U+1000....U+CFFF     |  E1..EC  |  80..BF  |  80..BF  |    --    |
  U+D000....U+D7FF     |    ED    |  80..9F  |  80..BF  |    --    |
  U+E000....U+FFFF     |  EE..EF  |  80..BF  |  80..BF  |    --    |
  U+1_0000..U+3_FFFF   |    F0    |  90..BF  |  80..BF  |  80..BF  |
  U+4_0000..U+F_FFFF   |  F1..F3  |  80..BF  |  80..BF  |  80..BF  |
  U+10_0000..U+10_FFFF |    F4    |  80..8F  |  80..BF  |  80..BF  |

Note in particular that:

  - anything beyond U+10_FFFF is ill-formed.
  - anything U+D800..U+DFFF is ill-formed.
  - only one encoding for each Code Point is well-formed.

We'd expect UTF-8 decode to spot ill-formed sequences, though some special handling of incomplete sequences at the end of a buffer would be handy. We'd expect UTF-8 encode to only generate well-formed sequences.

2. Extended Sequences
=====================

Unicode and ISO/IEC 10646:2003 define meanings for UTF-8 compatible sequences of up to 6 bytes, which allows for characters up to 0x7FFF_FFFF. The Unicode reference code for reading UTF-8 recognises these extended sequences as single entities (though ill-formed). Perl has its own further 7 and 13 byte forms, allowing for characters up to 0xF_FFFF_FFFF and 2^72-1, respectively. These are beyond UTF-8.

3. Non-Characters
=================

The only other cause for concern are the non-characters. These are:

  * U+FFFE and U+FFFF, and the last two code points in every other
    Unicode plane. Unicode code space is divided into 17 'planes' of
    65,536 characters each, so U+01_FFFE, U+01_FFFF, U+02_FFFE,
    U+02_FFFF, ... U+10_FFFE and U+10_FFFF are all non-characters.

  * U+FDD0..U+FDEF

Now, Unicode 5.0.0 says:

  Applications are free to use any of these noncharacter code points
  internally but should never attempt to exchange them. If a
  noncharacter is received in open interchange, an application is not
  required to interpret it in any way.
  It is good practice, however, to recognize it as a noncharacter and
  to take appropriate action, such as removing it from the text.
  Noncharacter code points are reserved for internal use, such as for
  sentinel values. They should never be interchanged. They do,
  however, have well-formed representations in Unicode encoding forms
  and survive conversions between encoding forms. This allows sentinel
  values to be preserved internally across Unicode encoding forms,
  even though they are not designed to be used in open interchange.

So... this is not so clear-cut. For open interchange UTF-8 should disallow the non-characters. However, for local storage of Unicode stuff, non-characters should be allowed.

4. What 'UTF-8' Does
====================

Ill-formed sequences -- fine (mostly):

  * UTF-8 decode treats these as errors, and will stop or use fallback
    decoding as required.

    The default fallback is:

      - errors for sequences > 0x7FFF_FFFF -- replaced by U+FFFD

        *** information is being lost, here :-(

      - anything else: each byte which is not recognised as being part
        of a complete 2..6 byte sequence is replaced by U+FFFD

        *** so one cannot distinguish ill-formed sequences from out of
            range characters.

    The PERLQQ, HTMLCREF and XMLCREF fallbacks are:

      - errors for sequences > 0x7FFF_FFFF -- replaced by the
        respective escape sequence for the character value. This ought
        to work if the data is HTML or XML, where new escape sequences
        fit right in, if HTMLCREF or XMLCREF is used.

        *** PERLQQ, however, may fail if '\' appears in the input and
            the sender has not escaped it ! Perhaps PERLQQ should
            escape any '\' that appears in the input ?

        *** In all cases, however, all that's been achieved is that
            non-UTF-8 characters have been transliterated. It's still
            a puzzle what may be done with these characters !

      - anything else: each byte which is not recognised as being part
        of a complete 2..6 byte sequence is replaced by the respective
        escape sequence for the byte value.
        *** this is impossible to distinguish from escaped values
            which could exist in the input !

  * UTF-8 encode will not generate ill-formed sequences, and treats
    out of range character values as errors. Errors will stop encoding
    or cause the fallback encoding to be used.

    The default fallback is:

      - errored characters > 0x7FFF_FFFF -- replaced by
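[Editor's note] The decode behaviour described above can be observed with the core Encode module. A sketch, not a definitive statement of Encode's behaviour: the exact replacement strings can differ between Encode versions, so only the broad outcomes are shown, and copies are passed because a true CHECK value may modify its source buffer:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Ill-formed input: ED A0 80 would be the encoding of the surrogate
# U+D800, which a strict 'UTF-8' decoder must reject.
my $surrogate_bytes = "\xED\xA0\x80";

my $copy = $surrogate_bytes;            # FB_CROAK may modify its source
eval { decode('UTF-8', $copy, FB_CROAK) };
my $rejected = $@ ? 1 : 0;
print $rejected ? "surrogate rejected\n" : "surrogate accepted\n";

# With the default fallback the ill-formed bytes are replaced by
# U+FFFD, so the original byte values cannot be recovered afterwards.
my $repaired = decode('UTF-8', $surrogate_bytes);
print "replacement character used\n" if $repaired =~ /\x{FFFD}/;

# Well-formed but a noncharacter: EF BF BE encodes U+FFFE, which
# strict 'UTF-8' lets straight through, croak or no croak -- the
# asymmetry the thread complains about.
my $nonchar_bytes = "\xEF\xBF\xBE";
my $nonchar = decode('UTF-8', $nonchar_bytes, FB_CROAK);
printf "U+%04X passed through\n", ord $nonchar;
```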