Re: utf8::valid and \x14_000 - \x1F_0000
On Wed, 12 Mar 2008 Juerd Waalboer wrote:

> Chris Hall skribis 2008-03-12 20:49 (+):
> > a. are you saying that characters in Perl are Unicode ?
> Yes. They are called Unicode, at least. This has my preference for
> explanation and documentation.
> > b. or are you agreeing that characters in Perl take values
> > 0..0x7FFF_FFFF (or beyond), which are generally interpreted as UCS,
> > where required and possible ?
> This too. This is the more technically accurate explanation, and has
> my preference for implementation.

'This too' ? Goodness, superposition ! Perl and quantum mechanics ?
Suddenly it all becomes clear. Or at least as clear as the uncertainty
principle will allow !-)

FWIW, I have tried some of the HTTP, HTML and XML modules. The
warnings that pop out every now and then about Unicode or UTF-8 or
whatever are less than useful and more than irritating !

If (a) then characters with ordinals beyond 0x10_FFFF should throw
warnings (at least) since they clearly are not Unicode !

> Perl just has a somewhat broad definition of unicode, that is not the
> same as the official unicode character set.

BTW, in 2.14 Conforming to the Unicode Standard I found this gem:

  Unacceptable Behavior
  It is unacceptable for a conforming implementation:
  - To use unassigned codes.
    • U+2073 is unassigned and not usable for ‘3’ (superscript 3) or
      any other character.

This appears to say that unassigned codes should not be transmitted
out, same as non-characters ! Which looks like hard work. (On the
other hand, applications are supposed to cope with future defined code
points...) Should 'UTF-8' be strict about unassigned codes as well ?
What should chr() and \x{...} etc. do ?

This reinforces my view that chr(n) is (a) wrong to whinge about
surrogates and non-characters, and (b) wrong to return a character for
n outside 0..0x7FFF_FFFF. IMO:

- chr() shouldn't worry about strict UCS ...
- ... and doesn't, in any case, do a complete job [it does spot all
  non-characters and surrogates, but ignores unassigned codes.]
- ...
however, non-characters are perfectly legal UCS, at least for internal
  use. One can argue for jumping all over these when outputting
  (strict) UTF-8 for external exchange.
- ... and 0x11_FFFE is not defined by UCS to be a non-character, it's
  not defined in UCS at all, any more than any other character code
  > U+10_FFFF !
- chr(n) doesn't whinge about characters > U+10_FFFF ! (Except for the
  non-characters it has invented !)
- the answer to chr(-1) is 'not a character at all' -- it isn't 'the
  character that stands in place of some unknown character'
- the utility of characters > 0x7FFF_FFFF is not worth (a) the kludge
  required to extend utf8, or (b) the interoperability issues. Even
  encode/decode 'utf8' take a dim view of chars > 0x7FFF_FFFF. I note
  that utf8::valid() rejects characters > 0x7FFF_FFFF !
- chr(n) accepts characters > 0x7FFF_FFFF, even though the result is
  not valid per utf8::valid() !!
- chr(n) warns about p + 0xFFFE and p + 0xFFFF for every value of 'p',
  even those which are beyond the Unicode range ! It has its own utf8,
  it can have its own unicode too :)

And there was I thinking that things were already sufficiently
confused :-}

The 'utf8' decode does the Right Thing -- it decodes well-formed UTF-8
up to 0x7FFF_FFFF, handles errors and incomplete sequences, and
doesn't concern itself with the minutiae of UCS (surrogates,
non-characters and unassigned codes). This is nicely consistent with
utf8::valid(). [The only thing I would argue about is the separate
treatment of each byte of an invalid sequence -- I'd be tempted to
treat 0x00..0x7F and 0xC0..0xFF as terminators of an invalid sequence
and 0x80..0xBF as members of an invalid sequence.] If 'unicode' were
to follow that model, then chr() and friends could stop throwing
(spurious) warnings around the place.

Sadly, 'utf8' encode doesn't care, and outputs whatever is in the
string -- including redundant sequences, invalid sequences, incomplete
sequences and Perl's extended sequences for characters > 0x7FFF_FFFF.
That is, it will happily output something that utf8::valid would
reject. Note that this encoding is outputting something that 'utf8'
decode won't accept.

If you really want what 'utf8' encode currently does you can force
characters to octets (wax off) and output. The reverse is to input the
octets and force to characters (wax on).

Summary of Observations
-----------------------

* chr(n) and friends are broken:
  - they whinge about things that are none of their business, which is
    not consistent with the notion of (lax) 'unicode'.
  - the whingeing about not-(strict)-Unicode is, moreover, incomplete
    (unassigned codes and codes beyond the UCS range are allowed !)
  - non-characters are perfectly legal -- just not suitable for
    external exchange.
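[The noncharacter set this summary keeps referring to is small and regular: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. A sketch of the test in Python, for illustration only; the predicate name is ours, not Perl's:]

```python
def is_unicode_noncharacter(cp):
    """True if cp is a noncharacter per the Unicode Standard.

    Covers U+FDD0..U+FDEF and the last two code points of each plane
    (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ..., U+10FFFE/U+10FFFF).
    Code points beyond U+10FFFF are simply outside Unicode, so they
    are not noncharacters either -- the point made above about
    0x11_FFFE.
    """
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE
```

Note that by this definition 0x11FFFE is not a noncharacter: it is not defined in UCS at all.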
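[The resync rule proposed earlier in this message -- treat 0x80..0xBF as members of an invalid run, and 0x00..0x7F or 0xC0..0xFF as its terminator -- can be sketched as follows. Python for illustration; the function name is ours:]

```python
def split_invalid_run(data, i):
    """Given bytes `data` with an invalid UTF-8 sequence starting at
    index i, return the index just past the invalid run: continuation
    bytes (0x80..0xBF) extend the run, while any byte that could start
    a new sequence (0x00..0x7F ASCII, or a 0xC0..0xFF lead byte)
    terminates it."""
    j = i + 1
    while j < len(data) and 0x80 <= data[j] <= 0xBF:
        j += 1
    return j
```

So a truncated three-byte sequence followed by an ASCII byte is consumed as one two-byte invalid run, rather than reported byte by byte.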
Re: utf8::valid and \x14_000 - \x1F_0000
Chris Hall skribis 2008-03-12 13:20 (+):
> OK. In the meantime IMHO chr(n) should be handling utf8 and has no
> business worrying about things which UTF-8 or UCS think aren't
> characters.

It should do Unicode, not any specific byte encoding, like UTF-?8.

> IMHO chr(n) should do characters, which may be interpreted as per
> Unicode, but may not. When I said utf8 I was following the (sloppy)
> convention that utf8 means how Perl handles characters in strings...

I'm working hard to break this convention. I've changed a lot of Perl
documentation, and the result was released with Perl 5.10. If in any
place in Perl's official documentation it still reads UTF-8 or UTF8
for *characters in text strings*, it's wrong. Let me know and I will
fix it :)

> b. in a Perl string, characters are held in a UTF-8 like form.

I'd say *inside* a Perl string. This is the C implementation, but a
Perl programmer should not have to know the specific *internal*
encoding of a Perl string. Likewise, in Perl you don't have to know
whether your number is internally encoded as a long integer or a
double.

> Where UTF-8 (upper case, with hyphen) means the RFC 3629 Unicode
> Consortium defined byte-wise encoding.

That's the theory, but in practice the spec is so often not entirely
followed.

> This form is referred to as utf8 (lower case, no hyphen).

Yes, but note that encoding names in Perl are case insensitive. I tend
to call it UTF8 sometimes.

> There is really no need to discuss this, except in the context of
> messing around in guts of Perl.

Exactly.

> String literals are represented by UCS code points. Which reinforces
> the feeling that characters in Perl are Unicode.

Yes!

> 'C' uses 'wide' to refer to characters that may have values > 255.
> IMHO it's a shame that Perl did not follow this.

It does in some places, most notably warnings about wide characters.

> d. when exchanging character data with other systems one needs to
> deal with character set and encoding issues.

Not just other systems.
All I/O is done in bytes, even with yourself, for example if you
forked.

> > Isolated surrogate code units have no interpretation on their own.
> > (...)
> Clearly these are illegal in UTF-8.

They have no interpretation, but this also doesn't say it's illegal.
Compare it with the undefined behavior of multiple ++ in a single
expression. There's no specification of what should happen, but it's
not illegal to do it.

> > Applications are free to use any of these noncharacter code points
> > internally but should never attempt to exchange them.

I think it's not Perl's job to prevent exchange. Simply because the
exchange could be internal, but between processes of the same program.

> I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
> friends) in the same way as U+FFFF (and friends).

My gut says it's out of ignorance of the rules, and certainly not an
intentional deviation.

> > The result is Unicode.
> IMHO the result of chr(n) should just be a character.

We call that a unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but it is still
called unicode by Perl's documentation.

> OK, sure. I was using utf8 to mean any character value you like, and
> UTF-8 to imply a value which is recognised in UCS -- rather than the
> encoding.

Please use utf8 only for naming the byte encoding that allows any
character value you like, not for the ordinal values themselves.

> FWIW I note that printf %vX is suggested as a means to render IPv6
> addresses. This implies the use of a string containing eight
> characters 0..0xFFFF as the packed form of IPv6. Building one of
> those using chr(n) will generate spurious warnings about 0xFFFE and
> 0xFFFF !

Interesting point.
-- 
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker [EMAIL PROTECTED] http://juerd.nl/sig
Convolution: ICT solutions and consultancy [EMAIL PROTECTED]
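[The distinction argued over above -- surrogates versus noncharacters -- is visible in any strict UTF-8 implementation. A Python illustration, not from the thread: a lone surrogate exists as a code point but no Unicode encoding will serialize it, while a noncharacter such as U+FFFE encodes without complaint:]

```python
# A lone surrogate code point can be created, but strict UTF-8
# refuses to encode it -- there is no byte sequence for it.
lone = chr(0xD800)
try:
    lone.encode('utf-8')
    surrogate_encoded = True
except UnicodeEncodeError:
    surrogate_encoded = False

# A noncharacter, by contrast, round-trips through UTF-8 here:
# U+FFFE encodes to the three bytes EF BF BE.
nonchar_bytes = '\uFFFE'.encode('utf-8')
```

This matches the position that surrogates are structurally unencodable, while noncharacters are ordinary scalar values that are merely unsuitable for interchange.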
Re: utf8::valid and \x14_000 - \x1F_0000
On Wed, 12 Mar 2008 Juerd Waalboer wrote:

> Chris Hall skribis 2008-03-12 13:20 (+):
> > String literals are represented by UCS code points. Which
> > reinforces the feeling that characters in Perl are Unicode.
> Yes!

OK. For the avoidance of doubt:

a. are you saying that characters in Perl are Unicode ?

b. or are you agreeing that characters in Perl take values
   0..0x7FFF_FFFF (or beyond), which are generally interpreted as UCS,
   where required and possible ?

If (a) then characters with ordinals beyond 0x10_FFFF should throw
warnings (at least) since they clearly are not Unicode !

[in the context of U+D800..U+DFFF]
> > Isolated surrogate code units have no interpretation on their own.
> > (...)
> > Clearly these are illegal in UTF-8.
> They have no interpretation, but this also doesn't say it's illegal.

The Unicode Standard defines the set of 'Unicode scalar values' which
consists of U+0000..U+D7FF and U+E000..U+10_FFFF. All Unicode
encodings, including UTF-8, encode only the 'Unicode scalar values'.
The code points U+D800..U+DFFF exist, but do not contain any character
assignments. Given that no Unicode encoding exists that allows these
code points, it's unclear how one would ever end up with one of these
things on its hands !

[in the context of U+FFFE, U+FFFF etc.]
> > Applications are free to use any of these noncharacter code points
> > internally but should never attempt to exchange them.
> I think it's not Perl's job to prevent exchange. Simply because the
> exchange could be internal, but between processes of the same
> program.

Well, UTF-8 is jumping all over U+FFFF (at least). The warnings thrown
by chr() and \x{h...h} suggest that Perl feels that exchanging these
values ain't kosher.

> > I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
> > friends) in the same way as U+FFFF (and friends).
> My gut says it's out of ignorance of the rules, and certainly not an
> intentional deviation.

Well... I'm running some more tests on UTF-8 to see what it thinks is
illegal.

> > The result is Unicode.
> IMHO the result of chr(n) should just be a character. We call that a
> unicode character in Perl. It is true that Perl allows ordinal values
> outside the currently existing range, but it is still called unicode
> by Perl's documentation.

OK. This is the hair which I am splitting. IMHO the things in strings
and the things that chr() and ord() return or process should be plain
characters (ordinal U_INT) -- so that these are general purpose. Only
when it's necessary to attach meaning to the characters in a string
should Perl treat them as Unicode code points -- I accept that this is
most of the time (but not *all* the time).

> > FWIW I note that printf %vX is suggested as a means to render IPv6
> > addresses. This implies the use of a string containing eight
> > characters 0..0xFFFF as the packed form of IPv6. Building one of
> > those using chr(n) will generate spurious warnings about 0xFFFE
> > and 0xFFFF !
> Interesting point.

What's more, the Unicode standard suggests various *internal* uses for
U+FFFE and U+FFFF (and friends), including, but not limited to,
terminators and separators. This will also generate spurious warnings
from chr() or \x{...} !

Chris
-- 
Chris Hall highwayman.com
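[The 'Unicode scalar values' definition quoted in the previous message is mechanical enough to write down. A Python sketch for illustration; the function name is ours:]

```python
def is_unicode_scalar_value(cp):
    # U+0000..U+D7FF and U+E000..U+10FFFF, exactly as defined by the
    # Unicode Standard; the surrogate gap U+D800..U+DFFF is excluded,
    # as is everything beyond U+10FFFF.
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF
```

Note that noncharacters such as U+FFFE *are* scalar values -- which is exactly why treating them like surrogates is the wrong model.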
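[The packed-IPv6 trick discussed above -- eight characters, each holding one 16-bit group -- can be illustrated outside Perl. A Python analogue (the sample address and the ':' join are our choices; Perl's %vX uses its own separator), where chr(0xFFFE) raises no warning:]

```python
# Pack eight 16-bit groups of an IPv6 address as one "character" each,
# then render each character's ordinal in hex, as %vX would.
groups = [0x2001, 0x0DB8, 0, 0, 0, 0, 0xFFFE, 0x0001]
packed = ''.join(chr(g) for g in groups)   # includes U+FFFE, a noncharacter
rendered = ':'.join('%X' % ord(c) for c in packed)
```

The packed string legitimately contains U+FFFE; a chr() that warns about noncharacters makes this entirely internal use noisy for no benefit.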
Re: utf8::valid and \x14_000 - \x1F_0000
On Tue, 11 Mar 2008 you wrote:

> Chris Hall skribis 2008-03-11 18:48 (+):
> > I'm comfortable with the notion that perl characters are unsigned
> > integers that overlap UCS, and happen to be held internally as a
> > superset of UTF-8. I wonder if perl is completely comfortable.
> It isn't. There are some very unfortunate features.
> > chr(n) throws various runtime warnings where 'n' isn't kosher UCS,
> > and \x{h...h} throws the same ones at compile time. (...) I'm not
> > sure I see the point of picking on a few values to warn about.
> I don't see the point, but Perl's warnings are arbitrary in several
> ways. Abigail has a lightning talk about the 'interpreted as
> function' warning, that illustrates this.

OK. In the meantime IMHO chr(n) should be handling utf8 and has no
business worrying about things which UTF-8 or UCS think aren't
characters.

Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
(UTF-8) are happy with. Unicode defines 0xFFFE and 0xFFFF as
non-characters, not just 0xFFFF (which Encode::en/decode do deem
invalid).

In any case, is chr(n) supposed to be utf8 or UTF-8 ? AFAIKS, it's
neither.

> It's supposed to be neither on the outside. Internally, it's utf8.

One can turn off the warnings and then chr(n) will happily take any
+ve integer and give you the equivalent character -- so the result is
utf8, but the warnings are some (very) small subset of checking for
UTF-8 :-( I wonder what happens for n >= 2^64. The encoding runs out
at 2^72 !

> > If chr(-1) doesn't exist, then undef looks like a reasonable
> > return value -- returning \x{FFFD} makes chr(-1) indistinguishable
> > from chr(0xFFFD) -- where the first is nonsense and the second is
> > entirely proper.
> 0xFFFD is the Unicode equivalent of undef. I think it makes sense in
> this case.

Well... Unicode says:

  REPLACEMENT CHARACTER: used to represent an incoming character whose
  value is unknown or unrepresentable in Unicode.
...so it has plenty to do without being used to represent a value
which is completely beyond the range for characters, and for which
perl has a perfectly good convention already.

...besides, if I want to see if chr(n) has worked I have to check that
(a) the result is not \xFFFD and (b) that n is not 0xFFFD.

So we'll have to differ on this :-)

Chris
-- 
Chris Hall highwayman.com +44 7970 277 383
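[The distinguishability argument above can be made concrete. A sketch in Python of the two conventions being compared; the names are ours, and the range check is deliberately minimal:]

```python
REPLACEMENT = '\uFFFD'

def chr_or_none(n):
    # Returning None (Perl's undef) for an impossible ordinal keeps
    # chr(-1) distinguishable from a genuine chr(0xFFFD); returning
    # U+FFFD instead would conflate the two, forcing callers to check
    # both the result and the input, as argued above.
    return chr(n) if n >= 0 else None
```

With the replacement-character convention, `result == REPLACEMENT` alone cannot tell failure apart from a caller who really asked for U+FFFD.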
Re: utf8::valid and \x14_000 - \x1F_0000
Chris Hall skribis 2008-03-11 21:09 (+):
> OK. In the meantime IMHO chr(n) should be handling utf8 and has no
> business worrying about things which UTF-8 or UCS think aren't
> characters.

It should do Unicode, not any specific byte encoding, like UTF-?8.
Internally, a byte encoding is needed. As a programmer I don't want to
be bothered with such implementation details.

> Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
> (UTF-8) are happy with. Unicode defines 0xFFFE and 0xFFFF as
> non-characters, not just 0xFFFF (which Encode::en/decode do deem
> invalid).

Personally, I think Perl should accept these characters without
warning, except when the strict UTF-8 encoding is requested (which
differs from the non-strict UTF8 encoding).

> In any case, is chr(n) supposed to be utf8 or UTF-8 ? AFAIKS, it's
> neither.

It's supposed to be neither on the outside. Internally, it's utf8.

> One can turn off the warnings and then chr(n) will happily take any
> +ve integer and give you the equivalent character -- so the result
> is utf8,

The result is Unicode. The difference between Unicode and UTF8 is not
always clear, but in this case it is: the character is Unicode, a
single codepoint; the internal implementation is UTF8.

  Unicode: U+20AC (one character: €)
  UTF-8:   E2 82 AC (three bytes)

I am under the impression that you know the difference and made an
honest mistake. My detailed expansion is also for lurkers and
archives.

[replacement character]
> So we'll have to differ on this :-)

Yes, although my opinion on this is not strong. undef or the
replacement character -- both are good options. One argument in favor
of the replacement character would be backwards compatibility.
-- 
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker [EMAIL PROTECTED] http://juerd.nl/sig
Convolution: ICT solutions and consultancy [EMAIL PROTECTED]
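[Editorial illustration of the U+20AC example above, in Python rather than Perl: the character is one code point, its UTF-8 serialization is three bytes.]

```python
euro = '\u20ac'                                # Unicode: U+20AC, one character
utf8_bytes = euro.encode('utf-8')              # UTF-8: E2 82 AC, three bytes
# The two views describe the same text: one is the code point, the
# other is its byte serialization -- exactly the distinction drawn above.
```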