Re: Don't use the \C escape in regexes - Why not?
On 04.05.2010 at 11:09, Gisle Aas wrote:

> I regret that I let \C sneak into the URI module.

I think I may have understood why \C is not a good idea to use in that method, and maybe not in general. The fact that character strings in Perl are encoded in UTF-8 internally is an implementation detail, and you shouldn't have to care about it or make any assumptions based on it. But by using \C to derive an encoded version (a byte string) from a character string, and maybe even taking it for granted that you'll get a UTF-8 byte string, you're tying your interface to that implementation detail. And the behaviour of your code will change as soon as Perl moves on to use, say, UTF-16 as the internal encoding. (Which is highly unlikely, but that's another story.)

Is it this (theoretically fragile) implicitness in handling character strings that makes \C a bad idea?

It is probably not as bad an idea, though, as relying on the default platform encoding in Java ("default charset" in Java API doc lingo), which may differ from country to country and from installation to installation.

http://java.sun.com/javase/6/docs/api/java/lang/String.html#String%28byte[]%29

--
Michael.Ludwig (#) XING.com
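To make the "implementation detail" concrete, here is a minimal sketch showing that two strings can be equal as character strings while being stored differently inside perl. The variable names are made up for illustration. It deliberately avoids \C itself, since recent perls no longer accept it in regexes (it was removed in 5.24, as far as I know); `use bytes` is used only to peek at the internal storage, not as a recommendation.

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00E9.
my $native   = "\xE9";       # stored as a single byte, UTF8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # same character, now stored as UTF-8 internally

# Seen as character strings, the two are indistinguishable.
my $equal       = $native eq $upgraded                 ? 1 : 0;
my $same_length = length($native) == length($upgraded) ? 1 : 0;

# The internal flag and the internal byte count differ, and that is
# exactly what byte-level matching would expose to the caller.
my $flag_native    = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded  = utf8::is_utf8($upgraded) ? 1 : 0;
my $bytes_native   = do { use bytes; length($native) };
my $bytes_upgraded = do { use bytes; length($upgraded) };

print "equal: $equal, same length: $same_length\n";
print "UTF8 flag: $flag_native vs $flag_upgraded\n";
print "internal bytes: $bytes_native vs $bytes_upgraded\n";
```

Any interface whose result depends on the last two lines rather than the first one will behave differently for strings that the caller considers identical.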
Re: Don't use the \C escape in regexes - Why not?
On 04.05.2010 at 13:06, Michael Ludwig wrote:

> Is it this (theoretically fragile) implicitness in handling character strings that makes \C a bad idea?
>
> But probably not as bad an idea as relying on the default platform encoding in Java (default charset in Java API doc lingo), which may be different from country to country and from installation to installation.
>
> http://java.sun.com/javase/6/docs/api/java/lang/String.html#String%28byte[]%29

Or, more symmetrically to encoding via \C in Perl:

http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes%28%29

    public byte[] getBytes()

    Encodes this String into a sequence of bytes using the platform's
    default charset, storing the result into a new byte array.

This is a much more serious and real problem than implicitly encoding via \C in Perl, given that Java installations do not all use the same platform encoding, while all current Perl installations use the same internal encoding. (All Java installations use the same internal encoding, UTF-16, I think, but this fact is well hidden from the interface.)

--
Michael.Ludwig (#) XING.com
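For comparison, the explicit alternative to any implicit or default encoding is to name the encoding at the call site. A small sketch in Perl, using the core Encode module; the sample string and the `hex_dump` helper are only illustrations:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex octets.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $chars = "caf\x{E9}";    # four characters, the last one is U+00E9

# The encoding is named explicitly, so the result depends neither on the
# platform nor on how perl happens to store $chars internally.
my $utf8_bytes   = encode('UTF-8',      $chars);
my $latin1_bytes = encode('ISO-8859-1', $chars);

print "UTF-8:      ", hex_dump($utf8_bytes),   "\n";   # 63 61 66 C3 A9
print "ISO-8859-1: ", hex_dump($latin1_bytes), "\n";   # 63 61 66 E9
```

The same character string yields two different byte strings, which is why the choice should be visible in the code rather than inherited from the environment.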
Re: Don't use the \C escape in regexes - Why not?
I regret that I let \C sneak into the URI module. Now we have an interface that depends on the internal UTF-8 flag of the strings passed in. This makes it very hard to explain, makes it not do what you want when different types of strings are combined, and makes it hard to fix in ways that don't break some code.

My plan for fixing this is to introduce URI::IRI with an interface that encodes all non-URI characters as percent-encoded UTF-8, and to live with the inconsistency for URI (until Perl redefines what \C means).

--Gisle

On May 3, 2010, at 20:34, Michael Ludwig wrote:

> "Don't use the \C escape in regexes" - taken from Juerd's Unicode Advice page:
>
> http://juerd.nl/site.plp/perluniadvice
>
> Why not?
>
> -- perldoc perlre:
>
>     \C  Match a single C char (octet) even under Unicode.
>         NOTE: breaks up characters into their UTF-8 bytes,
>         so you may end up with malformed pieces of UTF-8.
>         Unsupported in lookbehind.
>
> -- URI::Escape
>
>     sub escape_char {
>         return join '', @URI::Escape::escapes{$_[0] =~ /(\C)/g};
>     }
>
> The regular expression is used to disassemble an incoming text string into individual bytes (and then use the resulting list in a hash slice). It is a legitimate use case, and the means seems to do the job.
>
> What's the problem with the \C escape?
>
> --
> Michael.Ludwig (#) XING.com
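A rough sketch of what "encode all non-URI characters as percent-encoded UTF-8" could look like without any reliance on the UTF-8 flag. This is not the URI::IRI interface, which is only described as a plan above; the function name `escape_as_utf8` is hypothetical, and treating the RFC 3986 "unreserved" set as the characters to leave alone is an assumption made for the example.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper, not part of URI::Escape or URI::IRI.
sub escape_as_utf8 {
    my ($chars) = @_;

    # Explicit encoding step: the result is a UTF-8 byte string no matter
    # how the input was stored internally.
    my $bytes = encode_utf8($chars);

    # Assumption: keep the RFC 3986 "unreserved" characters and
    # percent-encode every other octet.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $native   = "caf\xE9";    # UTF8 flag off
my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);    # same characters, UTF8 flag on

print escape_as_utf8($native),   "\n";   # caf%C3%A9
print escape_as_utf8($upgraded), "\n";   # caf%C3%A9
```

Both calls print the same result, which is the property the \C-based code cannot guarantee.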
Re: Don't use the \C escape in regexes - Why not?
* Michael Ludwig michael.lud...@xing.com [2010-05-04 14:55]:

> But wait a second: While URIs are meant to be made of characters, they're also meant to go over the wire, and there are no characters on the wire, only bytes. There is no standard encoding defined for the wire, although UTF-8 has come to be seen as the standard encoding for URIs containing non-ASCII characters.
>
> Given that Perl has two standard encodings for text (UTF-8 and ISO-8859-1) and relies on the internal flag to tell which one applies, shouldn't the URI module accept either only bytes or only characters? Or rather, provide two different constructors instead of only one trying to be intelligent?
>
>     URI->bytes( $bytes ); # byte string
>     URI->chars( $chars ); # character string
>
> And, in addition, define the character encoding used for serialization.

Yes, exactly. And both methods would use the moral equivalent of a plain `split //`, with no trickery such as `\C`. The only difference between them is that the `chars` method would `encode_utf8` the string first and then encode it blindly, whereas the `bytes` method would leave it as is but then croak if it found a codepoint > 0xFF (since the string is supposed to represent an octet sequence already).

Notably absent in both cases: any dependence on the state of the UTF8 flag of the string.

Regards,
--
Aristotle Pagaltzis // http://plasmasturm.org/
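The two entry points described above might be sketched roughly as follows. The names `escape_chars`, `escape_bytes` and `_escape_octets` are hypothetical and do not exist in the URI distribution. Like the original `escape_char`, the helper escapes every octet it is given, so it is meant for strings already selected for escaping.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Encode qw(encode_utf8);

# Percent-encode every element of an octet string, using a plain split //.
sub _escape_octets {
    my ($octets) = @_;
    return join '', map { sprintf '%%%02X', ord($_) } split //, $octets;
}

# Character string in: encode to UTF-8 first, then escape blindly.
sub escape_chars {
    my ($chars) = @_;
    return _escape_octets(encode_utf8($chars));
}

# Byte string in: every code point must fit into a single octet.
sub escape_bytes {
    my ($bytes) = @_;
    croak "Code point above 0xFF in byte string"
        if $bytes =~ /[^\x00-\xFF]/;
    return _escape_octets($bytes);
}

print escape_chars("\xE9"),     "\n";   # %C3%A9
print escape_bytes("\xE9"),     "\n";   # %E9
print escape_chars("\x{263A}"), "\n";   # %E2%98%BA

# A wide character is rejected by the byte-oriented entry point.
my $ok = eval { escape_bytes("\x{263A}"); 1 };
print "escape_bytes rejected the wide character\n" unless $ok;
```

Neither function inspects the UTF8 flag: upgrading or downgrading the argument beforehand does not change the result, only the caller's choice of entry point does.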