Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Michael Ludwig
On 04.05.2010 at 11:09, Gisle Aas wrote:

> I regret that I let \C sneak into the URI module.


I think I now understand why \C is not a good idea to use in that method, and 
perhaps not in general.

The fact that character strings in Perl are encoded in UTF-8 is an 
implementation detail, and you shouldn't rely on it or make any assumptions 
about it. But by using \C to derive an encoded version - a byte string - from 
a character string (and maybe even taking it for granted that you'll get a 
UTF-8 byte string), you're tying your interface to that implementation detail. 
The behaviour of your code will change as soon as Perl moves on to use, say, 
UTF-16 as the internal encoding. (Which is highly unlikely, but that's another 
story.)
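
A minimal sketch of the point above, assuming a Perl with the core Encode module available. It shows one character string held in two different internal forms, and an explicit encode_utf8 that yields the same bytes for both. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# The same one-character string (U+00E9), created twice.
my $native   = "\x{E9}";
my $upgraded = "\x{E9}";
utf8::upgrade($upgraded);    # changes the internal storage only, not the string value

# As character strings the two compare equal ...
my $equal = $native eq $upgraded ? 1 : 0;

# ... although the internal UTF-8 flag differs (the implementation detail).
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# Encoding explicitly does not depend on that detail.
my $bytes_native   = encode_utf8($native);
my $bytes_upgraded = encode_utf8($upgraded);

printf "equal=%d flags=%d/%d bytes=%s/%s\n",
    $equal, $flag_native, $flag_upgraded,
    unpack('H*', $bytes_native), unpack('H*', $bytes_upgraded);
```

Code that inspects the internal bytes directly would see different data for the two strings; code that encodes explicitly would not.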

Is it this (theoretically fragile) implicitness in handling character strings 
that makes \C a bad idea?

But probably not as bad an idea as relying on the default platform encoding in 
Java (default charset in Java API doc lingo), which may be different from 
country to country and from installation to installation.

http://java.sun.com/javase/6/docs/api/java/lang/String.html#String%28byte[]%29

-- 
Michael.Ludwig (#) XING.com



Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Michael Ludwig
On 04.05.2010 at 13:06, Michael Ludwig wrote:

> Is it this (theoretically fragile) implicitness in handling character strings 
> that makes \C a bad idea?
>
> But probably not as bad an idea as relying on the default platform encoding 
> in Java (default charset in Java API doc lingo), which may be different 
> from country to country and from installation to installation.
>
> http://java.sun.com/javase/6/docs/api/java/lang/String.html#String%28byte[]%29

Or, more symmetrically to encoding via \C in Perl:

http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes%28%29

  public byte[] getBytes()
Encodes this String into a sequence of bytes
using the platform's default charset, storing
the result into a new byte array.

This is a much more serious and real problem than implicitly encoding via \C 
in Perl, given that Java installations do not all use the same platform 
encoding, while all current Perl installations use the same internal encoding. 
(All Java installations use the same internal encoding of UTF-16, I think, but 
this fact is well hidden from the interface.)
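
To make the default-charset concern concrete, here is a small sketch in Perl (not Java), assuming the core Encode module. It shows that the bytes produced for one and the same character depend entirely on the encoding that gets picked, which is why an implicit default is risky. The three encodings are chosen purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{E9}";    # one character, U+00E9

# Name the target encoding explicitly instead of relying on a default.
my %bytes_for = map { $_ => encode($_, $chars) } qw(ISO-8859-1 UTF-8 UTF-16BE);

# Print the resulting octets as hex, one encoding per line.
for my $name (sort keys %bytes_for) {
    printf "%-10s %s\n", $name, unpack('H*', $bytes_for{$name});
}
```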

-- 
Michael.Ludwig (#) XING.com



Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Gisle Aas
I regret that I let \C sneak into the URI module.  Now we have an interface 
that depends on the internal UTF-8 flag of the strings passed in.  This makes it 
very hard to explain, makes it not do what you want when different types of 
strings are combined, and makes it hard to fix in ways that don't break some 
code.  My plan for fixing this is to introduce URI::IRI with an interface that 
encodes all non-URI characters as percent-encoded UTF-8 and to live with the 
inconsistency for URI (until Perl redefines what \C means).
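
A rough sketch of what "percent-encoded UTF-8" could look like. This is not the URI::IRI interface, which this thread does not show. The function name percent_encode_utf8 and the choice of the RFC 3986 unreserved set as the "URI characters" are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: encode the character string as UTF-8 first, then
# percent-encode every octet outside the RFC 3986 "unreserved" set.
sub percent_encode_utf8 {
    my ($chars) = @_;
    my $bytes = encode_utf8($chars);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# The result depends only on the characters, not on the internal flag.
my $plain    = "caf\x{E9}";
my $upgraded = "caf\x{E9}";
utf8::upgrade($upgraded);

print percent_encode_utf8($plain),    "\n";
print percent_encode_utf8($upgraded), "\n";
```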

--Gisle


On May 3, 2010, at 20:34, Michael Ludwig wrote:

> Don't use the \C escape in regexes - taken from Juerd's Unicode Advice page:
>
>   http://juerd.nl/site.plp/perluniadvice
>
> Why not?
>
> -- perldoc perlre:
> \C  Match a single C char (octet) even under Unicode.
>     NOTE: breaks up characters into their UTF-8 bytes,
>     so you may end up with malformed pieces of UTF-8.
>     Unsupported in lookbehind.
>
> -- URI::Escape
> sub escape_char {
>     return join '', @URI::Escape::escapes{$_[0] =~ /(\C)/g};
> }
>
> The regular expression is used to disassemble an incoming text string into 
> individual bytes (and then use the resulting list in a hash slice). It is a 
> legitimate use case, and the means seems to do the job. What's the problem 
> with the \C escape?
>
> -- 
> Michael.Ludwig (#) XING.com



Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Aristotle Pagaltzis
* Michael Ludwig michael.lud...@xing.com [2010-05-04 14:55]:
> But wait a second: While URIs are meant to be made of
> characters, they're also meant to go over the wire, and there
> are no characters on the wire, only bytes. There is no standard
> encoding defined for the wire, although UTF-8 has come to be
> seen as the standard encoding for URIs containing non-ASCII
> characters. Perl having two standard encodings (UTF-8 and
> ISO-8859-1) for text and relying on the internal flag to tell
> which one is meant to matter, shouldn't the URI module either
> only accept bytes or only characters? Or rather, provide two
> different constructors instead of only one trying to be
> intelligent?
>
>   URI->bytes( $bytes ); # byte string
>   URI->chars( $chars ); # character string
>
> And, in addition, define the character encoding used for
> serialization.

Yes, exactly. And both methods would use the moral equivalent of
a plain `split //` – no trickery such as with `\C`. The only
difference between them is that the `chars` method would
`encode_utf8` the string first and then encode it blindly,
whereas the `bytes` method would leave it as is but then croak if
it found a codepoint > 0xFF (since the string is supposed to
represent an octet sequence already).

Notably absent in both cases: any dependence on the state of the
UTF8 flag of the string.
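
The two entry points described above might be sketched roughly as follows. The names escape_chars, escape_bytes and _escape_octets are hypothetical and not part of the URI module, and for simplicity the sketch percent-encodes every octet rather than only the unsafe ones.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Encode qw(encode_utf8);

# Hypothetical helper: percent-encode every element of an octet string,
# using a plain split // rather than \C.
sub _escape_octets {
    my ($octets) = @_;
    return join '', map { sprintf('%%%02X', ord($_)) } split //, $octets;
}

# Hypothetical entry point for character strings: encode to UTF-8 first,
# then escape the resulting octets blindly.
sub escape_chars {
    my ($chars) = @_;
    return _escape_octets(encode_utf8($chars));
}

# Hypothetical entry point for byte strings: leave the data as is, but
# refuse any code point that cannot be an octet.
sub escape_bytes {
    my ($bytes) = @_;
    croak 'Code point above 0xFF in byte string' if $bytes =~ /[^\x00-\xFF]/;
    return _escape_octets($bytes);
}

print escape_chars("\x{E9}"), "\n";    # treated as the character U+00E9
print escape_bytes("\xE9"),   "\n";    # treated as the single octet 0xE9
```

Neither function consults utf8::is_utf8, so upgrading or downgrading the argument beforehand should not change the result.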

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/