Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-19 Thread Michael Ludwig
David E. Wheeler wrote on 16.06.2010 at 13:59 (-0700): On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote: On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote: In order to print Unicode text strings (as opposed to octet strings) correctly to a terminal (UTF-8 or not), add the following

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread John Delacour
At 00:27 +0100 18/6/10, I wrote: If I save the file and undo the second decoding I get the proper output In this case all talk of iso-8859-1 and cp1252 is a red herring. I read several Italian websites where this same problem is manifest in external material such as ads. The news page

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread David E. Wheeler
On Jun 18, 2010, at 12:05 AM, John Delacour wrote: In this case all talk of iso-8859-1 and cp1252 is a red herring. I read several Italian websites where this same problem is manifest in external material such as ads. The news page proper is encoded properly and declared as utf-8 but I

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread David E. Wheeler
On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote: So it may be valid UTF-8, but why does it come out looking like crap? That is, LaurinaviÃ≥Ÿius? I suppose there's an argument that LaurinaviÄŸius is correct and valid, if ugly. Maybe? I am unsure if this is the explanation

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread John Delacour
At 13:24 -0700 17/6/10, David E. Wheeler wrote: On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote: So the original character \x{010d} is represented by the bytes \x{c4} and \x{8d}, an application thinks those are in fact characters and encodes them again as \x{c3} + \x{84} and
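
The double encoding described in this message can be reproduced in a few lines. What follows is only a minimal sketch, not code from the thread: it assumes that the faulty application treated the UTF-8 octets as ISO-8859-1 characters before encoding them a second time, and the helper name hex_octets is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render an octet string as space-separated hex.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $octets;
}

# U+010D, LATIN SMALL LETTER C WITH CARON
my $char = "\x{010d}";

# Correct: one character encoded once, giving two octets.
my $once = encode('UTF-8', $char);

# The mistake (assumed here): read those octets as ISO-8859-1
# characters and encode the result as UTF-8 again.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Repair: reverse the steps, which means decoding twice.
my $fixed = decode('UTF-8', encode('ISO-8859-1', decode('UTF-8', $twice)));

print 'once:  ', hex_octets($once),  "\n";   # c4 8d
print 'twice: ', hex_octets($twice), "\n";   # c3 84 c2 8d
print 'fixed: ', ($fixed eq $char ? 'round trip ok' : 'mismatch'), "\n";
```

The repair only works when every character of the damaged string fits in ISO-8859-1, which is the case for text that was double-encoded in exactly this way.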

RE: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Henning Michael Møller Just
Hello (loved your PostgreSQL presentation at the most recent OSCON, BTW) Which editor do you use? When loading the script in Komodo IDE 5.2 the string looks broken. Running the script (ActivePerl 5.10.1 on Windows) only the second line is correct - the first (no surprise) and third are broken.

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote: I think what I need is some code to strip non-utf8 characters from a string -- even if that string has the utf8 bit switched on. I thought that Encode would do that for me, but in this case apparently not. Anyone got an example?

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 4:47 PM, Marvin Humphrey wrote: On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote: I think what I need is some code to strip non-utf8 characters from a string -- even if that string has the utf8 bit switched on. I thought that Encode would do that for me,

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote: So the UTF8 flag is enabled, and yet it has \303\204\302\215 in it. What is that crap? That's octal notation, which I think Dump() uses for any byte greater than 127 and for control characters, so that it can output pure ASCII.
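
To relate the octal escapes quoted here to the hex values used elsewhere in the thread, the escapes can be converted directly. This is a small sketch that works on the escaped text as a literal string; it does not call Dump() itself, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# The escaped sequence as quoted in the message above.
my $dumped = '\303\204\302\215';

# Turn each \NNN octal escape into the octet value it stands for.
my @octets = map { oct($_) } $dumped =~ /\\([0-7]{3})/g;

# Show the same octets in hex.
my $hex = join ' ', map { sprintf '%02x', $_ } @octets;
print "$hex\n";   # c3 84 c2 8d
```

Those four octets are what results when U+010D is UTF-8 encoded twice.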