RE: Variation In Decoding Between Encode and XML::LibXML
Hello (loved your PostgreSQL presentation at the most recent OSCON, BTW)

Which editor do you use? When loading the script in Komodo IDE 5.2 the string looks broken. Running the script (ActivePerl 5.10.1 on Windows), only the second line is correct; the first (no surprise) and third are broken.

Loading the file in UltraEdit-32 13.20+3, set to not convert the script on loading, it becomes obvious that what should have been one character is represented by 4 bytes, \xC3 \x84 \xC2 \x8D, which modern editors would probably show as 2 characters and as broken.

It looks to me like the string is being displayed as a byte representation of the characters, if that makes sense. My English isn't perfect :-/ and what I am trying to say is that this is a problem that I am quite familiar with. It happens whenever the source and the reader do not agree on whether a string is encoded in UTF-8 or not.

Apparently Encode fixes the incorrect string, which is nice. The interesting thing is, where should this be fixed? If it's at Yahoo! Pipes you'll probably have to use Encode as a work-around for some time...

Best regards
Henning Michael Møller Just

-----Original Message-----
From: David E. Wheeler [mailto:da...@kineticode.com]
Sent: Wednesday, June 16, 2010 7:56 AM
To: perl-unicode@perl.org
Subject: Variation In Decoding Between Encode and XML::LibXML

Fellow Perlers,

I'm parsing a lot of XML these days, and came upon a Yahoo! Pipes feed that appears to mangle an originating Flickr feed. But the curious thing is, when I pull the offending string out of the RSS and just stick it in a script, Encode knows how to decode it properly, while XML::LibXML (and my Unicode-aware editors) cannot.

The attached script demonstrates. $str has the bogus-looking character. Encode, however, seems to properly convert it to the č in Laurinavičius in the output. XML::LibXML, OTOH, outputs it as LaurinaviÄius -- that is, broken. (If things look truly borked in this email too, please look at the attached script.)

So my question is, what gives? Is this truly a broken representation of the character and Encode just figures that out and fixes it? Or is there something off with my editor and with XML::LibXML? FWIW, the character looks correct in my editor when I load it from the original Flickr feed. It's only after processing by Yahoo! Pipes that it comes out looking mangled.

Any insights would be appreciated.

Best,

David
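The four bytes reported above (\xC3 \x84 \xC2 \x8D) are what you would get if the UTF-8 encoding of č were misread as Latin-1 and then encoded as UTF-8 a second time. The following is a minimal sketch of that mechanism. It is an assumption about what happens to the feed, not something confirmed in this thread, and the helper `hex_bytes` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+010D LATIN SMALL LETTER C WITH CARON, the character the feed should contain.
my $char = "\x{10d}";

# Correct UTF-8 encoding: two bytes.
my $once = encode('UTF-8', $char);

# Simulate a reader that wrongly treats those two bytes as Latin-1
# characters and then encodes the result as UTF-8 again (assumed mechanism).
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

print hex_bytes($once),  "\n";   # C4 8D
print hex_bytes($twice), "\n";   # C3 84 C2 8D
```

The second line of output matches the byte sequence seen in UltraEdit, which is consistent with the "source and reader disagree" explanation.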
Re: Variation In Decoding Between Encode and XML::LibXML
On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
> I think what I need is some code to strip non-utf8 characters from a string
> -- even if that string has the utf8 bit switched on. I thought that Encode
> would do that for me, but in this case apparently not. Anyone got an example?

Try this:

    Encode::_utf8_off($string);
    $string = Encode::decode('utf8', $string);

That will replace any byte sequences which are invalid UTF-8 with the Unicode replacement character.

If you want to guarantee that the flag is on first, do this:

    utf8::upgrade($string);
    Encode::_utf8_off($string);
    $string = Encode::decode('utf8', $string);

Devel::Peek's Dump() function will come in handy for checking results.

Cheers,

Marvin Humphrey
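As a small illustration of the replacement behaviour described above, here is a sketch using a made-up input (a lone 0xFF byte, which can never appear in UTF-8). The variable names are illustrative only. This handles malformed bytes; it cannot flag a sequence that is well-formed UTF-8 but encodes the wrong characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: ASCII text with one byte that is invalid in UTF-8.
my $bytes = "abc\xFFdef";

# With no CHECK argument, decode() substitutes U+FFFD for malformed input.
my $replaced = decode('UTF-8', $bytes);

# To strip rather than replace, delete the replacement characters afterwards.
(my $stripped = $replaced) =~ s/\x{FFFD}//g;

binmode STDOUT, ':encoding(UTF-8)';
print "$replaced\n";
print "$stripped\n";
```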
Re: Variation In Decoding Between Encode and XML::LibXML
On Jun 16, 2010, at 4:47 PM, Marvin Humphrey wrote:
> On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
> > I think what I need is some code to strip non-utf8 characters from a string
> > -- even if that string has the utf8 bit switched on. I thought that Encode
> > would do that for me, but in this case apparently not. Anyone got an example?
>
> Try this:
>
>     Encode::_utf8_off($string);
>     $string = Encode::decode('utf8', $string);
>
> That will replace any byte sequences which are invalid UTF-8 with the Unicode
> replacement character.

Yeah. Not working for me. See attached script. Devel::Peek says:

    SV = PV(0x100801f18) at 0x10082f368
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x1002015c0 "<p>Tomas Laurinavi\303\204\302\215ius</p>"\0 [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
      CUR = 29
      LEN = 32

So the UTF8 flag is enabled, and yet it has \303\204\302\215 in it. What is that crap?

Confused and frustrated,

David

    #!/usr/local/bin/perl -w
    use 5.12.0;
    use Encode;
    use Devel::Peek;

    my $str = '<p>Tomas LaurinaviÃÂius</p>';
    my $utf8 = decode('UTF-8', $str);
    say $str;
    binmode STDOUT, ':utf8';
    say $utf8;
    Dump($utf8);
Re: Variation In Decoding Between Encode and XML::LibXML
On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:
> So the UTF8 flag is enabled, and yet it has \303\204\302\215 in it. What is
> that crap?

That's octal notation, which I think Dump() uses for any byte greater than 127 and for control characters, so that it can output pure ASCII. That sequence is only four bytes:

    mar...@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215"; Encode::_utf8_on($s); Dump $s'
    SV = PV(0x801038) at 0x80e880
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
      CUR = 4  <--- four bytes
      LEN = 8
    mar...@smokey:~ $

The logical content of the string follows in the second quote:

    [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]

That's valid UTF-8.

> my $str = '<p>Tomas LaurinaviÃÂius</p>';

In source code, I try to stick to pure ASCII and use \x escapes -- like Dump() does.

    my $str = "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>";

However, because those code points are both representable as Latin-1, Perl will create a Latin-1 string. If you want to force its internal encoding to UTF-8, you need to do additional work.

    mar...@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s; utf8::upgrade($s); Dump $s'
    SV = PV(0x801038) at 0x80e870
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x2012e0 "\304"\0
      CUR = 1
      LEN = 4
    SV = PV(0x801038) at 0x80e870
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
      CUR = 2
      LEN = 3
    mar...@smokey:~ $

> Confused and frustrated,

IMO, to get UTF-8 right consistently in a large Perl system, you need to understand the internals and you need Devel::Peek at hand. Perl tries to hide the details, but there are too many ways for it to fail silently. (perl -C, $YAML::Syck::ImplicitUnicode, etc.)

Marvin Humphrey
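Since the string is valid UTF-8 whose logical content is the two characters U+00C4 U+008D, replacing invalid sequences cannot help here. If the feed really is double-encoded (an assumption, not something established in this thread), one possible repair is to map those characters back to single bytes via Latin-1 and decode a second time. A minimal sketch, with illustrative variable names:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The raw bytes from the feed, written with \x escapes to keep the source ASCII.
my $raw = "<p>Tomas Laurinavi\xC3\x84\xC2\x8Dius</p>";

# First decode: valid UTF-8, but yields U+00C4 U+008D instead of one character.
my $chars = decode('UTF-8', $raw);

# Undo the assumed extra layer: turn the characters back into the bytes
# C4 8D via Latin-1, then decode those bytes as UTF-8 to get U+010D.
my $fixed = decode('UTF-8', encode('ISO-8859-1', $chars));

binmode STDOUT, ':encoding(UTF-8)';
print "$fixed\n";
```

This only makes sense when every character in the once-decoded string is below U+0100. Text that mixes correctly encoded and double-encoded characters would need a more careful approach.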