Re: Variation In Decoding Between Encode and XML::LibXML
David E. Wheeler wrote on 16.06.2010 at 13:59 (-0700):

> On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:
>> On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:
>>> In order to print Unicode text strings (as opposed to octet strings)
>>> correctly to a terminal (UTF-8 or not), add the following line before
>>> the first output: binmode STDOUT, ':utf8'; But note that STDOUT is
>>> global.
>> Yes, I do this all the time. Surprisingly, I don't get warnings for
>> this script, even though it is outputting multibyte characters.
>
> This is key. If I set the binmode on STDOUT to :utf8, the bogus
> characters print out bogus. If I set it to :raw, they come out right
> after processing by both Encode and XML::LibXML (I'm assuming they're
> interpreted as latin-1).

Yes, or as raw, which is equivalent. Any octet is valid Latin-1.

> So my question is this: Why isn't Encode dying when it runs into these
> characters? They're not valid utf-8, AFAICT. Are they somehow valid
> utf8 (that is, valid in Perl's internal format)?

Why would they be? Assuming we're talking about the same thing here:
they're not characters, they're octets. (The Perl documentation seems to
make an effort to conceptually distinguish between *octets* and *bytes*,
but they map to the same thing.) I found it helpful to accept that the
notion of a "UTF-8 character" does not make sense: there are Unicode
characters, but UTF-8 is an encoding, and it deals with octets.

Here's your script with some modifications to illustrate how things work:

  use strict;
  use Encode;
  use XML::LibXML;

  # The script is written in UTF-8, but the utf8 pragma is not turned on.
  # So the literals in our script yield octet strings, not text strings.
  # (Note that it is probably much more convenient to go with the utf8
  # pragma if you write your source code in UTF-8.)
  my $octets = '<p>Tomas Laurinavičius</p>';
  my $txt    = decode_utf8($octets);
  my $txt2   = "<p>Tomas Laurinavi\x{010d}ius</p>";
  die if $txt2 ne $txt;     # they're equal
  die if $txt2 eq $octets;  # they're not equal

  # print raw UTF-8 octets; looks correct on a UTF-8 terminal
  print $octets, $/;

  # print text containing a wide character to a narrow-character filehandle
  print $txt, $/;  # triggers a warning: Wide character in print

  binmode STDOUT, ':utf8';  # set to utf8, accepting wide characters
  print $txt, $/;           # print text to terminal
  print $octets, $/;        # double encoding, č comes out as four bytes

  my $parser = XML::LibXML->new;

  # specify encoding for octet string
  my $doc = $parser->parse_html_string($octets, { encoding => 'utf-8' });
  print $doc->documentElement->toString, $/;

  # no need to specify encoding for text string
  my $doc2 = $parser->parse_html_string($txt);
  print $doc2->documentElement->toString, $/;

--
Michael Ludwig
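The octets-vs-text distinction Michael describes can be made concrete in a few lines. This is a minimal sketch (not from the original mail) using the example name from this thread: length() counts octets on an undecoded byte string but characters on the decoded text string.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# "Laurinavi" plus the two UTF-8 octets \xC4 \x8D for č (U+010D), plus "ius"
my $octets = "Laurinavi\xC4\x8Dius";   # 14 octets
my $text   = decode_utf8($octets);     # 13 characters

print length($octets), "\n";           # counts octets: 14
print length($text),   "\n";           # counts characters: 13
print encode_utf8($text) eq $octets ? "round-trips\n" : "broken\n";
```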
Re: Variation In Decoding Between Encode and XML::LibXML
At 00:27 +0100 18/6/10, I wrote:

> If I save the file and undo the second decoding I get the proper output

In this case all talk of iso-8859-1 and cp1252 is a red herring. I read
several Italian websites where this same problem is manifest in external
material such as ads. The news page proper is encoded properly and
declared as utf-8, but I imagine the web designers have reckoned that the
stuff they receive from the advertisers is most likely to arrive as
windows-1252, and convert accordingly rather than bother to verify the
encoding. As a result, material that is received as utf-8 undergoes a
superfluous encoding.

Here's a way to get the file in question properly encoded:

  #!/usr/bin/perl
  use strict;
  use LWP::Simple;
  use Encode;
  no warnings;  # avoid wide character warning

  my $tempdir  = '/tmp';
  my $tempfile = 'tempfile';
  my $f   = "$tempdir/$tempfile";
  my $uri = 'http://pipes.yahoo.com/pipes/pipe.run'
          . '?Size=Medium&_id=f53b7bed8b88412fab9715a995629722'
          . '&_render=rss&max=50&nsid=1025993%40N22';

  if (getstore($uri, $f)) {
      open F, $f or die $!;
      while (<F>) {
          my $encoding = find_encoding('utf-8');
          my $utf8 = $encoding->decode($_);
          print $utf8;
      }
      close F;
  }
  unlink $f;

JD
Re: Variation In Decoding Between Encode and XML::LibXML
On Jun 18, 2010, at 12:05 AM, John Delacour wrote:

> In this case all talk of iso-8859-1 and cp1252 is a red herring. I read
> several Italian websites where this same problem is manifest in external
> material such as ads. The news page proper is encoded properly and
> declared as utf-8, but I imagine the web designers have reckoned that
> the stuff they receive from the advertisers is most likely to arrive as
> windows-1252, and convert accordingly rather than bother to verify the
> encoding. As a result, material that is received as utf-8 undergoes a
> superfluous encoding. Here's a way to get the file in question properly
> encoded:

Yep, that works for me, too. I guess XML::LibXML isn't using Encode in
the same way to decode content, as it returns the string with the
characters as \x{c4}\x{8d}.

Thanks for the help, everyone. I've got my code parsing all my feeds and
emitting a valid UTF-8 feed of its own now.

Best,

David
Re: Variation In Decoding Between Encode and XML::LibXML
On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:

>> So it may be valid UTF-8, but why does it come out looking like crap?
>> That is, LaurinaviÃ≥Ÿius? I suppose there's an argument that
>> LaurinaviÄŸius is correct and valid, if ugly. Maybe?
>
> I am unsure if this is the explanation you are looking for, but here
> goes: I think the original data contained the character \x{010d}. In
> utf-8, that means it should be represented as the bytes \x{c4} and
> \x{8d}. If those bytes are not marked as in fact being a two-byte utf-8
> encoding of a single character, or if an application reading the data
> mistakenly thinks it is not encoded (both common errors), somewhere
> along the transmission an application may decide that it needs to
> re-encode the characters in utf-8. So the original character \x{010d}
> is represented by the bytes \x{c4} and \x{8d}, an application thinks
> those are in fact characters and encodes them again as \x{c3} + \x{84}
> and \x{c2} + \x{8d}, respectively. Which I believe is your broken data.

I see. That makes sense. FYI, the original source is at:

  http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22

Look for "Tomas" in the output. If it doesn't show up, change max=50 to
max=75 or something.

> I think the error comes from Perl's handling of utf-8 data, and that
> this handling has changed in subtle ways all the way since Perl 5.6. We
> have supported utf-8 in our applications since Perl 5.6 and have
> experienced this repeatedly. Any major upgrade of Perl, or indeed the
> much-needed upgrade of DBD::ODBC Martin Evans provided, has given us a
> lot of work trying to sort out these troubles. Maintaining backwards
> compatibility with the pre-utf8 days must make it far more difficult
> than it otherwise would be. I wonder if your code would work fine in
> Perl 5.8? We are only at 5.10(.1), but the upgrade from 5.8 to 5.10
> also gave us some utf-8 trouble.
> If it works fine in Perl 5.8, maybe the error is in an assumption
> somewhere in XML::LibXML?

In my application, I finally got XML::LibXML to choke on the invalid
characters, and then found that the problem was that I was running
Encode::CP1252::zap_cp1252 against the string before passing it to
XML::LibXML. Once I removed that, it stopped choking. So clearly
zap_cp1252 was changing bytes it should not have. I now have it running
fix_cp1252 *after* the parsing, when everything is already UTF-8.

Now that I think about it, though, I should probably change it so that it
searches on characters instead of bytes when working on a utf8 string.
Will have to look into that. In the meantime, I'll just accept that
sometimes the characters are valid UTF-8 and look like shit.

Frankly, when I run the above feed through NetNewsWire, the offending
byte sequence displays as "Ä", just as it does in my app's output. So I
blame Yahoo.

Thanks for the detailed explanation, Henning, much appreciated.

Best,

David
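The hazard David runs into here, byte-oriented cleanup applied to UTF-8 octets, is easy to demonstrate in general terms. This sketch (not from the original mail, and not the actual zap_cp1252 logic) deletes a single byte that happens to be the lead byte of a multi-byte sequence, which is enough to corrupt the character:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# UTF-8 octets for "café": the é is the two bytes \xC3 \xA9
my $octets = encode('UTF-8', "caf\x{e9}");

# A byte-oriented "cleanup" that strips \xC3 (Ã in Latin-1/CP1252)
# tears the two-byte sequence apart...
(my $mangled = $octets) =~ s/\xC3//g;

# ...so decoding now yields U+FFFD, the replacement character,
# where the é used to be.
my $decoded = decode('UTF-8', $mangled);
print $decoded eq "caf\x{fffd}" ? "corrupted as expected\n" : "?\n";
```

This is why such substitutions need to run either before decoding on octets that really are CP1252, or after decoding as character-level operations.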
Re: Variation In Decoding Between Encode and XML::LibXML
At 13:24 -0700 17/6/10, David E. Wheeler wrote:

> On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:
>> So the original character \x{010d} is represented by the bytes \x{c4}
>> and \x{8d}, an application thinks those are in fact characters and
>> encodes them again as \x{c3} + \x{84} and \x{c2} + \x{8d},
>> respectively. Which I believe is your broken data.
>
> I see. That makes sense. FYI, the original source is at:
>
>   http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22
>
> In the meantime, I'll just accept that sometimes the characters are
> valid UTF-8 and look like shit. Frankly, when I run the above feed
> through NetNewsWire, the offending byte sequence displays as "Ä", just
> as it does in my app's output. So I blame Yahoo.

Quite right. Now I see the file, it is clear that the encoding has been
done twice, each of the two bytes for the c-with-caron being again
encoded to produce four bytes. If I save the file and undo the second
decoding I get the proper output:

  #!/usr/bin/perl
  use strict;
  use Encode;
  no warnings;

  my $f = "$ENV{HOME}/desktop/pipe.run";
  open F, $f;
  while (<F>) {
      print decode('utf-8', $_);
  }

JD
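The double encoding John and Henning describe round-trips cleanly in a few lines of Encode. This sketch (my own, not from the thread) re-encodes the two UTF-8 octets of č as if they were Latin-1 characters, producing the four broken bytes from the feed, and then repairs them with an extra decode pass:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char = "\x{010d}";               # č
my $once = encode('UTF-8', $char);   # "\xC4\x8D" -- correct UTF-8

# A buggy application treats those two octets as Latin-1 characters
# and encodes them *again*:
my $twice = encode('UTF-8', $once);  # "\xC3\x84\xC2\x8D" -- the broken data

# The repair: decode once to undo the superfluous pass, then once more
# to get back to the original character.
my $fixed = decode('UTF-8', decode('UTF-8', $twice));
print $fixed eq $char ? "repaired\n" : "still broken\n";
```

The inner decode is exactly what John's script does; printing its result to a raw filehandle emits the correct single-encoded octets.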
RE: Variation In Decoding Between Encode and XML::LibXML
Hello (loved your PostgreSQL presentation at the most recent OSCON, BTW)

Which editor do you use? When loading the script in Komodo IDE 5.2 the
string looks broken. Running the script (ActivePerl 5.10.1 on Windows),
only the second line is correct -- the first (no surprise) and third are
broken. Loading the file in UltraEdit-32 13.20+3, set to not convert the
script on loading, it becomes obvious that what should have been one
character is represented by 4 bytes, \xC3 \x84 \xC2 \x8D, which modern
editors would probably show as 2 characters, and as broken.

It looks to me like the string is being displayed as a byte
representation of the characters, if that makes sense. My English isn't
perfect :-/ and what I am trying to say is that this is a problem I am
quite familiar with. It happens whenever the source and the reader do not
agree on whether a string is encoded in utf-8 or not. Apparently Encode
fixes the incorrect string, which is nice.

The interesting thing is, where should this be fixed? If it's at Yahoo!
Pipes, you'll probably have to use Encode as a work-around for some
time...

Best regards
Henning Michael Møller Just

-----Original Message-----
From: David E. Wheeler [mailto:da...@kineticode.com]
Sent: Wednesday, June 16, 2010 7:56 AM
To: perl-unicode@perl.org
Subject: Variation In Decoding Between Encode and XML::LibXML

Fellow Perlers,

I'm parsing a lot of XML these days, and came upon a Yahoo! Pipes feed
that appears to mangle an originating Flickr feed. But the curious thing
is, when I pull the offending string out of the RSS and just stick it in
a script, Encode knows how to decode it properly, while XML::LibXML (and
my Unicode-aware editors) cannot.

The attached script demonstrates. $str has the bogus-looking character.
Encode, however, seems to properly convert it to the č in Laurinavičius
in the output. XML::LibXML, OTOH, outputs it as LaurinaviÄius -- that
is, broken. (If things look truly borked in this email too, please look
at the attached script.)
So my question is, what gives? Is this truly a broken representation of
the character, and Encode just figures that out and fixes it? Or is there
something off with my editor and with XML::LibXML? FWIW, the character
looks correct in my editor when I load it from the original Flickr feed.
It's only after processing by Yahoo! Pipes that it comes out looking
mangled.

Any insights would be appreciated.

Best,

David
Re: Variation In Decoding Between Encode and XML::LibXML
On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:

> I think what I need is some code to strip non-utf8 characters from a
> string -- even if that string has the utf8 bit switched on. I thought
> that Encode would do that for me, but in this case apparently not.
> Anyone got an example?

Try this:

  Encode::_utf8_off($string);
  $string = Encode::decode('utf8', $string);

That will replace any byte sequences which are invalid UTF-8 with the
Unicode replacement character. If you want to guarantee that the flag is
on first, do this:

  utf8::upgrade($string);
  Encode::_utf8_off($string);
  $string = Encode::decode('utf8', $string);

Devel::Peek's Dump() function will come in handy for checking results.

Cheers,

Marvin Humphrey
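The replacement behaviour Marvin relies on can be seen on a plain byte string without any flag juggling (the flag dance above only matters when the UTF8 flag is already set). A minimal sketch, with a made-up sample string; \xFF can never occur in well-formed UTF-8:

```perl
use strict;
use warnings;
use Encode;

my $string = "ok \xFF ok";                 # \xFF is never valid UTF-8
$string = Encode::decode('utf8', $string); # lax decode, default CHECK

# The invalid byte has been replaced with U+FFFD,
# the Unicode replacement character.
print $string eq "ok \x{fffd} ok" ? "replaced\n" : "?\n";
```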
Re: Variation In Decoding Between Encode and XML::LibXML
On Jun 16, 2010, at 4:47 PM, Marvin Humphrey wrote:

> On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
>> I think what I need is some code to strip non-utf8 characters from a
>> string -- even if that string has the utf8 bit switched on. I thought
>> that Encode would do that for me, but in this case apparently not.
>> Anyone got an example?
>
> Try this:
>
>   Encode::_utf8_off($string);
>   $string = Encode::decode('utf8', $string);
>
> That will replace any byte sequences which are invalid UTF-8 with the
> Unicode replacement character.

Yeah. Not working for me. See attached script. Devel::Peek says:

  SV = PV(0x100801f18) at 0x10082f368
    REFCNT = 1
    FLAGS = (PADMY,POK,pPOK,UTF8)
    PV = 0x1002015c0 "<p>Tomas Laurinavi\303\204\302\215ius</p>"\0 [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
    CUR = 29
    LEN = 32

So the UTF8 flag is enabled, and yet it has \303\204\302\215 in it. What
is that crap?

Confused and frustrated,

David

  #!/usr/local/bin/perl -w

  use 5.12.0;
  use Encode;
  use Devel::Peek;

  my $str = '<p>Tomas LaurinaviÃÂius</p>';
  my $utf8 = decode('UTF-8', $str);
  say $str;
  binmode STDOUT, ':utf8';
  say $utf8;
  Dump($utf8);
Re: Variation In Decoding Between Encode and XML::LibXML
On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:

> So the UTF8 flag is enabled, and yet it has \303\204\302\215 in it.
> What is that crap?

That's octal notation, which I think Dump() uses for any byte greater
than 127 and for control characters, so that it can output pure ASCII.
That sequence is only four bytes:

  marvin@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215"; Encode::_utf8_on($s); Dump $s'
  SV = PV(0x801038) at 0x80e880
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
    CUR = 4    <--- four bytes
    LEN = 8
  marvin@smokey:~ $

The logical content of the string follows in the second quote:

  [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]

That's valid UTF-8.

> my $str = '<p>Tomas LaurinaviÃÂius</p>';

In source code, I try to stick to pure ASCII and use \x escapes -- like
Dump() does:

  my $str = "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>";

However, because those code points are both representable as Latin-1,
Perl will create a Latin-1 string. If you want to force its internal
encoding to UTF-8, you need to do additional work:

  marvin@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s; utf8::upgrade($s); Dump $s'
  SV = PV(0x801038) at 0x80e870
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x2012e0 "\304"\0
    CUR = 1
    LEN = 4
  SV = PV(0x801038) at 0x80e870
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
    CUR = 2
    LEN = 3
  marvin@smokey:~ $

> Confused and frustrated,

IMO, to get UTF-8 right consistently in a large Perl system, you need to
understand the internals and you need Devel::Peek at hand. Perl tries to
hide the details, but there are too many ways for it to fail silently.
(perl -C, $YAML::Syck::ImplicitUnicode, etc.)

Marvin Humphrey
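One footnote to Marvin's upgrade example: utf8::upgrade only changes the internal storage, never the logical string, which can be checked without Devel::Peek. A small sketch of my own illustrating that point:

```perl
use strict;
use warnings;

my $latin1   = "\x{c4}";   # Ä -- stored internally as a single Latin-1 byte
my $upgraded = $latin1;
utf8::upgrade($upgraded);  # now stored internally as two UTF-8 bytes

# The internal representation changed, but the string did not:
print $latin1 eq $upgraded ? "equal\n" : "?\n";
print length($latin1) == length($upgraded) ? "same length\n" : "?\n";

# Only the copy carries the UTF8 flag after the upgrade.
print utf8::is_utf8($upgraded) ? "flag set on copy\n" : "?\n";
```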