Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-19 Thread Michael Ludwig
David E. Wheeler wrote on 16.06.2010 at 13:59 (-0700):
 On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:
  On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:

  In order to print Unicode text strings (as opposed to octet
  strings) correctly to a terminal (UTF-8 or not), add the following
  line before the first output:
  
  binmode STDOUT, ':utf8';
  
  But note that STDOUT is global.
  
  Yes, I do this all the time. Surprisingly, I don't get warnings for
  this script, even though it is outputting multibyte characters.
 
 This is key. If I set the binmode on STDOUT to :utf8, the bogus
 characters print out bogus. If I set it to :raw, they come out right
 after processing by both Encode and XML::LibXML (I'm assuming they're
 interpreted as latin-1).

Yes, or as raw, which is equivalent. Any octet is valid Latin-1.

 So my question is this: Why isn't Encode dying when it runs into these
 characters? They're not valid utf-8, AFAICT. Are they somehow valid
 utf8 (that is, valid in Perl's internal format)? Why would they be?

Assuming we're talking about the same thing here: They're not
characters, they're octets. (The Perl documentation seems to make
an effort to conceptually distinguish between *octets* and *bytes*,
but they map to the same thing.) I found it helpful to accept that
the notion of a UTF-8 character does not make sense: there are
Unicode characters, but UTF-8 is an encoding, and it deals with
octets.

Here's your script with some modifications to illustrate how things
work:

  \,,,/
  (o o)
--oOOo-(_)-oOOo--
use strict;
use Encode;
use XML::LibXML;
# The script is written in UTF-8, but the utf8 pragma is not turned on.
# So the literals in our script yield octet strings, not text strings.
# (Note that it is probably much more convenient to go with the utf8
# pragma if you write your source code in UTF-8.)
my $octets = '<p>Tomas Laurinavičius</p>';
my $txt    = decode_utf8( $octets );
my $txt2   = "<p>Tomas Laurinavi\x{010d}ius</p>";

die if $txt2 ne $txt;# they're equal
die if $txt2 eq $octets; # they're not equal

# print raw UTF-8 octets; looks correct on UTF-8 terminal
print $octets, $/;
# print text containing wide character to narrow character filehandle
print $txt, $/; # triggers a warning: Wide character in print
binmode STDOUT, ':utf8'; # set to utf8, accepting wide characters
print $txt, $/; # print text to terminal
print $octets, $/; # double encoding, č as four bytes

my $parser = XML::LibXML->new;
# specify encoding for octet string
my $doc = $parser->parse_html_string($octets, {encoding => 'utf-8'});
print $doc->documentElement->toString, $/;
# no need to specify encoding for text string
my $doc2 = $parser->parse_html_string($txt);
print $doc2->documentElement->toString, $/;
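
As an aside, here is a minimal sketch (not part of the original mail) of the
variant the comments above allude to: with the utf8 pragma switched on, the
source literals are text strings from the start and no explicit decode_utf8()
is needed.

use strict;
use utf8;      # source is UTF-8; string literals become text strings

my $txt  = '<p>Tomas Laurinavičius</p>';         # already a text string
my $txt2 = "<p>Tomas Laurinavi\x{010d}ius</p>";
die if $txt ne $txt2;        # equal: same characters either way

binmode STDOUT, ':utf8';     # still needed to print wide characters
print $txt, $/;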
-- 
Michael Ludwig


Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread John Delacour

At 00:27 +0100 18/6/10, I wrote:


If I save the file and undo the second encoding, I get the proper output



In this case all talk of iso-8859-1 and cp1252 is a red herring.  I 
read several Italian websites where this same problem is manifest in 
external material such as ads.  The news page proper is encoded 
properly and declared as utf-8 but I imagine the web designers have 
reckoned that the stuff they receive from the advertisers is most 
likely to be received as windows-1252 and convert accordingly rather 
than bother to verify the encoding.  As a result material that is 
received as utf-8 will undergo a superfluous encoding.


Here's a way to get the file in question properly encoded:


#!/usr/bin/perl
use strict;
use LWP::Simple;
use Encode;
no warnings; # avoid wide character warning
my $tempdir = "/tmp";
my $tempfile = "tempfile";
my $f = "$tempdir/$tempfile";
my $uri = "http://pipes.yahoo.com/pipes/pipe.run" .
  "?Size=Medium&_id=f53b7bed8b88412fab9715a995629722" .
  "&_render=rss&max=50&nsid=1025993%40N22";
if (getstore($uri, $f)){
  open F, $f or die $!;
  while (<F>){
    my $encoding = find_encoding("utf-8");
    my $utf8 = $encoding->decode($_);
    print $utf8;
  }
  close F;
}
unlink $f;
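
As a side note, here is a small sketch (with the offending bytes hard-coded
rather than taken from the feed) of why a single decode undoes the second
encoding: the doubly encoded octets decode to the characters U+00C4 and
U+008D, and printing those to a byte-oriented STDOUT writes them back out as
the single octets C4 8D, which is exactly the UTF-8 encoding of c-with-caron.

use strict;
use warnings;
use Encode;

my $twice = "\xC3\x84\xC2\x8D";        # doubly encoded c-with-caron
my $text  = decode('utf-8', $twice);   # characters \x{c4} \x{8d}
print $text, "\n";                     # emits octets C4 8D: shows as č on a UTF-8 terminal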

JD



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread David E. Wheeler
On Jun 18, 2010, at 12:05 AM, John Delacour wrote:

 In this case all talk of iso-8859-1 and cp1252 is a red herring.  I read 
 several Italian websites where this same problem is manifest in external 
 material such as ads.  The news page proper is encoded properly and declared 
 as utf-8 but I imagine the web designers have reckoned that the stuff they 
 receive from the advertisers is most likely to be received as windows-1252 
 and convert accordingly rather than bother to verify the encoding.  As a 
 result material that is received as utf-8 will undergo a superfluous encoding.
 
 Here's a way to get the file in question properly encoded:

Yep, that works for me, too. I guess XML::LibXML isn't using Encode in the same 
way to decode content, as it returns the string with the characters as 
\x{c4}\x{8d}.

Thanks for the help, everyone. I've got my code parsing all my feeds and 
emitting a valid UTF-8 feed of its own now.

Best,

David

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread David E. Wheeler
On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:

 So it may be valid UTF-8, but why does it come out looking like crap? That 
 is, LaurinaviÄius? I suppose there's an argument that 
 LaurinaviÄius is correct and valid, if ugly. Maybe?
 
 I am unsure if this is the explanation you are looking for but here goes:
 
 I think the original data contained the character \x{010d}. In utf-8, that 
 means that it should be represented as the bytes \x{c4} and \x{8d}. If those 
 bytes are not marked as in fact being a two-byte utf-8 encoding of a single 
 character, or if an application reading the data mistakenly thinks it is not 
 encoded (both common errors), somewhere along the transmission an application 
 may decide that it needs to re-encode the characters in utf-8. 
 
 So the original character \x{010d} is represented by the bytes \x{c4} and 
 \x{8d}, an application thinks those are in fact characters and encodes them 
 again as \x{c3} + \x{84} and \x{c2} + \x{8d}, respectively. Which I believe 
 is your broken data.
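
A quick sketch (not from the original message) reproduces this byte walk with
Encode: encoding once gives C4 8D, mistaking those octets for Latin-1
characters and encoding again gives C3 84 C2 8D, and unwinding the superfluous
pass recovers the original character.

use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\x{010d}";                              # c-with-caron
my $once  = encode('UTF-8', $char);                  # octets C4 8D
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));
printf "once:  %vX\n", $once;                        # C4.8D
printf "twice: %vX\n", $twice;                       # C3.84.C2.8D

# undo the superfluous pass, then decode for real
my $undone = encode('ISO-8859-1', decode('UTF-8', $twice));
print "recovered U+010D\n" if decode('UTF-8', $undone) eq $char;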

I see. That makes sense. FYI, the original source is at:

  
http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22

Look for Tomas in the output. If it doesn't show up, change max=50 to max=75 
or something.

 I think the error comes from Perl's handling of utf-8 data and that this 
 handling has changed in subtle ways all the way since Perl 5.6. We have 
 supported utf-8 in our applications since Perl 5.6 and have experienced this 
 repeatedly. Any major upgrade of Perl, or indeed the much-needed upgrade of 
 DBD::ODBC that Martin Evans provided, has given us a lot of work trying to sort 
 out these troubles.

Maintaining the backwards compatibility from the pre-utf8 days must make it far 
more difficult than it otherwise would be.

 I wonder if your code would work fine in Perl 5.8? We are only at 5.10(.1) 
 but the upgrade from 5.8 to 5.10 also gave us some utf-8 trouble. If it works 
 fine in Perl 5.8 maybe the error is in an assumption somewhere in XML::LibXML?

In my application, I finally got XML::LibXML to choke on the invalid 
characters, and then found that the problem was that I was running 
Encode::ZapCP1252::zap_cp1252 against the string before passing it to XML::LibXML. 
Once I removed that, it stopped choking. So clearly zap_cp1252 was changing 
bytes it should not have. I now have it running fix_cp1252 *after* the parsing, 
when everything is already UTF-8. Now that I think about it, though, I should 
probably change it so that it searches on characters instead of bytes when 
working on a utf8 string. Will have to look into that.

In the meantime, I'll just accept that sometimes the characters are valid UTF-8 
and look like shit. Frankly, when I run the above feed through NetNewsWire, the 
offending byte sequence displays as Ä, just as it does in my app's output. So 
I blame Yahoo.

Thanks for the detailed explanation, Henning, much appreciated.

Best,

David

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread John Delacour

At 13:24 -0700 17/6/10, David E. Wheeler wrote:

On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:


 So the original character \x{010d} is represented by the bytes 
\x{c4} and \x{8d}, an application thinks those are in fact 
characters and encodes them again as \x{c3} + \x{84} and \x{c2} + 
\x{8d}, respectively. Which I believe is your broken data.


I see. That makes sense. FYI, the original source is at:


http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22




In the meantime, I'll just accept that sometimes the characters are 
valid UTF-8 and look like shit. Frankly, when I run the above feed 
through NetNewsWire, the offending byte sequence displays as Ä, 
just as it does in my app's output. So I blame Yahoo.



Quite right.  Now that I see the file, it is clear that the encoding has 
been done twice, each of the two bytes for the c-with-caron being 
encoded again to produce four bytes.


If I save the file and undo the second encoding, I get the proper output


#!/usr/bin/perl
use strict;
use Encode;
no warnings;
my $f = "$ENV{HOME}/desktop/pipe.run";
open F, $f;
while (<F>){
    print decode("utf-8", $_);
}



JD



RE: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Henning Michael Møller Just
Hello (loved your PostgreSQL presentation at the most recent OSCON, BTW)

Which editor do you use? When loading the script in Komodo IDE 5.2, the string 
looks broken. Running the script (ActivePerl 5.10.1 on Windows), only the second 
line of output is correct - the first (no surprise) and the third are broken.

Loading the file in UltraEdit-32 13.20+3, set to not convert the script on 
loading, it becomes obvious that what should have been one character is 
represented by 4 bytes, \xC3 \x84 \xC2 \x8D, which modern editors would 
probably show as 2 characters and as broken.

It looks to me like the string is being displayed as a byte representation of 
the characters, if that makes sense. My English isn't perfect :-/ and what I am 
trying to say is that this is a problem that I am quite familiar with. It happens 
whenever the source and the reader do not agree on whether a string is encoded 
in utf-8 or not.

Apparently Encode fixes the incorrect string, which is nice. The interesting 
thing is, where should this be fixed? If it's at Yahoo! Pipes you'll probably 
have to use Encode as a work-around for some time...


Best regards
Henning Michael Møller Just




-Original Message-
From: David E. Wheeler [mailto:da...@kineticode.com] 
Sent: Wednesday, June 16, 2010 7:56 AM
To: perl-unicode@perl.org
Subject: Variation In Decoding Between Encode and XML::LibXML

Fellow Perlers,

I'm parsing a lot of XML these days, and came upon a Yahoo! Pipes feed that 
appears to mangle an originating Flickr feed. But the curious thing is, when I 
pull the offending string out of the RSS and just stick it in a script, Encode 
knows how to decode it properly, while XML::LibXML (and my Unicode-aware 
editors) cannot.

The attached script demonstrates. $str has the bogus-looking character. 
Encode, however, seems to properly convert it to the č in Laurinavičius in 
the output. XML::LibXML, OTOH, outputs it as LaurinaviÄius -- that is, 
broken. (If things look truly borked in this email too, please look at the 
attached script.)

So my question is, what gives? Is this truly a broken representation of the 
character and Encode just figures that out and fixes it? Or is there something 
off with my editor and with XML::LibXML?

FWIW, the character looks correct in my editor when I load it from the original 
Flickr feed. It's only after processing by Yahoo! Pipes that it comes out 
looking mangled.

Any insights would be appreciated.

Best,

David




Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
 I think what I need is some code to strip non-utf8 characters from a string
 -- even if that string has the utf8 bit switched on. I thought that Encode
 would do that for me, but in this case apparently not. Anyone got an
 example?

Try this:

Encode::_utf8_off($string);
$string = Encode::decode('utf8', $string);

That will replace any byte sequences which are invalid UTF-8 with the Unicode
replacement character.  

If you want to guarantee that the flag is on first, do this:

utf8::upgrade($string);
Encode::_utf8_off($string);
$string = Encode::decode('utf8', $string);

Devel::Peek's Dump() function will come in handy for checking results.
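
For example, a minimal self-contained check of the recipe (the sample string
is made up): the lone \xC4 below is not valid UTF-8, so decode()'s default
CHECK behaviour substitutes U+FFFD for it.

use strict;
use warnings;
use Encode;
use Devel::Peek;

my $string = "ok \xC4 broken";              # \xC4 with no continuation byte
Encode::_utf8_off($string);                 # treat the contents as octets
$string = Encode::decode('utf8', $string);  # invalid bytes become U+FFFD
Dump($string);                              # [UTF8 "ok \x{fffd} broken"]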

Cheers,

Marvin Humphrey



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 4:47 PM, Marvin Humphrey wrote:

 On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
 I think what I need is some code to strip non-utf8 characters from a string
 -- even if that string has the utf8 bit switched on. I thought that Encode
 would do that for me, but in this case apparently not. Anyone got an
 example?
 
 Try this:
 
Encode::_utf8_off($string);
$string = Encode::decode('utf8', $string);
 
 That will replace any byte sequences which are invalid UTF-8 with the Unicode
 replacement character.  

Yeah. Not working for me. See attached script. Devel::Peek says:

SV = PV(0x100801f18) at 0x10082f368
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1002015c0 "<p>Tomas Laurinavi\303\204\302\215ius</p>"\0 [UTF8 
"<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
  CUR = 29
  LEN = 32

So the UTF8 flag is enabled, and yet it has \303\204\302\215 in it. What is 
that crap?

Confused and frustrated,

David
#!/usr/local/bin/perl -w

use 5.12.0;
use Encode;
use Devel::Peek;

my $str = '<p>Tomas Laurinavičius</p>';
my $utf8 = decode('UTF-8', $str);
say $str;
binmode STDOUT, ':utf8';
say $utf8;

Dump($utf8);


Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:

 So the UTF8 flag is enabled, and yet it has \303\204\302\215 in it. What is 
 that crap?

That's octal notation, which I think Dump() uses for any byte greater than 127
and for control characters, so that it can output pure ASCII.  

That sequence is only four bytes: 
  
  mar...@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215"; 
Encode::_utf8_on($s); Dump $s'
  SV = PV(0x801038) at 0x80e880
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
    PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
    CUR = 4   <--- four bytes
LEN = 8
  mar...@smokey:~ $ 

The logical content of the string follows in the second quote:

  [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]

That's valid UTF-8.

 my $str = '<p>Tomas Laurinaviius</p>';

In source code, I try to stick to pure ASCII and use \x escapes -- like Dump()
does.

  my $str = "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>";

However, because those code points are both representable as Latin-1, Perl
will create a Latin-1 string.  If you want to force its internal encoding to
UTF-8, you need to do additional work.

  mar...@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s; 
utf8::upgrade($s); Dump $s'
  SV = PV(0x801038) at 0x80e870
REFCNT = 1
FLAGS = (POK,pPOK)
    PV = 0x2012e0 "\304"\0
CUR = 1
LEN = 4
  SV = PV(0x801038) at 0x80e870
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
    PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
CUR = 2
LEN = 3
  mar...@smokey:~ $ 
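
A small follow-on check (not in the original message): upgrading changes only
the internal representation, not the logical string, so eq and length are
unaffected.

use strict;
my $s    = "\x{c4}";       # one character, stored as Latin-1 internally
my $copy = $s;
utf8::upgrade($s);         # now stored as UTF-8 internally
print "same logical string\n" if $s eq $copy;   # eq compares characters
print length($s), "\n";                         # still 1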

 Confused and frustrated,

IMO, to get UTF-8 right consistently in a large Perl system, you need to
understand the internals and you need Devel::Peek at hand.  Perl tries to hide
the details, but there are too many ways for it to fail silently.  (perl -C,
$YAML::Syck::ImplicitUnicode, etc.)

Marvin Humphrey