RE: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Henning Michael Møller Just
Hello (loved your PostgreSQL presentation at the most recent OSCON, BTW)

Which editor do you use? When loading the script in Komodo IDE 5.2 the string 
looks broken. Running the script (ActivePerl 5.10.1 on Windows) only the second 
line is correct - the first (no surprise) and third are broken.

Loading the file in UltraEdit-32 13.20+3, set to not convert the script on 
loading, it becomes obvious that what should have been one character is 
represented by 4 bytes, \xC3 \x84 \xC2 \x8D, which modern editors would 
probably show as 2 characters and as broken.

It looks to me like the string is being displayed as a byte representation of 
the characters, if that makes sense. My English isn't perfect :-/ and what I am 
trying to say is that this is a problem that I am quite familiar with. It happens 
whenever the source and the reader do not agree on whether a string is encoded 
in UTF-8 or not.
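
A minimal sketch of how such a 4-byte sequence can arise, assuming the intended character is č (U+010D) and that one stage treated already-encoded UTF-8 bytes as Latin-1 and encoded them again. The helper name hex_bytes is made up for illustration, and whether Yahoo! Pipes does exactly this is only a guess.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+010D LATIN SMALL LETTER C WITH CARON, the character the feed should carry.
my $char = "\x{10d}";

# Correct encoding: one character becomes two UTF-8 bytes.
my $once = encode('UTF-8', $char);

# The assumed mistake: read those two bytes as Latin-1 characters and
# encode the result as UTF-8 a second time.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

print hex_bytes($once),  "\n";   # C4 8D
print hex_bytes($twice), "\n";   # C3 84 C2 8D
```

The second line of output matches the bytes seen in UltraEdit, which is what makes double encoding a plausible explanation.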

Apparently Encode fixes the incorrect string which is nice. The interesting 
thing is, where should this be fixed? If it's at Yahoo! Pipes you'll probably 
have to use Encode as a work-around for some time...
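
A rough sketch of what such an Encode work-around could look like, assuming the damage is exactly one extra layer of UTF-8 encoding. The function name undo_double_utf8 is hypothetical, and the fallback rule (keep the once-decoded text if the second decode fails) is a design choice for this example, not something taken from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: undo one extra layer of UTF-8 encoding.
# Takes raw bytes and returns a character string.
sub undo_double_utf8 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);

    # If any character is above U+00FF, the text cannot be a byte string
    # in disguise, so it was not double-encoded.
    return $chars if $chars =~ /[^\x00-\xFF]/;

    # Map each character back to a single byte, then try a strict decode.
    my $inner = encode('ISO-8859-1', $chars);
    my $fixed = eval { decode('UTF-8', $inner, Encode::FB_CROAK()) };

    # A failed strict decode means the text was fine after the first pass.
    return defined $fixed ? $fixed : $chars;
}

my $fixed = undo_double_utf8("Laurinavi\xC3\x84\xC2\x8Dius");
binmode STDOUT, ':encoding(UTF-8)';
print "$fixed\n";   # Laurinavičius
```

This heuristic can still misfire on text that happens to look double-encoded, so it is only reasonable for a feed known to be broken in this way.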


Best regards
Henning Michael Møller Just




-----Original Message-----
From: David E. Wheeler [mailto:da...@kineticode.com] 
Sent: Wednesday, June 16, 2010 7:56 AM
To: perl-unicode@perl.org
Subject: Variation In Decoding Between Encode and XML::LibXML

Fellow Perlers,

I'm parsing a lot of XML these days, and came upon a Yahoo! Pipes feed that 
appears to mangle an originating Flickr feed. But the curious thing is, when I 
pull the offending string out of the RSS and just stick it in a script, Encode 
knows how to decode it properly, while XML::LibXML (and my Unicode-aware 
editors) cannot.

The attached script demonstrates. $str has the bogus-looking character. 
Encode, however, seems to properly convert it to the č in Laurinavičius in 
the output. XML::LibXML, OTOH, outputs it as LaurinaviÄius -- that is, 
broken. (If things look truly borked in this email too, please look at the 
attached script.)

So my question is, what gives? Is this truly a broken representation of the 
character and Encode just figures that out and fixes it? Or is there something 
off with my editor and with XML::LibXML?

FWIW, the character looks correct in my editor when I load it from the original 
Flickr feed. It's only after processing by Yahoo! Pipes that it comes out 
looking mangled.

Any insights would be appreciated.

Best,

David




Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
> I think what I need is some code to strip non-utf8 characters from a string
> -- even if that string has the utf8 bit switched on. I thought that Encode
> would do that for me, but in this case apparently not. Anyone got an
> example?

Try this:

    Encode::_utf8_off($string);
    $string = Encode::decode('utf8', $string);

That will replace any byte sequences which are invalid UTF-8 with the Unicode
replacement character.  

If you want to guarantee that the flag is on first, do this:

    utf8::upgrade($string);
    Encode::_utf8_off($string);
    $string = Encode::decode('utf8', $string);

Devel::Peek's Dump() function will come in handy for checking results.

Cheers,

Marvin Humphrey



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 4:47 PM, Marvin Humphrey wrote:

> On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
> > I think what I need is some code to strip non-utf8 characters from a string
> > -- even if that string has the utf8 bit switched on. I thought that Encode
> > would do that for me, but in this case apparently not. Anyone got an
> > example?
>
> Try this:
>
>     Encode::_utf8_off($string);
>     $string = Encode::decode('utf8', $string);
>
> That will replace any byte sequences which are invalid UTF-8 with the Unicode
> replacement character.

Yeah. Not working for me. See attached script. Devel::Peek says:

SV = PV(0x100801f18) at 0x10082f368
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1002015c0 "<p>Tomas Laurinavi\303\204\302\215ius</p>"\0 [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
  CUR = 29
  LEN = 32

So the UTF8 flag is enabled, and yet it has \303\204\302\215 in it. What is 
that crap?

Confused and frustrated,

David
#!/usr/local/bin/perl -w

use 5.12.0;
use Encode;
use Devel::Peek;

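# The original literal held the raw bytes C3 84 C2 8D; they are written here
# as escapes so that they survive editors and mail clients.
my $str = "<p>Tomas Laurinavi\xC3\x84\xC2\x8Dius</p>";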
my $utf8 = decode('UTF-8', $str);
say $str;
binmode STDOUT, ':utf8';
say $utf8;

Dump($utf8);


Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:

> So the UTF8 flag is enabled, and yet it has \303\204\302\215 in it. What is 
> that crap?

That's octal notation, which I think Dump() uses for any byte greater than 127
and for control characters, so that it can output pure ASCII.  

That sequence is only four bytes: 
  
  mar...@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215"; Encode::_utf8_on($s); Dump $s'
  SV = PV(0x801038) at 0x80e880
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
    CUR = 4   <--- four bytes
    LEN = 8
  mar...@smokey:~ $ 

The logical content of the string follows in the second quote:

  [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]

That's valid UTF-8.
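
A small sketch of why the decode round trip replaces nothing, assuming the four bytes from the dump above: a strict decode accepts them and yields two ordinary code points, so there is no invalid sequence for Encode to turn into U+FFFD. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The four bytes Dump() printed in octal as \303\204\302\215.
my $bytes = "\xC3\x84\xC2\x8D";

# Strict decoding croaks on malformed input; here it succeeds.
# decode() may modify its argument when a CHECK value is given, so use a copy.
my $copy  = $bytes;
my $chars = decode('UTF-8', $copy, Encode::FB_CROAK());

printf "%d bytes -> %d characters\n", length($bytes), length($chars);
printf "U+%04X\n", ord($_) for split //, $chars;

# No replacement character was needed.
my $has_replacement = ($chars =~ /\x{FFFD}/) ? 1 : 0;
print "replacement character present: $has_replacement\n";
```

It reports two characters, U+00C4 and U+008D: well-formed UTF-8 that simply encodes the wrong text.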

> my $str = '<p>Tomas LaurinaviÄius</p>';

In source code, I try to stick to pure ASCII and use \x escapes -- like Dump()
does.

  my $str = "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>";

However, because those code points are both representable as Latin-1, Perl
will create a Latin-1 string.  If you want to force its internal encoding to
UTF-8, you need to do additional work.

  mar...@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s; utf8::upgrade($s); Dump $s'
  SV = PV(0x801038) at 0x80e870
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x2012e0 "\304"\0
    CUR = 1
    LEN = 4
  SV = PV(0x801038) at 0x80e870
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
    CUR = 2
    LEN = 3
  mar...@smokey:~ $ 
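
The same point can be checked without reading dumps. This is a minimal sketch using only the built-in utf8:: functions, and it assumes nothing beyond the one-character string from the example above: utf8::upgrade changes how the string is stored, not what it contains.

```perl
use strict;
use warnings;

my $latin1   = "\x{c4}";     # stored as the single byte C4
my $upgraded = $latin1;
utf8::upgrade($upgraded);    # now stored as the two bytes C3 84

# Normalise the results to 0/1 so that they print predictably.
my $flag_before = utf8::is_utf8($latin1)   ? 1 : 0;
my $flag_after  = utf8::is_utf8($upgraded) ? 1 : 0;
my $same        = ($latin1 eq $upgraded)   ? 1 : 0;

print "UTF8 flag before: $flag_before, after: $flag_after\n";
print "length in characters: ", length($latin1), " vs ", length($upgraded), "\n";
print "strings compare equal: $same\n";
```

Both strings have length 1 and compare equal; only the flag and the internal byte count differ.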

> Confused and frustrated,

IMO, to get UTF-8 right consistently in a large Perl system, you need to
understand the internals and you need Devel::Peek at hand.  Perl tries to hide
the details, but there are too many ways for it to fail silently.  (perl -C,
$YAML::Syck::ImplicitUnicode, etc.)

Marvin Humphrey