Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread David E. Wheeler
On Jun 18, 2010, at 12:05 AM, John Delacour wrote:

 In this case all talk of iso-8859-1 and cp1252 is a red herring.  I read 
 several Italian websites where this same problem is manifest in external 
 material such as ads.  The news page proper is encoded properly and declared 
 as utf-8, but I imagine the web designers have reckoned that the stuff they 
 receive from the advertisers is most likely to arrive as windows-1252, and 
 they convert it accordingly rather than bothering to verify the encoding.  As a 
 result, material that is actually received as utf-8 undergoes a superfluous 
 second encoding.
 
 Here's a way to get the file in question properly encoded:

Yep, that works for me, too. I guess XML::LibXML isn't using Encode in the same 
way to decode content, as it returns the string with the characters as 
\x{c4}\x{8d}.
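
For the archives, the fix amounts to undoing the extra encoding pass: decode the 
octets, turn the resulting characters back into raw bytes as Latin-1, and decode 
again. Roughly this (a sketch of the general technique, not necessarily John's 
exact code, and it assumes the input really was UTF-8 encoded twice):

   use Encode qw(decode encode);

   # $octets holds the doubly encoded bytes, e.g. \xC3\x84\xC2\x8D
   my $chars = decode('UTF-8', $octets);       # first decode: \x{c4}\x{8d}
   my $bytes = encode('ISO-8859-1', $chars);   # characters back to raw bytes
   my $fixed = decode('UTF-8', $bytes);        # second decode: \x{10d}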

Thanks for the help, everyone. I've got my code parsing all my feeds and 
emitting a valid UTF-8 feed of its own now.

Best,

David

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread David E. Wheeler
On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:

 So it may be valid UTF-8, but why does it come out looking like crap? That 
 is, LaurinaviÄius? I suppose there's an argument that 
 LaurinaviÄius is correct and valid, if ugly. Maybe?
 
 I am unsure if this is the explanation you are looking for but here goes:
 
 I think the original data contained the character \x{010d}. In utf-8, that 
 means that it should be represented as the bytes \x{c4} and \x{8d}. If those 
 bytes are not marked as in fact being a two-byte utf-8 encoding of a single 
 character, or if an application reading the data mistakenly thinks it is not 
 encoded (both common errors), somewhere along the transmission an application 
 may decide that it needs to re-encode the characters in utf-8. 
 
 So the original character \x{010d} is represented by the bytes \x{c4} and 
 \x{8d}; an application thinks those bytes are in fact characters and encodes 
 them again as \x{c3} + \x{84} and \x{c2} + \x{8d}, respectively. Which I 
 believe is your broken data.

I see. That makes sense. FYI, the original source is at:

  
http://pipes.yahoo.com/pipes/pipe.run?Size=Medium_id=f53b7bed8b88412fab9715a995629722_render=rssmax=50nsid=1025993%40N22

Look for Tomas in the output. If it doesn't show up, change max=50 to max=75 
or something.
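
Your byte arithmetic checks out, by the way. A couple of lines with Encode 
reproduce the broken sequence exactly (just a sanity-check sketch, nothing from 
my app):

   use Encode qw(encode);

   my $once  = encode('UTF-8', "\x{010d}");   # bytes C4 8D -- correct
   my $twice = encode('UTF-8', $once);        # bytes C3 84 C2 8D -- the broken data
   printf "%vX\n", $_ for $once, $twice;      # prints C4.8D and C3.84.C2.8D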

 I think the error comes from Perl's handling of utf-8 data and that this 
 handling has changed in subtle ways all the way since Perl 5.6. We have 
 supported utf-8 in our applications since Perl 5.6 and have experienced this 
 repeatedly. Any major upgrade of Perl, or indeed the much-needed upgrade of 
 DBD::ODBC that Martin Evans provided, has given us a lot of work sorting 
 out these troubles.

Maintaining the backwards compatibility from the pre-utf8 days must make it far 
more difficult than it otherwise would be.

 I wonder if your code would work fine in Perl 5.8? We are only at 5.10(.1) 
 but the upgrade from 5.8 to 5.10 also gave us some utf-8 trouble. If it works 
 fine in Perl 5.8 maybe the error is in an assumption somewhere in XML::LibXML?

In my application, I finally got XML::LibXML to choke on the invalid 
characters, and then found that the problem was that I was running 
Encode::ZapCP1252::zap_cp1252 against the string before passing it to 
XML::LibXML. Once I removed that, it stopped choking. So clearly zap_cp1252 was 
changing bytes it should not have. I now have it running fix_cp1252 *after* the 
parsing, when everything is already UTF-8. Now that I think about it, though, I 
should probably change it to operate on characters instead of bytes when it's 
handed a utf8 string. Will have to look into that.
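
Probably something like this (a rough sketch of the idea, with a hypothetical 
fix_gremlins() wrapper; it's not how Encode::ZapCP1252 works today):

   use Encode::ZapCP1252 ();

   # Run the byte-oriented fixup only on undecoded octet strings; a decoded
   # (UTF8-flagged) string would need the same substitutions expressed as
   # characters rather than bytes.
   sub fix_gremlins {
       my $text = shift;
       if ( utf8::is_utf8($text) ) {
           # character-oriented fixup would go here
       }
       else {
           Encode::ZapCP1252::fix_cp1252($text);   # modifies in place, on bytes
       }
       return $text;
   }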

In the meantime, I'll just accept that sometimes the characters are valid UTF-8 
and look like shit. Frankly, when I run the above feed through NetNewsWire, the 
offending byte sequence displays as Ä, just as it does in my app's output. So 
I blame Yahoo.

Thanks for the detailed explanation, Henning, much appreciated.

Best,

David

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 4:47 PM, Marvin Humphrey wrote:

 On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
 I think what I need is some code to strip non-utf8 characters from a string
 -- even if that string has the utf8 bit switched on. I thought that Encode
 would do that for me, but in this case apparently not. Anyone got an
 example?
 
 Try this:
 
    Encode::_utf8_off($string);                 # treat the scalar as raw octets again
    $string = Encode::decode('utf8', $string);  # malformed sequences become U+FFFD
 
 That will replace any byte sequences which are invalid UTF-8 with the Unicode
 replacement character.  

Yeah. Not working for me. See attached script. Devel::Peek says:

SV = PV(0x100801f18) at 0x10082f368
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1002015c0 "<p>Tomas Laurinavi\303\204\302\215ius</p>"\0 [UTF8 
"<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
  CUR = 29
  LEN = 32

So the UTF8 flag is enabled, and yet it has \303\204\302\215 in it. What is 
that crap?

Confused and frustrated,

David
#!/usr/local/bin/perl -w

use 5.12.0;
use Encode;
use Devel::Peek;

# The feed delivers the doubly encoded bytes \xC3\x84\xC2\x8D where a
# single \x{010d} should be; reproduce that here.
my $str = "<p>Tomas Laurinavi\xc3\x84\xc2\x8dius</p>";
my $utf8 = decode('UTF-8', $str);   # one decode still leaves \x{c4}\x{8d}
say $str;                           # the raw bytes
binmode STDOUT, ':utf8';            # emit decoded strings as UTF-8 from here on
say $utf8;

Dump($utf8);


Re: making utf8-clean CPAN distributions

2004-12-12 Thread David E. Wheeler
On Dec 12, 2004, at 10:06 PM, Darren Duncan wrote:
What I would like to do is create my CPAN module distributions such 
that all of the files in each distro (code, documentation, tests, and 
logs alike) are properly UTF-8 encoded, and to do it in such a way 
that neither modern Perl distributions nor the automated CPAN tools 
will break.
Short answer:

use utf8;

=pod

=encoding utf8

=cut
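
For instance, a minimal module layout (My::Module and everything in it is made 
up, just to show where the pieces go):

package My::Module;
use utf8;   # the source code itself is UTF-8

our $VERSION = '0.01';

# Non-ASCII literals are now safe in the code...
sub motto { return 'naïve café' }

1;

__END__

=encoding utf8

=head1 NAME

My::Module - ...and in the POD as well (æ, ø, å)

=cut
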
Regards,
David