HTML::Entities chokes on XML::Parser strings

2002-05-07 Thread John Siracusa

I ran into this problem during mod_perl development, and I'm posting it to
this list hoping that other mod_perl developers have dealt with the same
thing and have good solutions :)

I've found that strings collected while processing XML using XML::Parser do
not play nice with the HTML::Entities module.  Here's the sample program
illustrating the problem:

#!/usr/bin/perl -w

use strict;

use HTML::Entities;
use XML::Parser;

my $buffer;

my $p = XML::Parser->new(Handlers => { Char => \&xml_char });

my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
  chr(0xE9) . '</test>';

$p->parse($xml);

print encode_entities($buffer), "\n";

sub xml_char
{
  my($expat, $string) = @_;

  $buffer .= $string;
}

The output unfortunately looks like this:

&Atilde;&copy;

Which makes very little sense, since the correct entity for 0xE9 is:

&eacute;

My current work-around is to run the buffer through a (lossy!?) pack/unpack
cycle:

my $buffer2 = pack("C*", unpack("U*", $buffer));
print encode_entities($buffer2), "\n";

This works and prints:

&eacute;

I hope it is not lossy when using iso-8859-1 encoding, but I'm guessing it
will maul UTF-8 or UTF-16.  This seems like quite an evil hack.
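
The lossiness worry can be checked rather than guessed at. The sketch below is not from the thread: it assumes a Perl new enough to provide the built-in utf8::downgrade (5.8 or later), and narrow_or_undef is a hypothetical helper name. It narrows a string to single bytes only when every character fits, and reports failure otherwise instead of silently wrapping values.

```perl
use strict;
use warnings;

# Hypothetical helper: return a single-byte copy of $chars, or undef
# if any character is above U+00FF and so cannot fit in one byte.
sub narrow_or_undef {
    my ($chars) = @_;
    my $copy = $chars;
    # The true second argument (FAIL_OK) makes utf8::downgrade return
    # false instead of dying when the string cannot be narrowed.
    return utf8::downgrade($copy, 1) ? $copy : undef;
}

my $latin = "\x{E9}";     # e-acute, fits in ISO-8859-1
utf8::upgrade($latin);    # same character, wide internal form
my $wide  = "\x{263A}";   # WHITE SMILING FACE, outside ISO-8859-1

my $narrow_latin = narrow_or_undef($latin);
my $narrow_wide  = narrow_or_undef($wide);

print defined $narrow_latin ? "e-acute: fits\n" : "e-acute: would be lost\n";
print defined $narrow_wide  ? "smiley: fits\n"  : "smiley: would be lost\n";
```

With a check like this, the narrowing step is safe for iso-8859-1 documents, and input containing anything wider is detected rather than mangled.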

So, what is the Right Thing to do here?  Which module, if any, is at fault?
Is there some combination of Perl Unicode-related use statements that will
help me here?  Has anyone else run into this problem?

-John




Re: HTML::Entities chokes on XML::Parser strings

2002-05-07 Thread Paul Lindner

The output from your example looks like UTF-8 data (&Atilde; is a
commonly seen UTF-8 escape sequence).  XML::Parser converts all
incoming text into UTF-8.  You will need to convert it back to
iso-8859-1.

My favorite is Text::Iconv:

 use Text::Iconv;
 my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");

 my $buffer_latin1 = $converter->convert($buffer);


On Tue, May 07, 2002 at 10:51:10AM -0400, John Siracusa wrote:
> I ran into this problem during mod_perl development, and I'm posting it to
> this list hoping that other mod_perl developers have dealt with the same
> thing and have good solutions :)
>
> I've found that strings collected while processing XML using XML::Parser do
> not play nice with the HTML::Entities module.  Here's the sample program
> illustrating the problem:
>
> #!/usr/bin/perl -w
>
> use strict;
>
> use HTML::Entities;
> use XML::Parser;
>
> my $buffer;
>
> my $p = XML::Parser->new(Handlers => { Char => \&xml_char });
>
> my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
>   chr(0xE9) . '</test>';
>
> $p->parse($xml);
>
> print encode_entities($buffer), "\n";
>
> sub xml_char
> {
>   my($expat, $string) = @_;
>
>   $buffer .= $string;
> }
>
> The output unfortunately looks like this:
>
> &Atilde;&copy;
>
> Which makes very little sense, since the correct entity for 0xE9 is:
>
> &eacute;
>
> My current work-around is to run the buffer through a (lossy!?) pack/unpack
> cycle:
>
> my $buffer2 = pack("C*", unpack("U*", $buffer));
> print encode_entities($buffer2), "\n";
>
> This works and prints:
>
> &eacute;
>
> I hope it is not lossy when using iso-8859-1 encoding, but I'm guessing it
> will maul UTF-8 or UTF-16.  This seems like quite an evil hack.
>
> So, what is the Right Thing to do here?  Which module, if any, is at fault?
> Is there some combination of Perl Unicode-related use statements that will
> help me here?  Has anyone else run into this problem?
>
> -John

-- 
Paul Lindner[EMAIL PROTECTED]   | | | | |  |  |  |   |   |

mod_perl Developer's Cookbook   http://www.modperlcookbook.org/
 Human Rights Declaration   http://www.unhchr.ch/udhr/



Re: HTML::Entities chokes on XML::Parser strings

2002-05-07 Thread Rafael Garcia-Suarez

John Siracusa wrote:
> I ran into this problem during mod_perl development, and I'm posting it to
> this list hoping that other mod_perl developers have dealt with the same
> thing and have good solutions :)

I did ;-)

> I've found that strings collected while processing XML using XML::Parser do
> not play nice with the HTML::Entities module.  Here's the sample program
> illustrating the problem:
>
> #!/usr/bin/perl -w
>
> use strict;
>
> use HTML::Entities;
> use XML::Parser;
>
> my $buffer;
>
> my $p = XML::Parser->new(Handlers => { Char => \&xml_char });
>
> my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
>   chr(0xE9) . '</test>';
>
> $p->parse($xml);
>
> print encode_entities($buffer), "\n";
>
> sub xml_char
> {
>   my($expat, $string) = @_;
>
>   $buffer .= $string;
> }
>
> The output unfortunately looks like this:
>
> &Atilde;&copy;
>
> Which makes very little sense, since the correct entity for 0xE9 is:
>
> &eacute;

That's an XML::Parser issue.
XML::Parser gives UTF-8 to your Char handler, as specified in the manpage:
"Whatever the encoding of the string in the original document,
this is given to the handler in UTF-8."

The workaround I used is to write the handler like this:

sub xml_char
{
   my ($expat) = @_;
   $buffer .= $expat->original_string;
}

Reading the original string, no need to convert UTF-8 back to iso-8859-1.

> My current work-around is to run the buffer through a (lossy!?) pack/unpack
> cycle:
>
> my $buffer2 = pack("C*", unpack("U*", $buffer));
> print encode_entities($buffer2), "\n";
>
> This works and prints:
>
> &eacute;
>
> I hope it is not lossy when using iso-8859-1 encoding, but I'm guessing it
> will maul UTF-8 or UTF-16.  This seems like quite an evil hack.
>
> So, what is the Right Thing to do here?  Which module, if any, is at fault?
> Is there some combination of Perl Unicode-related use statements that will
> help me here?  Has anyone else run into this problem?
>
> -John



-- 
Rafael Garcia-Suarez




Re: HTML::Entities chokes on XML::Parser strings

2002-05-07 Thread John Siracusa

On 5/7/02 10:58 AM, Paul Lindner wrote:
> The output from your example looks like UTF-8 data (&Atilde; is a
> commonly seen UTF-8 escape sequence).  XML::Parser converts all
> incoming text into UTF-8.  You will need to convert it back to
> iso-8859-1.
>
> My favorite is Text::Iconv
>
>  use Text::Iconv;
>  my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
>
>  my $buffer_latin1 = $converter->convert($buffer);

So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?  What if
I have actual UTF-8 data?  Won't conversion to ISO8859-1 in service of
HTML::Entities result in data loss?

-John




Re: HTML::Entities chokes on XML::Parser strings

2002-05-07 Thread John Siracusa

On 5/7/02 11:06 AM, Rafael Garcia-Suarez wrote:
> The workaround I used is to write the handler like this:
>
> sub xml_char
> {
>    my ($expat) = @_;
>    $buffer .= $expat->original_string;
> }
>
> Reading the original string, no need to convert UTF-8 back to iso-8859-1.

Doh!  I dunno why I didn't think of that, since I've used that expat method
plenty of times before.  This seems safer than forcing a conversion from
UTF-8 to something else (although the other technique is nice to know too :)

-John




Re: HTML::Entities chokes on XML::Parser strings

2002-05-07 Thread Gisle Aas

John Siracusa writes:

> On 5/7/02 10:58 AM, Paul Lindner wrote:
>> The output from your example looks like UTF-8 data (&Atilde; is a
>> commonly seen UTF-8 escape sequence).  XML::Parser converts all
>> incoming text into UTF-8.  You will need to convert it back to
>> iso-8859-1.
>>
>> My favorite is Text::Iconv
>>
>>  use Text::Iconv;
>>  my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
>>
>>  my $buffer_latin1 = $converter->convert($buffer);
>
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?

Not true.  But the unicode support in perl-5.6.x has many bugs.  With
5.8 things will be better.  It is a bad idea for XML::Parser to give
out strings with the UTF8 flag set.
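
To make the "UTF8 flag" remark concrete, here is a small sketch of how the flag behaves on Perl 5.8 and later (an assumption: on 5.6.x, which the thread is about, results may differ, and that difference is the point being made). The variable names are illustrative only. Two strings holding the same single character can differ in internal representation while still comparing equal.

```perl
use strict;
use warnings;

my $bytes_form = "\x{E9}";     # e-acute, stored internally as one byte
my $wide_form  = "\x{E9}";
utf8::upgrade($wide_form);     # same character, UTF8 flag now on

# The flag describes the internal storage, not the characters themselves.
printf "flag: %d vs %d\n",
    (utf8::is_utf8($bytes_form) ? 1 : 0),
    (utf8::is_utf8($wide_form)  ? 1 : 0);
printf "length: %d vs %d\n", length($bytes_form), length($wide_form);
print  "equal\n" if $bytes_form eq $wide_form;
```

On a Perl with working Unicode support, a module such as HTML::Entities should see the same character in both cases, so the flag alone should not change which entity is produced.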

Regards,
Gisle



Re: HTML::Entities chokes on XML::Parser strings

2002-05-07 Thread John Siracusa

On 5/7/02 11:25 AM, Gisle Aas wrote:
> John Siracusa writes:
>> On 5/7/02 10:58 AM, Paul Lindner wrote:
>>> The output from your example looks like UTF-8 data (&Atilde; is a
>>> commonly seen UTF-8 escape sequence).  XML::Parser converts all
>>> incoming text into UTF-8.  You will need to convert it back to
>>> iso-8859-1.
>>>
>>> My favorite is Text::Iconv
>>>
>>>  use Text::Iconv;
>>>  my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
>>>
>>>  my $buffer_latin1 = $converter->convert($buffer);
>>
>> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?
>
> Not true.  But the unicode support in perl-5.6.x has many bugs.  With
> 5.8 things will be better.  It is a bad idea for XML::Parser to give
> out strings with the UTF8 flag set.

Well, I'll let you guys figure it out (all fixed in 5.8, right? :)  In the
meantime, I guess I'll stick with the workaround(s) posted... :)

-John




Re: HTML::Entities chokes on XML::Parser strings

2002-05-07 Thread Paul Lindner

On Tue, May 07, 2002 at 11:13:43AM -0400, John Siracusa wrote:
> On 5/7/02 10:58 AM, Paul Lindner wrote:
>> The output from your example looks like UTF-8 data (&Atilde; is a
>> commonly seen UTF-8 escape sequence).  XML::Parser converts all
>> incoming text into UTF-8.  You will need to convert it back to
>> iso-8859-1.
>>
>> My favorite is Text::Iconv
>>
>>  use Text::Iconv;
>>  my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
>>
>>  my $buffer_latin1 = $converter->convert($buffer);
>
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?  What if
> I have actual UTF-8 data?  Won't conversion to ISO8859-1 in service of
> HTML::Entities result in data loss?

Yes, HTML::Entities is based on ISO8859-1 input only.  BTW, for better
performance in mod_perl, consider using Apache::Util::escape_html():


 escape_html
   This routine replaces unsafe characters in $string
   with their entity representation.

my $esc = Apache::Util::escape_html($html);


Anyway, back to character entities..

Text::Iconv will fail if you try to convert unconvertible text, so at
least you can test for that condition (and adjust accordingly).

BasisTech sells a comprehensive unicode library called Rosette that
knows how to automatically convert to a target character set while
incorporating SGML entities for any character set.  Perhaps it's time
for an open implementation of that..
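
As a rough illustration of both ideas above (testing for unconvertible text, and falling back to entities for whatever does not fit the target character set), here is a sketch using the Encode module. Encode is an assumption on my part: it ships with Perl 5.8 and later and is not what the thread uses (that is Text::Iconv). The helper names converts_cleanly and latin1_with_charrefs are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if the string converts to ISO-8859-1
# without needing any fallback.
sub converts_cleanly {
    my ($chars) = @_;
    my $copy  = $chars;   # encode() may modify its input when CHECK is set
    my $bytes = eval { encode('iso-8859-1', $copy, Encode::FB_CROAK) };
    return defined $bytes;
}

# Hypothetical helper: ISO-8859-1 bytes, with anything that does not fit
# written as a numeric character reference.
sub latin1_with_charrefs {
    my ($chars) = @_;
    my $copy = $chars;
    return encode('iso-8859-1', $copy, Encode::FB_HTMLCREF);
}

my $text = "caf\x{E9} \x{263A}";   # e-acute fits, the smiley does not

print converts_cleanly($text) ? "clean\n" : "needs fallback\n";
print latin1_with_charrefs("\x{263A}"), "\n";   # prints &#9786;
```

The numeric references are not as readable as named entities, but they keep the output within the target character set without dropping data.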

Also see http://rf.net/~james/perli18n.html for a Perl i18n FAQ.



