Subject: libhtml-format-perl: Problem with UTF8
Package: libhtml-format-perl
Version: 2.04-1
Severity: important

I tried to extract the text content of a UTF-8 encoded HTML page
with the following code:

<<
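use strict;
use warnings;

require HTML::TreeBuilder;
# Parse the UTF-8 encoded test page into a tree.
my $tree = HTML::TreeBuilder->new->parse_file("test.html");

require HTML::FormatText;
# Render the tree as plain text, wrapped at column 50.
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
print $formatter->format($tree);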
>>

Many accented characters were corrupted in the resulting text.

The following line is the cause:
l. 191:   $text =~ tr/\xA0\xAD/ /d;
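
To illustrate why this line is harmful, here is a minimal sketch that applies the same transliteration to raw UTF-8 bytes. It assumes the text reaches the formatter as undecoded bytes (the report does not say so explicitly); the sample characters and the hex_of helper are only for illustration and are not part of the package. In UTF-8, "à" (U+00E0) is encoded as C3 A0 and "í" (U+00ED) as C3 AD, so the second byte of each character is hit by the tr:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_of {
    my ($s) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $s);
}

# U+00E0 (a with grave) and U+00ED (i with acute), encoded as UTF-8 bytes.
my $bytes = encode('UTF-8', "\x{E0}\x{ED}");

# Same transliteration as the reported line: 0xA0 becomes a space,
# 0xAD is deleted. On raw bytes this hits UTF-8 continuation bytes.
(my $mangled = $bytes) =~ tr/\xA0\xAD/ /d;

print "before: ", hex_of($bytes),   "\n";   # before: C3 A0 C3 AD
print "after:  ", hex_of($mangled), "\n";   # after:  C3 20 C3
```

Both characters end up as a lone C3 lead byte, which is no longer valid UTF-8.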

This bug was already reported upstream a year ago:
http://rt.cpan.org/Public/Bug/Display.html?id=9700

But the code has still not been fixed.

Consequently, this package cannot be used with multibyte encodings such as UTF-8.

-- System Information:
Debian Release: testing/unstable
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.9-2-686
Locale: LANG=de_DE, LC_CTYPE=de_DE (charmap=ISO-8859-1)

Versions of packages libhtml-format-perl depends on:
ii  libfont-afm-perl              1.19-1     Font::AFM - Interface to Adobe Fon
ii  libhtml-tree-perl             3.19.01-2  represent and create HTML syntax t
ii  perl                          5.8.8-6    Larry Wall's Practical Extraction

libhtml-format-perl recommends no packages.

-- no debconf information
