Subject: libhtml-format-perl: Problem with UTF8
Package: libhtml-format-perl
Version: 2.04-1
Severity: important
I tried to extract the text content of a UTF-8 encoded HTML page
with the following code:
<<
require HTML::TreeBuilder;
$tree = HTML::TreeBuilder->new->parse_file("test.html");
require HTML::FormatText;
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
print $formatter->format($tree);
>>
Many accented characters were corrupted in the output.
The following line is the cause:
l. 191: $text =~ tr/\xA0\xAD/ /d;
This bug was already reported upstream a year ago:
http://rt.cpan.org/Public/Bug/Display.html?id=9700
but the code has still not been fixed.
Consequently, this package cannot be used with multibyte character sets.
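To illustrate what I believe is happening, here is a minimal sketch that does not use HTML::FormatText at all. It only applies the same tr/// expression to a hand-made string. It assumes the formatter sees the raw, undecoded bytes of the UTF-8 file; the variable names are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00E0 (a with grave) is C3 A0 in UTF-8, U+00ED (i with acute) is C3 AD.
my $chars = "\x{E0} \x{ED}";
my $bytes = encode('UTF-8', $chars);   # what an undecoded UTF-8 file contains

# Same transliteration as the line quoted above, applied to raw bytes:
# the A0 byte becomes a space and the AD byte is deleted.
(my $mangled = $bytes) =~ tr/\xA0\xAD/ /d;

printf "bytes before: %s\n",
    join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
printf "bytes after:  %s\n",
    join ' ', map { sprintf('%02X', ord($_)) } split //, $mangled;

# Applied to decoded characters, only U+00A0 and U+00AD would match,
# so the accented letters are left alone.
(my $intact = $chars) =~ tr/\xA0\xAD/ /d;
print "decoded string unchanged: ", ($intact eq $chars ? "yes" : "no"), "\n";
```

On the byte string, the second byte of each two-byte sequence is replaced or removed, which leaves invalid UTF-8 behind. This suggests that decoding the input before formatting and encoding the result on output might work around the problem, but I have not verified that against the package itself.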
-- System Information:
Debian Release: testing/unstable
APT prefers unstable
APT policy: (500, 'unstable')
Architecture: i386 (i686)
Shell: /bin/sh linked to /bin/bash
Kernel: Linux 2.6.9-2-686
Locale: LANG=de_DE, LC_CTYPE=de_DE (charmap=ISO-8859-1)
Versions of packages libhtml-format-perl depends on:
ii libfont-afm-perl 1.19-1 Font::AFM - Interface to Adobe Fon
ii libhtml-tree-perl 3.19.01-2 represent and create HTML syntax t
ii perl 5.8.8-6 Larry Wall's Practical Extraction
libhtml-format-perl recommends no packages.
-- no debconf information