On Fri, Nov 17, 2006 at 02:54:56PM -0600, R. Steven Rainwater wrote:
> In the process of doing maintenance on the mod_virgule (advogato.org)
> codebase, I've run across what I think is a minor bug in the libxml2
> function, UTF8ToHtml(). I've checked the bug database and didn't see any
> reports on this. I thought it would be a good idea to post some
> background on how I'm using UTF8ToHtml() before filing a bug report in
> case I'm misunderstanding the intention of this function.
>
> Mod_virgule accepts HTML form submissions encoded as UTF-8 data. This
> data often includes HTML markup as well as various international
> characters. The data must be processed by older mod_virgule code
> that can handle only ASCII, not UTF-8 data. So the raw UTF-8 data is
> passed through UTF8ToHtml() to convert it to ASCII HTML with entity
> encoding of non-ASCII characters.
>
> This works great when the input is English or most languages that use a
> Latin character set. But when valid HTML markup encoded as UTF-8
> contains more exotic characters, such as Han ideographs, it causes
> UTF8ToHtml() to fail, returning an error code of -2. This is unexpected
> since the input was valid UTF-8.
>
> I examined the code of the UTF8ToHtml() function and discovered that it
> fails with error -2 because the input contains UTF-8 characters for
> which libxml2 does not know a named entity value (e.g. "&Eacute;").
> Since there are tens of thousands of possible UTF-8 characters and
> libxml2 only knows names for a couple of hundred, this seems to suggest
> UTF8ToHtml() will fail most of the time if the input includes non-Latin
> character sets.
>
> By making a trivial change to the code in UTF8ToHtml(), I was able to
> correct this behavior. When a named entity value cannot be found in the
> internal libxml2 entity table, a numeric entity value (e.g. "&#20833;")
> is used instead.
>
> Here's the original code where the problem lies:
>
>     /*
>      * Try to lookup a predefined HTML entity for it
>      */
>
>     ent = htmlEntityValueLookup(c);
>     if (ent == NULL) {
>         /* no chance for this in Ascii */
>         *outlen = out - outstart;
>         *inlen = processed - instart;
>         return(-2);
>     }
>
> And here's the same piece of my revised code that seems to have fixed
> the problem:
>
>     /*
>      * Try to lookup a predefined HTML entity for it
>      */
>
>     ent = htmlEntityValueLookup(c);
>     if (ent == NULL) {
>         snprintf(nbuf, sizeof(nbuf), "#%u", c);
>         cp = nbuf;
>     }
>
>
> I can file a bug report and attach a full patch for this if desired.
> Otherwise, maybe somebody can explain where I've gone wrong. Thanks!

No, that sounds just right! Using a character reference (technically
"&#20833;" is not an entity reference :-) is the right thing to do there.
A patch would be gratefully accepted; I will just need to check its effect
on the libxml2 and libxslt regression tests.
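
To make the fallback concrete, here is a minimal standalone sketch of the
idea, not the actual libxml2 patch. It does not call libxml2 at all:
decode_utf8() and utf8_to_ascii_refs() are hypothetical names invented for
this illustration, and unlike UTF8ToHtml() the sketch never emits named
entities, only decimal character references. The return codes are also
assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libxml2: decode one UTF-8 sequence
 * starting at s (n bytes available) into *cp. Returns the number of
 * bytes consumed, or 0 on malformed input. For brevity it does not
 * reject overlong encodings or surrogate code points.
 */
static size_t decode_utf8(const unsigned char *s, size_t n, unsigned int *cp)
{
    size_t len, i;
    unsigned int c;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (len > n) return 0;          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/*
 * Hypothetical helper: copy UTF-8 text to ASCII, replacing every
 * non-ASCII character with a decimal character reference ("&#N;").
 * ASCII bytes, including markup, pass through unchanged.
 * Returns 0 on success, -1 if out is too small, -2 on malformed UTF-8.
 */
static int utf8_to_ascii_refs(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *) in;
    size_t remaining, used = 0;

    assert(in != NULL && out != NULL);
    remaining = strlen(in);

    while (remaining > 0) {
        unsigned int c;
        char nbuf[16];              /* large enough for "&#2097151;" */
        size_t n = decode_utf8(p, remaining, &c);
        size_t len;

        if (n == 0) return -2;
        if (c < 0x80) {
            nbuf[0] = (char) c;
            nbuf[1] = '\0';
        } else {
            /* no named entity lookup here: always use a numeric reference */
            snprintf(nbuf, sizeof(nbuf), "&#%u;", c);
        }
        len = strlen(nbuf);
        if (used + len + 1 > outsize) return -1;   /* keep room for NUL */
        memcpy(out + used, nbuf, len);
        used += len;
        p += n;
        remaining -= n;
    }
    if (outsize == 0) return -1;
    out[used] = '\0';
    return 0;
}
```

Given the three bytes E5 85 A1 (the UTF-8 encoding of U+5161), the sketch
writes "&#20833;", the same reference discussed above. A real fix inside
UTF8ToHtml() would still try htmlEntityValueLookup() first and fall back to
the numeric form only when no named entity exists.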
Daniel
--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
[EMAIL PROTECTED] | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/