Re: [xml] utf-8 encoding and xmlSAXParseMemory

Daniel Veillard Tue, 02 May 2006 14:09:11 -0700

On Tue, May 02, 2006 at 07:15:07PM +0200, A. Pagaltzis wrote:
> * Olivier Sirven <[EMAIL PROTECTED]> [2006-05-02 18:35]:
> > If you have a solution for correcting every invalid character
> > into a valid one without losing information I would be really
> > happy to read it :)
> 
> Well, not in the general case; the computer is not a mind reader.
> But depending on the assumptions you can make, you can do
> something like what I wrote about here:
> 
>     Repairing broken documents that mix UTF-8 and ISO-8859-1
>     http://plasmasturm.org/log/416/


  The problem is "how do you know it's ISO-8859-1 and not another variant?"
You can't guarantee that you won't generate false positives (i.e. corrupt
data), which is why the XML Working Group declared this had to be a fatal
error. The only sane approach (in these days of liability for software this
is especially true) is to force the error so that the input gets fixed, unless
you have some information which tells you what the encoding really is, in
which case you can still preprocess.
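
To make the point concrete, here is a minimal Python sketch (not libxml2
code). The helper name to_utf8 and its known_encoding parameter are
hypothetical, introduced only for illustration. It shows that one byte maps
to different characters in two ISO-8859 variants, and that the safe choices
are either to fail on invalid UTF-8 or to transcode when the real encoding is
known from some outside source.

```python
from typing import Optional


def to_utf8(raw: bytes, known_encoding: Optional[str] = None) -> bytes:
    """Return UTF-8 bytes, or raise UnicodeDecodeError on invalid input.

    known_encoding is a hypothetical parameter standing for "some
    information which tells you what the encoding really is".
    """
    if known_encoding is None:
        # No outside information: insist on strict UTF-8 and let the
        # error surface instead of guessing.
        raw.decode("utf-8", errors="strict")
        return raw
    # Outside information available: transcode before parsing.
    return raw.decode(known_encoding, errors="strict").encode("utf-8")


# The same byte is a different character in two ISO-8859 variants,
# so guessing the wrong one silently corrupts the data.
ambiguous = b"\xa4"
as_latin1 = ambiguous.decode("iso-8859-1")   # U+00A4 CURRENCY SIGN
as_latin9 = ambiguous.decode("iso-8859-15")  # U+20AC EURO SIGN

# With the encoding known, preprocessing is safe.
fixed = to_utf8(b"caf\xe9", known_encoding="iso-8859-1")

# Without it, the input is rejected rather than repaired.
try:
    to_utf8(b"caf\xe9")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

If the document carries an encoding declaration in its XML declaration, that
declaration would also have to agree with the transcoded bytes; the sketch
does not handle that.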

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
