Hi,

I searched through the archive for encoding-related threads and found this:

********************
Hello,
 
I'm using htmlParseDoc to build a tree for an ISO-8859-7 (Greek) html file.
In most cases I didn't have a problem but in other cases I discovered that
when I try to later save the tree to file or memory it's not being dumped fully
or in the right encoding. I tried to track down the problem and why it only
happens sometimes.
 
I discovered that the problem probably happens in htmlParser.c in the
htmlCurrentChar function and only when the html content has some encoded
characters BEFORE the "Content-Type" meta tag (such as in the "title" tag).
What happens next is that since the parser doesn't know the right encoding yet,
it assumes that it's isolatin1 and tries to convert the rest of the encoded 
characters.
 
Is there any simple workaround when I don't know the correct encoding before
parsing the document? Something like trying to find the "Content-Type" meta tag
before parsing the rest of the document, or something similar to resolve this
issue?
 
As an example I supply two links from the same site which demonstrate the
problem; the site was selected only for demonstration purposes.
1) http://www.m-art.gr/ - the title has ISO-8859-7 encoded characters and the
   document doesn't get parsed properly.
2) http://www.m-art.gr/gr/bazart/index.asp - also an ISO-8859-7 document but
   with no encoded characters before the "content-type" declaration; this gets
   parsed properly.
 
Thank you very much
Liron
****************************

I have the same problem when parsing a UTF-8 HTML document where a meta
description containing 2-byte characters comes before the meta encoding tag.
The whole UTF-8 document is treated as ISO-8859-1 and converted to UTF-8 a
second time, so
Mapa článků
becomes something like
Mapa ÄlÃ¡nkÅ¯
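
For illustration, here is a minimal sketch of what I think happens: bytes that
are already UTF-8 are read as if each one were an ISO-8859-1 character and are
then encoded to UTF-8 again. The function name latin1_to_utf8 is my own and is
not part of libxml2; it only reproduces the effect in isolation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function).
 * Re-encodes `len` input bytes as if each byte were one ISO-8859-1
 * character. `out` must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the final NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            /* ASCII is copied through unchanged. */
            out[n++] = in[i];
        } else {
            /* U+0080..U+00FF become a two-byte UTF-8 sequence. */
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding it the two UTF-8 bytes of "č" (0xC4 0x8D) gives four bytes
(0xC3 0x84 0xC2 0x8D), which is why every non-ASCII character in the
document ends up as two garbled ones.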

When I parse the document a second time using htmlGetMetaEncoding, everything
is fine.

Is there any solution to detect the encoding before parsing the whole document?
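
One workaround I can imagine is a pre-scan of the raw buffer for the first
"charset=" before handing it to the parser. The sketch below uses only the C
standard library; the names find_nocase and sniff_meta_charset are made up for
this example. It is deliberately naive: it does not check that the match is
really inside a meta tag, and it does not handle spaces around the "=".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive search for `needle` in the first `len` bytes of `buf`. */
static const char *find_nocase(const char *buf, size_t len, const char *needle)
{
    size_t nlen = strlen(needle);

    if (nlen == 0 || nlen > len)
        return NULL;
    for (size_t i = 0; i + nlen <= len; i++) {
        size_t j = 0;
        while (j < nlen &&
               tolower((unsigned char)buf[i + j]) ==
               tolower((unsigned char)needle[j]))
            j++;
        if (j == nlen)
            return buf + i;
    }
    return NULL;
}

/* Hypothetical helper: copies the value after the first "charset=" into
 * `out` (at most outsz - 1 characters). Returns 1 if a name was found,
 * 0 otherwise; `out` is left untouched when no "charset=" is present. */
int sniff_meta_charset(const char *buf, size_t len, char *out, size_t outsz)
{
    const char *end, *p;
    size_t n = 0;

    assert(buf != NULL && out != NULL && outsz > 0);
    end = buf + len;
    p = find_nocase(buf, len, "charset=");
    if (p == NULL)
        return 0;
    p += strlen("charset=");
    /* Skip an optional opening quote, as in <meta charset="utf-8">. */
    if (p < end && (*p == '"' || *p == '\''))
        p++;
    /* Copy characters that commonly appear in encoding names. */
    while (p < end && n + 1 < outsz &&
           (isalnum((unsigned char)*p) || *p == '-' || *p == '_' ||
            *p == '.' || *p == ':'))
        out[n++] = *p++;
    out[n] = '\0';
    return n > 0;
}
```

If a name is found, it could presumably be passed as the encoding argument of
htmlReadMemory, so that the parser does not have to guess before it reaches
the meta tag. I have not tested that part.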

Regards
Hannes
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
http://mail.gnome.org/mailman/listinfo/xml
