On Tue, Nov 08, 2005 at 12:44:19AM +0100, Kail wrote:
> I have a problem with an old SGML file.
> It has many format errors; the two most annoying are:
> 1- It has more than one element as root child
> //Start of file
> <reuters> </reuters>
> <reuters> </reuters>
> etc.
> This file is 7 years old, but I need to parse it :(
> Is there a way to parse it without adding a node from the start
> of the file to the end?

It is not XML.
Hmm, it's not simple, but you can try to use an XML file
which declares that file as an external entity, then make one
reference to that entity within a top-level element of the new file:

<!DOCTYPE doc [
<!ENTITY old_content SYSTEM "old.sgml">
]>
<doc>&old_content;</doc>

then

  xmllint --noent new.xml > content.xml
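
If the wrapper has to be produced for several files, it can be generated rather than typed by hand. The following is only a minimal sketch: the names build_wrapper and write_wrapper are hypothetical, and the root element name doc and the entity name old_content are simply taken from the example above.

```python
from pathlib import Path

# Wrapper document: declares the old file as an external entity and
# references it exactly once inside a single top-level element.
WRAPPER_TEMPLATE = """<!DOCTYPE doc [
<!ENTITY old_content SYSTEM "{system_id}">
]>
<doc>&old_content;</doc>
"""


def build_wrapper(system_id):
    """Return the text of a wrapper document that points at system_id."""
    # The system identifier is written between double quotes, so a name
    # containing a double quote cannot be embedded this way.
    if '"' in system_id:
        raise ValueError("system identifier must not contain a double quote")
    return WRAPPER_TEMPLATE.format(system_id=system_id)


def write_wrapper(old_path, wrapper_path):
    """Write the wrapper for old_path to wrapper_path (hypothetical helper)."""
    Path(wrapper_path).write_text(build_wrapper(old_path), encoding="utf-8")
```

Calling write_wrapper("old.sgml", "new.xml") should leave a new.xml that can be passed to xmllint --noent. A relative system identifier is resolved relative to the wrapper file, so keeping both files in the same directory is the simplest setup.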

> 2- There are also some chars like &#31; that obviously are not
> recognised and generate errors... Is there a way to avoid the errors
> and make the parser treat them as TEXT, either by avoiding the call
> to xmlParseCharRef or by making this function not generate an error?
> (an option I haven't found ^_^)

Again, this is not XML, so it can't be parsed as is. You could try
the --recover option of xmllint in addition to --noent, but you have
no guarantee of the result, and it will lose data. This is not XML and
can't be expected to be parsed as such. You could also try the HTML
parser to see what it gives on it:

  xmllint --html old.sgml > content.html

and process from there.
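
A different route, not one of the xmllint options above, is to pre-process a copy of the file so that references such as &#31; reach the parser as literal text. The sketch below is an assumption about what would be acceptable for this data: it escapes the ampersand of any numeric character reference whose code point falls outside the XML 1.0 Char production, and leaves valid references alone. The names is_xml_char and escape_bad_char_refs are hypothetical.

```python
import re

# Decimal (&#31;) and hexadecimal (&#x1f;) character references.
_CHAR_REF = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")


def is_xml_char(cp):
    """True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def escape_bad_char_refs(text):
    """Turn references to forbidden characters into plain text.

    "&#31;" becomes "&amp;#31;", which a parser reports as the five
    literal characters "&#31;" instead of raising an error.
    """
    def fix(match):
        dec, hexa = match.groups()
        cp = int(dec, 10) if dec is not None else int(hexa, 16)
        if is_xml_char(cp):
            return match.group(0)           # valid reference, keep as is
        return "&amp;" + match.group(0)[1:]  # escape only the ampersand

    return _CHAR_REF.sub(fix, text)
```

This only addresses the character references; any other format errors in the file still have to be handled separately, for example with --recover as described above.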
Daniel
--
Daniel Veillard | Red Hat http://redhat.com/
[EMAIL PROTECTED] | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
___
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml