Re: [xml] Problem with an old SGML

2005-11-08 Thread Liam R E Quin
On Tue, 2005-11-08 at 00:44 +0100, Kail wrote:
> I've a problem with an old SGML file.
> It has many format errors; the 2 most annoying are:
> 
> 1- It has more than one element as root child

SGML does not allow this.

>  //Start of file
>  <reuters> </reuters>
>  <reuters> </reuters>

As Daniel has suggested, this must be an external entity.
You are missing a main or driver file.

> etc.
> This file is 7 years old, but I need to parse it :(

Maybe use osx to convert it to XML -- it's part of
OpenJade I think these days.

> Is there a way to parse it without adding a node that wraps the
> file from start to end?
> 
> 2- There are also some character references like &#31; that obviously
> are not recognised and generate errors... Is there a way to avoid the
> errors and make the parser treat them as text, either by avoiding the
> call to xmlParseCharRef or by making this function not generate an
> error? (an option I haven't found ^_^)

There should be an SGML Declaration which says which characters
are allowed in that SGML document.  It's often considered to be
part of the SGML DTD.

Typically you give something like osx the SGML declaration, the
DTD file, and the document, all in one stream.

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org



___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Problem with an old SGML

2005-11-07 Thread Daniel Veillard
On Tue, Nov 08, 2005 at 12:44:19AM +0100, Kail wrote:
> I've a problem with an old SGML file.
> It has many format errors; the 2 most annoying are:
> 
> 1- It has more than one element as root child
>  //Start of file
>  <reuters> </reuters>
>  <reuters> </reuters>
> etc.
> This file is 7 years old, but I need to parse it :(
> Is there a way to parse it without adding a node that wraps the
> file from start to end?

  It is not XML.
  Hmm, it's not simple, but you can try to use an XML file
which declares that file as an external entity, then make one
reference to that entity within a top-level element in that file:

<!DOCTYPE doc [
<!ENTITY old_content SYSTEM "old.sgml">
]>
<doc>&old_content;</doc>

  then
  
  xmllint --noent new.xml > content.xml
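
The wrapper idea above (giving the sibling elements a single parent) can also be sketched with Python's standard library. This is only an illustration under assumptions: the helper name wrap_fragments and the sample content are made up here, and it only helps if the old file is otherwise well-formed, which old SGML often is not.

```python
import xml.etree.ElementTree as ET


def wrap_fragments(text, root="doc"):
    """Wrap a multi-root fragment in one synthetic root element.

    Hypothetical helper for illustration; not part of libxml2 or xmllint.
    """
    return "<{0}>{1}</{0}>".format(root, text)


# Two sibling top-level elements, like the file described above.
fragment = "<reuters>first</reuters>\n<reuters>second</reuters>\n"

# The bare fragment is rejected by an XML parser because it has more
# than one top-level element; the wrapped text has exactly one.
tree = ET.fromstring(wrap_fragments(fragment))
stories = tree.findall("reuters")
print(len(stories))
```

Unlike the external-entity approach, this reads the whole file into memory as a string, so it is probably only reasonable for files of modest size.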

> 2- There are also some character references like &#31; that obviously
> are not recognised and generate errors... Is there a way to avoid the
> errors and make the parser treat them as text, either by avoiding the
> call to xmlParseCharRef or by making this function not generate an
> error? (an option I haven't found ^_^)

  Again, this is not XML, so it can't be parsed as is. You could try
the --recover option of xmllint in addition to --noent, but you have
no guarantee of the result, and it will lose data. This is not XML and
can't be expected to be parsed as such. You could try the HTML parser
too, to see what it gives on it:

  xmllint --html old.sgml > content.html

and process from there.
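
Another possible route for the character references, sketched here in Python rather than with xmllint, is to preprocess the text and drop any numeric reference whose code point XML 1.0 does not allow (such as &#31;) before parsing. The helper name strip_illegal_charrefs is hypothetical, and dropping the reference is just one choice; like --recover, this loses data.

```python
import re

# Matches decimal (&#31;) and hexadecimal (&#x1F;) character references.
_CHARREF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _is_xml_char(cp):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_illegal_charrefs(text, replacement=""):
    """Replace character references that XML 1.0 forbids.

    Hypothetical helper for illustration; legal references are kept as-is.
    """
    def fix(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if _is_xml_char(cp) else replacement

    return _CHARREF.sub(fix, text)


cleaned = strip_illegal_charrefs("<reuters>a&#31;b&#65;c&#x1F;</reuters>")
print(cleaned)
```

Passing a visible replacement such as "?" instead of the empty string keeps a marker where each dropped reference used to be, which may make the lost data easier to find later.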

Daniel

-- 
Daniel Veillard  | Red Hat http://redhat.com/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/