I thought I might draw attention to my use of Xerces1 under this topic.
I wrote an RDF parser with a coroutine architecture.
The principal goals are the ability to change the RDF grammar easily, and
the ability to understand the RDF grammar. Efficiency is a non-objective.
Basic design
XML-doc
==> Xerces
==> SAX
==> SAX events
==> keyword recogniser
==> "infoset events"
==> JavaCC RDF parser
Up to the SAX events this is normal.
The SAX events are mapped to a finer-grained event set in the following
fashion:
endElement ==> E_END
characters ==> CD_STRING
startElement ==> a token for the start tag, then
a pair of tokens for each attribute-value pair.
The start-tag token and the attribute tokens are both subject to "keyword
recognition", i.e. if the tag or attribute is particularly significant in
the RDF grammar (e.g. rdf:RDF) then a special token is created; otherwise
a generic E_OTHER is output.
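
As a rough illustration of the mapping and keyword recognition described
above, here is a minimal sketch using the JDK's built-in SAX parser. It is
not the ARP code: the class name, the keyword tables and the token names
E_RDF, E_DESCRIPTION, A_ABOUT, A_RESOURCE, A_OTHER and AV_STRING are all
made up for the example; only E_END, CD_STRING and E_OTHER come from the
description above. Tokens are plain strings here to keep it short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class TokenSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Hypothetical keyword tables: local names in the RDF namespace that
    // get their own token. A real grammar would list many more.
    static final Map<String, String> ELEMENT_KEYWORDS =
        Map.of("RDF", "E_RDF", "Description", "E_DESCRIPTION");
    static final Map<String, String> ATTRIBUTE_KEYWORDS =
        Map.of("about", "A_ABOUT", "resource", "A_RESOURCE");

    // Keyword recognition: a special token for significant names,
    // otherwise the generic fallback token.
    static String recognise(String uri, String localName,
                            Map<String, String> keywords, String fallback) {
        if (RDF_NS.equals(uri)) {
            String token = keywords.get(localName);
            if (token != null) {
                return token;
            }
        }
        return fallback;
    }

    // Maps each SAX event to one or more finer-grained tokens.
    static List<String> tokenize(String xml) throws Exception {
        List<String> tokens = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                tokens.add(recognise(uri, local, ELEMENT_KEYWORDS, "E_OTHER"));
                // One pair of tokens per attribute: its name, then its value.
                for (int i = 0; i < atts.getLength(); i++) {
                    tokens.add(recognise(atts.getURI(i), atts.getLocalName(i),
                                         ATTRIBUTE_KEYWORDS, "A_OTHER"));
                    tokens.add("AV_STRING");
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may split one run of text over several calls.
                tokens.add("CD_STRING");
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                tokens.add("E_END");
            }
        };
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
            + "<rdf:Description rdf:about=\"http://example.org/a\"/>"
            + "</rdf:RDF>";
        System.out.println(tokenize(xml));
    }
}
```

A real token would also carry the element name, attribute value or
character data; the sketch drops these to show only the token kinds.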
The dataflow is implemented with Java threads, which is hugely inefficient
but achieves my design goals. The tokens produced by the keyword
recogniser are stuffed into a pipe (like Doug Lea's BoundedBuffer).
The JavaCC RDF parser then pulls tokens out of the pipe.
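
The two-thread pipe arrangement can be sketched roughly as follows. This is
an illustration only, not the ARP implementation: it uses
java.util.concurrent.ArrayBlockingQueue as a stand-in for a BoundedBuffer,
a plain loop as a stand-in for the JavaCC parser, and a made-up "EOF"
marker to signal the end of the token stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class TokenPipeSketch {
    // Hypothetical end-of-stream marker; must not clash with a real token.
    static final String EOF = "EOF";

    // Runs a producer thread (standing in for SAX + keyword recogniser)
    // and consumes its tokens on the calling thread (standing in for the
    // parser), connected by a bounded pipe of the given capacity.
    static List<String> runPipeline(List<String> produced, int capacity)
            throws InterruptedException {
        BlockingQueue<String> pipe = new ArrayBlockingQueue<>(capacity);

        Thread producer = new Thread(() -> {
            try {
                for (String token : produced) {
                    pipe.put(token);   // blocks while the pipe is full
                }
                pipe.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        List<String> consumed = new ArrayList<>();
        // take() blocks while the pipe is empty.
        for (String t = pipe.take(); !t.equals(EOF); t = pipe.take()) {
            consumed.add(t);
        }
        producer.join();
        return consumed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(
            runPipeline(List.of("E_OTHER", "CD_STRING", "E_END"), 1));
    }
}
```

With a small capacity the two threads hand over almost every token
individually, which is where the thread-switching cost comes from.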
The JavaCC definition file:
http://www-uk.hpl.hp.com/people/jjc/arp/arp-1_0_3/src/com/hp/hpl/jena/rdf/arp/rdf.jj
Features/bugs of this design:
+ The tokens are the smallest items in the infoset.
+ This creates attribute-ordering issues, since SAX does not guarantee the
order in which attributes are reported (my grammar and keyword recogniser
have an agreed, application-defined order for handling attributes).
+ Extending this to the full infoset would mean that the application has to
identify those parts of the infoset which should appear in the token stream,
and those parts that aren't interesting; e.g. in my case comments are
discarded.
I have done some experiments with an LALR(1) parser in place of the JavaCC
LL(1) parser. Inverting control in that parser, so that tokens are pushed
into it and the threads are no longer needed, produces a very significant
speed-up (the thread overhead is huge). I would expect that inverting the
XML parsing (i.e. pull parsing) would produce a similar speed-up.
My parser page is:
http://www-uk.hpl.hp.com/people/jjc/arp
Jeremy Carroll
HP Labs