[Xerces2] design motivations

neilg Mon, 20 Aug 2001 11:48:45 -0700

Hi folks,

It's interesting how easy it is to lose sight of the forest when
you're surrounded by many trees.  We've been concentrating hard on the
Xerces2 schema parsing redesign, grammar resolution design and
validation for so long that it only very recently occurred to us that
we might not have shared the motivations for these designs with the
rest of the community.  That gap may explain why contributions to
the discussion have so far come from only a small number of
people.

To understand where we're coming from, a few things should be noted
about Xerces1.  No doubt everybody's aware that, by contrast with the
Xerces2 emphasis on pipelines and modular design, Xerces1's design is
rather monolithic--all its parts are tightly coupled and there are a
lot of nasty interdependencies.  One of the big motivations of the
work we're trying to do--as has been true for all the other areas of
Xerces2--is to make the design as modular as possible, so that people
can reuse whatever portions of the parser they find useful in their
applications.

The Xerces1 design had other drawbacks, especially when it comes to
schemas.  The whole Xerces1 grammar structure was based on DTD's, and
schema support was more or less grafted on to it; that is to say
schema grammars sort of look like overgrown DTD's in the Xerces1
world.  For a while one could argue that even if this wasn't elegant,
at least it promoted code reuse.  But as time went by and the Xerces1
support of schemas matured, it became more apparent to us that schemas
really are a beast of their own, and that treating them like
DTD's simply made for really hacked-up, unmaintainable code.
So while in Xerces2 we want to reuse code if it looks feasible, this
isn't our primary goal.

Xerces1 looked at the importing or including of one schema by another
much in the way that it viewed the referencing of an external entity by
an internal DTD subset.  As a consequence, Xerces1 has some
limitations with respect to schemas that reference one another, and
overcoming them would have required substantial redesign.  This,
along with a desire to break up one class (the
infamous TraverseSchema, currently at version #241!) which had
grown to over 9,000 lines, is the principal motivator behind the
SchemaHandler interface that we've been kicking around.

Xerces2's schema support will also have to eventually include some
means of exposing the PSVI (post-schema-validation infoset).
We're uncertain what form this will take--whether it will be a
DOM API, some kind of output like that produced by Henry Thompson's
XSV, or even an XNI-based API--but, as we're redesigning schema parsing
and building the SchemaGrammar representation, we'll have to take care
to make it sufficiently rich to store all the information necessary to
make this happen.

Another obvious weakness of Xerces1 was the fact that grammars could
not be reused.  So an infrastructure for caching grammars is another
very important requirement for Xerces2.  On the other hand, since
grammars will be cacheable, it becomes somewhat
less important that the conversion of schema documents into grammar
objects be lightning fast.  Thus, while we certainly want our schema
parsing to be efficient, we're quite prepared to sacrifice the odd
shortcut if it makes the design cleaner and more comprehensible.

That said, we're still very much concerned about efficiently
validating instance documents according to grammars.  The Xerces1
validator is another maintenance and efficiency nightmare, largely
because it tries to be "universal"--that is, to validate against both
schemas and DTD's.  Here again we're prepared to sacrifice some amount
of shared code to make validators that are more modular and better at
what they do:  thus, we believe that Xerces2 needs specialized DTD and
Schema validators.

Another requirement, this one imposed by DOM Level 3--and indeed one
for which we got a lot of requests when Andy circulated a Xerces2
features survey last January--is DOM tree revalidation.  While this
might have
been doable in Xerces1 without too much redesign, we're taking care in
Xerces2 to take this into account from the beginning as we contemplate
how validation should occur.

Down the road, it would certainly be nice if Xerces2 could parse an
XML document into a Xalan-type DTM.  It would also be nice if it could
support validation according to RELAX NG grammars.  DOM Level 3
contemplates adding XPath support to the DOM, and obviously we'll want
to implement this as well.  All this we have to keep in mind as we're
putting things together, although this last set is further
down the list.  In fact, one of the hardest things we're finding about
this whole process--and perhaps the main reason it's been so slow--is its
scope and the difficulty of developing and keeping a focus on what
needs to be done first.

Having set all this out--and once again asking folks to keep your
wishlists manageable lest the design discussions become even more
bogged down--I'm very curious to know whether this list misses
anything significant.  Are there important things that people
would like to see Xerces2 do that I haven't mentioned?  Do people
agree generally with the stance we're taking, or would you like us to
place emphasis differently--on faster schema processing vs. a focus on
good code, for instance?

As always, feedback is more than welcome.  And if people agree with
the tack we're taking, hopefully this posting will make the
discussions we're currently having about the shape of grammar caching
or validation more approachable, and easier to join for people who
are less familiar with the guts of how things are done.

Cheers,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  416-448-3519, T/L 778-3519
E-mail:  [EMAIL PROTECTED]

