DOM revalidation: design/ open issues

Elena Litani Wed, 05 Jun 2002 14:33:40 -0700

I want to expand more on the DOM revalidation mechanism, since I did
have some open issues and I hope we can resolve those together.


General concept
-------------------
The main class for DOM revalidation is called DOMNormalizer. It acts as
both a scanner and a document handler.

As a scanner, the DOMNormalizer walks the document, performs Text node
normalization, namespace fixup, etc., and sends the appropriate XNI
events to a validator.

As a document handler it receives:
a) default attributes: the DOM attributes are exposed via AttributeProxy
(which implements XMLAttributes), which allows the validator to add new
attributes to the tree.
b) PSVI augmentations (not implemented yet): used to retrieve default
element content and schema-normalized data. The DOMNormalizer should
have an Augmentations object and pass it to a validator (using XNI
calls). The validator will update that object with the correct PSVI
information.

The above can be done using XMLDocumentHandler calls. However, there are
2 pieces of information that are not easy to pass using the defined
interfaces:

a) documentURI - the XMLSchemaValidator should be able to resolve the
location of the schema documents supplied in xsi:schemaLocation
attributes. The baseURI relative to which those locations should be
resolved must be provided.

b) characters - the XNI calls use XMLString. Using the same structure
while revalidating a document in memory is a pain, since we would need
to copy characters all the time.

To solve those problems I've added a new interface called
xerces.util.RevalidationHandler that extends XMLDocumentHandler and adds
2 methods:
a) setBaseURI()
b) characterData(String data, augmentations), used to send the
character data that characters() would otherwise carry

All validators implement this interface.
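
To make the shape of the interface concrete, here is a rough sketch.
It is not the actual Xerces source: DocumentHandler and Augmentations
below are simplified stand-ins for the XNI types, the void return types
and parameter names are assumptions, and RecordingValidator is a
hypothetical implementation that only records what it is sent.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the XNI types (assumptions, not the real interfaces).
interface Augmentations { }

interface DocumentHandler {
    // Stand-in for the XMLString-based characters() callback.
    void characters(char[] ch, int offset, int length, Augmentations augs);
}

// Sketch of the extension: two extra methods on top of the document handler.
interface RevalidationHandler extends DocumentHandler {
    // Base URI against which schema locations are resolved.
    void setBaseURI(String baseURI);

    // Character content handed over as a String, so no buffer copy is needed.
    void characterData(String data, Augmentations augs);
}

// Hypothetical validator that just records the calls it receives.
class RecordingValidator implements RevalidationHandler {
    String baseURI;
    final List<String> text = new ArrayList<>();

    public void setBaseURI(String baseURI) { this.baseURI = baseURI; }

    public void characters(char[] ch, int offset, int length, Augmentations augs) {
        text.add(new String(ch, offset, length));
    }

    public void characterData(String data, Augmentations augs) {
        text.add(data);
    }
}
```

With this shape, the DOMNormalizer can pass the String it gets from a
Text node straight to characterData() instead of filling an XMLString.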

Any validator must be reset before use. The reset() methods take an
XMLComponentManager. To avoid modifying the code in each of the
validators and to have a common holder for features and properties
needed during DOM revalidation, I've added a specialized
DOMValidationConfiguration as a field on the Document.
The configuration stores the following data:
-- features related to validation
-- symbol table
-- error handler
-- XMLEntityManager (see open issues)
-- Message formatters 
etc.
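
The idea can be sketched as follows. This is an illustration only:
ComponentManager is a stand-in for XMLComponentManager,
ValidationConfiguration and SketchValidator are hypothetical names, and
the "error-handler" property key is made up for the example (the
validation feature id is the standard SAX one).

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for XMLComponentManager: a read-only view of features and properties.
interface ComponentManager {
    boolean getFeature(String featureId);
    Object getProperty(String propertyId);
}

// Sketch of a per-document holder for everything a validator needs at reset.
class ValidationConfiguration implements ComponentManager {
    private final Map<String, Boolean> features = new HashMap<>();
    private final Map<String, Object> properties = new HashMap<>();

    void setFeature(String id, boolean value) { features.put(id, value); }
    void setProperty(String id, Object value) { properties.put(id, value); }

    public boolean getFeature(String id) {
        // Features that were never set read as false in this sketch.
        return Boolean.TRUE.equals(features.get(id));
    }

    public Object getProperty(String id) { return properties.get(id); }
}

// A validator pulls its settings from the manager in reset(),
// so the validator code itself does not need to know about the DOM.
class SketchValidator {
    boolean validating;
    Object errorHandler;

    void reset(ComponentManager manager) {
        validating = manager.getFeature("http://xml.org/sax/features/validation");
        errorHandler = manager.getProperty("error-handler"); // hypothetical key
    }
}
```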

The advantage of this approach is that all information is stored in one
place and passed to a validator during reset.
The disadvantage is that we carry one extra object on the document
(Note: the object is created only if the user calls normalizeDocument
with the "validation" feature turned on).

Open issues
------------
(1) RevalidationHandler should include additional methods.

After the PSVI is exposed via DOM, element declarations will be
available in the DOM tree for unmodified elements.
RevalidationHandler should be able to pass declarations for unmodified
elements to the XMLSchemaValidator, to save declaration lookup time and
avoid unnecessary validation checks.

Another reason why we need to pass element declarations is the XPath 2.0
validate() function: it also allows specifying the declarations against
which validation should occur.

After we polish the RevalidationHandler interface I think it should be
moved to the XNI package.

(2) Copy the grammar object from parser to Document. 

If the document was previously parsed using the Xerces parser, it would
be nice to be able to copy the grammar that was used to validate the
document to the DOM tree. The problem I was having is retrieving from
XMLGrammarPool a single grammar identified by namespace (XML Schema) or
root name (DTD).

If we store the grammar in the document, revalidation becomes faster. In
addition, we can make sure that the grammar used for revalidation is
exactly the same (with DTD grammar preloading we have a problem with
internal subsets).

On the other hand, Xerces DOM will use more memory.

(3) XMLEntityManager - seems too big for the purpose of revalidation.
All we need is the ability to expand system identifier based on the
baseURI. Any ideas?
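
As one possible starting point, plain java.net.URI resolution already
covers the basic case. The sketch below is an assumption about what a
lightweight replacement could look like (SystemIdExpander is a
hypothetical name); it does not attempt any of the fixups for file
names or illegal characters that a full entity manager may perform.

```java
import java.net.URI;

// Hypothetical helper: expands a system identifier against a base URI.
final class SystemIdExpander {
    private SystemIdExpander() { }

    // Absolute identifiers are returned unchanged; relative ones are
    // resolved against baseURI. Throws IllegalArgumentException if either
    // string is not a syntactically valid URI.
    static String expand(String systemId, String baseURI) {
        URI id = URI.create(systemId);
        if (id.isAbsolute() || baseURI == null) {
            return id.toString();
        }
        return URI.create(baseURI).resolve(id).toString();
    }
}
```

For example, expanding "schemas/po.xsd" against
"http://example.org/docs/order.xml" yields
"http://example.org/docs/schemas/po.xsd".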

(4) Symbol Table
The Document creates a shadow symbol table (that is used in
revalidation). The shadow table has a link to the parser symbol table.
Again the problem is memory usage, especially if the symbol table grows.
Any ideas on how to handle it differently?
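
For readers unfamiliar with the shadowing idea, here is a simplified
sketch. SimpleSymbolTable and ShadowTable are hypothetical classes
built on a HashMap, not the Xerces implementation; they only show how
lookups are delegated to the parser table while new symbols stay local.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal symbol table: maps equal strings to one canonical instance.
class SimpleSymbolTable {
    private final Map<String, String> symbols = new HashMap<>();

    boolean containsSymbol(String s) { return symbols.containsKey(s); }

    // Returns the canonical instance for s, adding s if it is not present.
    String addSymbol(String s) {
        String existing = symbols.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() { return symbols.size(); }
}

// Shadow table: reuses symbols already known to the parent (parser) table
// and stores only the symbols the parent has never seen.
class ShadowTable extends SimpleSymbolTable {
    private final SimpleSymbolTable parent;

    ShadowTable(SimpleSymbolTable parent) { this.parent = parent; }

    @Override
    String addSymbol(String s) {
        if (parent.containsSymbol(s)) {
            return parent.addSymbol(s); // canonical instance from the parent
        }
        return super.addSymbol(s);      // local to this document only
    }
}
```

In this sketch the parent table is never modified by revalidation, but
it stays reachable for as long as the shadow table does, which is the
memory concern described above.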


Threads
--------------------
The DOMImplementation stores a reference to only one validator (which
is created if revalidation is needed -- thus the first call to
revalidate a document is _very_ expensive).
The normalizeDocument method calls the synchronized getValidator() and
releaseValidator() methods on the DOMImplementation around
revalidation.
This means that if there are multiple threads running, threads will need
to wait until this validator is released.
We are considering implementing a pool of validators on the
DOMImplementation.
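
A pool along these lines might look roughly like the sketch below. The
ValidatorPool class and its factory argument are assumptions made for
illustration; only the getValidator/releaseValidator names come from
the description above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical pool: reuses released validators and creates a new one
// only when none is idle, so callers never wait for each other.
class ValidatorPool<V> {
    private final Deque<V> idle = new ArrayDeque<>();
    private final Supplier<V> factory;

    ValidatorPool(Supplier<V> factory) { this.factory = factory; }

    // Hands out an idle validator, or builds a fresh one if the pool is empty.
    synchronized V getValidator() {
        V validator = idle.pollFirst();
        return validator != null ? validator : factory.get();
    }

    // Puts a validator back so that a later caller can reuse it.
    synchronized void releaseValidator(V validator) {
        idle.addFirst(validator);
    }
}
```

The synchronized blocks here only guard the queue, not the validation
itself. The trade-off is that the pool in this sketch is unbounded, so
a burst of concurrent revalidations leaves that many validators alive.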


Possible optimizations
----------------------
a) If the user calls normalizeDocument() twice in a row with no
modifications and the same set of features, we need not traverse the
tree, but currently we do.

b) The DOMNormalizer already stores all information needed for the
namespace binding. However, given the current code in XMLSchemaValidator
we also have to pass start/endPrefixMapping.
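
Optimization (a) could be done with a modification counter, roughly as
in the sketch below. SketchDocument, its fields and the mutate() hook
are all hypothetical; the traversals field merely stands in for the
real tree walk so the effect is visible.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical document that remembers the state of its last normalization.
class SketchDocument {
    private int changes;                 // bumped on every mutation
    private int lastNormalizedAt = -1;   // value of 'changes' at the last walk
    private Set<String> lastFeatures;    // features used for the last walk
    int traversals;                      // counts full tree walks

    // Would be called from every method that modifies the tree.
    void mutate() { changes++; }

    void normalizeDocument(Set<String> features) {
        if (changes == lastNormalizedAt && features.equals(lastFeatures)) {
            return; // nothing changed since the last run: skip the walk
        }
        traversals++; // stands in for the full traversal and revalidation
        lastNormalizedAt = changes;
        lastFeatures = new HashSet<>(features); // defensive copy
    }
}
```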



Well, this is it for now... :)

-- 
Elena Litani / IBM Toronto
