I want to expand on the DOM revalidation mechanism, since I have some open issues that I hope we can resolve together.

General concept
-------------------

The main class for DOM revalidation is called DOMNormalizer. It acts as both a scanner and a document handler.

As a scanner, the DOMNormalizer walks the document, normalizes Text nodes, fixes up namespaces, etc., and sends the appropriate XNI events to a validator.

As a document handler it receives:
a) default attributes: the DOM attributes are exposed via AttributeProxy (which implements XMLAttributes), which allows new attributes to be added to the tree.
b) PSVI augmentations (not implemented yet): used to retrieve default element content and schema normalized data. The DOMNormalizer should have an augmentations object and pass it to a validator (using XNI calls). The validator will update that object with the correct PSVI information.

The above can be done using XMLDocumentHandler calls. However, there are 2 pieces of information that are not easy to pass using the defined interfaces:
a) documentURI - the XMLSchemaValidator should be able to resolve the location of the schema documents supplied in xsi:schemaLocation attributes. The baseURI relative to which those locations are resolved must be provided.
b) characters - the XNI calls use XMLString. Using the same structure while revalidating a document in memory is a pain, since we would need to copy characters all the time.

To solve those problems I've added a new interface called xerces.util.RevalidationHandler that extends XMLDocumentHandler and adds 2 methods:
a) setBaseURI()
b) characterData(String data, augmentations) (to send character data, in place of characters())

All validators implement this interface.

Any validator must be reset before usage. The reset() methods take an XMLComponentManager. To avoid modifying the code in each of the validators, and to have a common holder for the features and properties needed during DOM revalidation, I've added a specialized DOMValidationConfiguration as a field on the Document.
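
To make the RevalidationHandler idea described above a bit more concrete, here is a rough, self-contained sketch. It does not use the real Xerces classes: DocumentHandler is a hypothetical stand-in for XMLDocumentHandler, the augmentations argument is left out, RecordingValidator is invented purely for illustration, and java.net.URI is assumed to be a good enough substitute for the actual system identifier expansion.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class RevalidationSketch {

    /** Hypothetical stand-in for XMLDocumentHandler; only two callbacks shown. */
    interface DocumentHandler {
        void startElement(String qname);
        void endElement(String qname);
    }

    /** Assumed shape of the handler: the two extra methods described above. */
    interface RevalidationHandler extends DocumentHandler {
        void setBaseURI(String baseURI);
        // Takes a String directly, so DOM text need not be copied into a char buffer.
        void characterData(String data);
    }

    /** Illustration only: records the events it receives instead of validating. */
    static class RecordingValidator implements RevalidationHandler {
        final List<String> events = new ArrayList<>();
        private URI base;

        public void setBaseURI(String baseURI) { base = URI.create(baseURI); }
        public void startElement(String qname) { events.add("start:" + qname); }
        public void endElement(String qname) { events.add("end:" + qname); }
        public void characterData(String data) { events.add("chars:" + data); }

        /** Resolves a schema location hint against the base URI set earlier. */
        String resolveSchemaLocation(String hint) {
            return base.resolve(hint).toString();
        }
    }

    public static void main(String[] args) {
        RecordingValidator v = new RecordingValidator();
        v.setBaseURI("http://example.org/docs/order.xml");
        v.startElement("order");
        v.characterData("42");
        v.endElement("order");
        System.out.println(v.events);
        System.out.println(v.resolveSchemaLocation("schemas/order.xsd"));
    }
}
```

The point of the sketch is only that the caller (the DOMNormalizer in our case) hands over the base URI once and then passes String data straight through; the real interface would of course carry the augmentations as well.
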
The configuration stores the following data:
-- features related to validation
-- symbol table
-- error handler
-- XMLEntityManager (see open issues)
-- Message formatters
etc.

The pro of this approach is that all the information is stored in one place and passed to a validator during reset. The con is that we carry one extra object on the document (Note: the object is created only if the user calls normalizeDocument with the "validation" feature turned on).

Open issues
------------

(1) RevalidationHandler should include additional methods.
After the PSVI is exposed via DOM, element declarations will be available in the DOM tree for unmodified elements. RevalidationHandler should be able to pass the declarations for unmodified elements to the XMLSchemaValidator, to save declaration lookup time and avoid unnecessary validation checks. Another reason why we need to pass element declarations is the XPath 2.0 validate() function: it also allows specifying the declarations against which validation should occur. After we polish the RevalidationHandler interface, I think it should be moved to the XNI package.

(2) Copy the grammar object from parser to Document.
If the document was preparsed using the Xerces parser, it would be nice to be able to copy the grammar that was used to validate the document to the DOM tree. The problem I was having is the ability to retrieve from the XMLGrammarPool one grammar identified by namespace (XML Schema) or root name (DTD). If we store the grammar in the document, revalidation becomes faster. In addition, we can make sure that the grammar used for revalidation is exactly the same (with DTD grammar preloading we have a problem with internal subsets). On the other hand, the Xerces DOM will use more memory.

(3) XMLEntityManager - seems too big for the purpose of revalidation.
All we need is the ability to expand a system identifier based on the baseURI. Any ideas?

(4) Symbol Table
Document creates a shadow symbol table (that is used in revalidation).
The shadow table has a link to the parser symbol table. Again the problem is memory usage, especially if the symbol table grows. Any ideas on how to handle it differently?

Threads
--------------------

The DOMImplementation stores a reference to only one validator (which is created if revalidation is needed -- thus the first call to revalidate a document is _very_ expensive). The normalizeDocument method calls the synchronized getValidator() and releaseValidator() methods on the DOMImplementation. This means that if there are multiple threads running, threads will need to wait until this validator is released. We are considering implementing a pool of validators on DOMImplementation.

Possible optimizations
----------------------

a) If the user calls normalizeDocument() twice in a row with no modifications and the same set of features, we need not traverse the tree, but currently we do.
b) The DOMNormalizer already stores all the information needed for the namespace binding. However, given the current code in XMLSchemaValidator, we also have to pass start/endPrefixMapping.

Well, this is it for now... :)
--
Elena Litani / IBM Toronto
