RE: [Xerces2] How do we want Grammar Caching

Arnold, Curt 20 Aug 2001 22:23:13 -0000

> So my take is that the output should be a set (array) of 
> "grammar"s. They serve the same purpose as a big "schema", 
> but it's easy to take just one of them and reuse it.


However, since A can import B and B can import A, and neither namespace can be
completely defined without having some elements from the other namespace
present, I don't believe that you can cleanly separate an arbitrary schema
into isolated namespace grammars.

> 
> 2. One/many schema grammar(s) for a namespace (or no-namespace)
> 
> While applications want extreme flexibility from the parser, 
> should we extend such flexibility to a degree that for a 
> given namespace, there could be more than one corresponding 
> grammar? Though the spec doesn't forbid this, it allows 
> (and to some degree, encourages) the parser to have a 
> one-grammar-per-namespace rule.

The parser shouldn't make this decision.  For instance, suppose you have two
partial schemas for different aspects of a namespace, such as HTML frameset
and HTML strict.  If one parse instance detects, from the presence of the
<frame> element, that the frameset partial schema is appropriate, that should
not have any effect on another parse that has detected the <body> element.
> 
> For example, a grammar for namespace N is already known to 
> the parser, then it sees an element from N. Does the parser 
> use that known grammar, or does it ask the application for 
> another one? Another example: A imports B and C, then both B 
> and C import D. Does the parser ask the application for D 
> twice (which might result in two different D's, and note that 
> it would be an error if the two D's have element declarations 
> of the same name)?

The parser should know which grammars it has resolved for the current
document, but shouldn't have any memory between documents.  So the second time
within one document that a particular namespace is requested, it should use
the resource resolved the first time.

You can have two different element declarations of the same name within one
schema document.  You just can't have two with namespace-wide scope.  If you
set a precedence rule, so that when there are two namespace-scoped definitions
the one that was loaded first wins, then there is no need for an error.
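
To make the precedence idea concrete, here is a minimal sketch.  The class and
method names are made up for illustration and are not proposed Xerces2 API;
the only point is that a second namespace-scoped declaration with the same
expanded name is ignored rather than reported as an error.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are Xerces2 API.
final class FirstWinsSketch {
    /** Namespace-scoped element declarations, keyed by "{namespaceURI}localName". */
    static final class GlobalElementMap {
        private final Map<String, String> declarations = new LinkedHashMap<>();

        /**
         * Registers a declaration unless one with the same expanded name is
         * already present. Returns true if the declaration was added.
         */
        boolean declare(String namespaceURI, String localName, String declaration) {
            String key = "{" + namespaceURI + "}" + localName;
            return declarations.putIfAbsent(key, declaration) == null;
        }

        /** Returns the declaration that was registered first, or null if none. */
        String lookup(String namespaceURI, String localName) {
            return declarations.get("{" + namespaceURI + "}" + localName);
        }
    }
}
```

With this, loading two different D's in sequence simply leaves the first D's
declarations in place for any name that both of them define.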


> Based on these two points, I'll try to describe what I have 
> seen from the requirements, and how they can be implemented. 
> But it doesn't mean that this will be the implementation in 
> Xerces2. The final design depends on how necessary it is, how 
> difficult it is, and how it conforms to / conflicts with 
> other standards. In fact, lots of ideas below are from the 
> responses of you guys (thanks :-)). I'm just trying to put 
> them together.
> 
> 1. Schema compiling
> 
> If a schema document includes/redefines another schema 
> document, the application will be given an opportunity to 
> override the schemaLocation hint.
> 
> If a schema document imports a namespace (or no-namespace),
> a. if a grammar for such namespace is already known to the 
> parser, then that grammar is used; 

Again, I feel the parser should have no schema memory between documents, and
that any such memory should live in the implementation of the GrammarResolver.
So the first time that a namespace is encountered, a GrammarResolver call is
made.  Any subsequent use of that namespace should not result in a
GrammarResolver call.  You definitely want to avoid the case of making 5000
attempts to resolve a bad schema location during one parse.
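
Roughly what I have in mind, as a sketch only.  Grammar and Resolver here are
stand-ins rather than the interfaces under discussion, and DocumentGrammars is
a name I made up.  The object is created per parse, so nothing survives
between documents, and a failed lookup is remembered so it is not retried.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: Grammar and Resolver are placeholders, not proposed API.
final class PerDocumentMemorySketch {
    interface Grammar {}

    interface Resolver {
        /** Returns a grammar for the namespace, or null if none is available. */
        Grammar resolve(String namespaceURI);
    }

    /** Created at the start of each parse and discarded when the parse ends. */
    static final class DocumentGrammars {
        private final Resolver resolver;
        // Optional.empty() records a failed lookup so that it is not retried.
        private final Map<String, Optional<Grammar>> seen = new HashMap<>();

        DocumentGrammars(Resolver resolver) {
            this.resolver = resolver;
        }

        /** Asks the resolver at most once per namespace within this document. */
        Grammar grammarFor(String namespaceURI) {
            Optional<Grammar> cached = seen.get(namespaceURI);
            if (cached == null) {
                cached = Optional.ofNullable(resolver.resolve(namespaceURI));
                seen.put(namespaceURI, cached);
            }
            return cached.orElse(null);
        }
    }
}
```

Any caching across documents would then be the business of whatever sits
behind the resolver, not of the parser.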

> b. otherwise, the 
> application will be given an opportunity to provide the 
> grammar object for that namespace; 

This should be the typical case.

> c. if a grammar is not 
> provided from (b), then the application can override the 
> schemaLocation hint.

The GrammarResolver needs a mechanism to say "I don't have a grammar, but I
don't want you to use the schemaLocation either" (since it might be a DoS
attack).
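
One way to express that, purely as an illustration, is a result with three
outcomes instead of a bare Grammar-or-null.  The Resolution and Outcome names
below are my own invention and appear nowhere in the proposed interface.

```java
import java.util.Objects;

// Hypothetical sketch: a three-way answer from the resolver, not proposed API.
final class ResolutionSketch {
    interface Grammar {}

    enum Outcome {
        GRAMMAR,            // the application supplied a grammar
        USE_LOCATION_HINT,  // no grammar, but the parser may follow the hint
        NO_GRAMMAR          // no grammar, and the hint must not be fetched
    }

    static final class Resolution {
        private final Outcome outcome;
        private final Grammar grammar; // non-null only when outcome == GRAMMAR

        private Resolution(Outcome outcome, Grammar grammar) {
            this.outcome = outcome;
            this.grammar = grammar;
        }

        static Resolution of(Grammar grammar) {
            return new Resolution(Outcome.GRAMMAR, Objects.requireNonNull(grammar));
        }

        static Resolution useLocationHint() {
            return new Resolution(Outcome.USE_LOCATION_HINT, null);
        }

        static Resolution noGrammar() {
            return new Resolution(Outcome.NO_GRAMMAR, null);
        }

        Outcome outcome() { return outcome; }

        Grammar grammar() { return grammar; }

        /** True only when the application has explicitly allowed the fetch. */
        boolean mayFetchHint() { return outcome == Outcome.USE_LOCATION_HINT; }
    }
}
```

The design choice is that fetching a schemaLocation is something the
application opts into, never the fallback when it stays silent.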


> 
> 2. Instance validation against schema
> 
> For an element/attribute from a certain namespace,
> a. if a grammar for such namespace is already known to the 
> parser, then that grammar is used; b. otherwise, the 
> application will be given an opportunity to provide the 
> grammar object for that namespace.
> 
> For an xsi:schemaLocation/noNamespaceSchemaLocation 
> attribute, a. if a grammar for such namespace is already 
> known to the parser, then that grammar is used; b. otherwise, 
> the application will be given an opportunity to provide the 
> grammar object for that namespace; c. if a grammar is not 
> provided from (b), then the application can override the 
> schemaLocation hint.
> 
> 3. DTD compiling
> 4. Instance validation against DTD
> 
> I don't know DTD as much as folks do to come up with a good 
> algorithm. So I need help from you guys again. We might need 
> to consider the following questions. a. How to cache internal 
> DTD subset vs. external DTD subset; b. What's the minimum 
> unit of DTD grammar to cache? An internal subset? An external 
> subset (corresponding to a .DTD file)? Or a group of 
> internal/external subsets? c. Any other questions that block 
> us from a clean and specific design.
> 
> 
> Having nailed down the algorithm for schema validation, we 
> can start to
> *imagine* some implementation details. (Note: this is just 
> "imagination" instead of "decision"; and we need to fill the 
> blanks for DTD.)
> 
> To the parser, there is a channel through which the parser 
> communicates with the application. Naturally, we can have an 
> interface with callback methods to serve as such channel (a 
> name GrammarResolver seems to be acceptable). I've attached 
> an interface prototype of GrammarResolver at the end of the 
> message. Some points that are worth pointing out:
> 
> 1. According to our earlier discussion, this interface works 
> for Schema. There are still open issues:
>  - How does it support caching DTD grammars?
>  - How does it support caching other types of grammars?
>  - Is this interface sufficient for caching any kinds of grammars?
> 
> 2. Note that the two "resolveGrammarLocation" methods at the 
> end seem to overlap with the "resolveEntity" method of 
> "EntityResolver": they are all used to override a location. 
> And in Xerces1, we did use EntityResolver to override schema 
> location hints. So it causes confusion when both 
> GrammarResolver and EntityResolver are specified: which one 
> to use, or which one takes higher priority? To solve this, we 
> can drop the two "resolveGrammarLocation" methods, and use 
> the EntityResolver all the time, but we are risking losing 
> grammar type and, possibly, namespace information. We 
> currently lean more to another solution: derive 
> GrammarResolver from EntityResolver. So that you only need to 
> set one resolver for such purpose, and we can clearly state 
> which method(s) (the one from EntityResolver or 
> GrammarResolver) takes precedence. Any comments on either of 
> the two approaches?
> 
> 3. There is still a problem for the design of schema caching. 
> Assume grammar A (that is, a grammar with target namespace A) 
> is known to the parser, then the parser asks the application 
> for grammar B. The application returns grammar B, which 
> imports a different grammar A. Now the two A's conflict (we 
> assumed a one-grammar-per-namespace rule). To avoid such a 
> conflict, in the GrammarResolver interface, we ask the 
> application to provide a different grammar B in this case 
> (method grammarConflict()).

There is nothing inherently evil about having two distinct schemas for one
namespace.  You just don't allow subsequently loaded schemas for the same
namespace to replace any previously loaded namespace-scoped elements or types.

So if you have namespace A which imports HTML Strict and namespace B which
imports HTML Frameset, all the schema definitions could coexist without
conflict.  If A imports C and B imports C', then the complementary C'
definitions could be added to the global element and type map, and
inconsistent C' definitions could still be accessed through the content
model of a B element.
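
A small sketch of that merging behaviour, under the assumption that a content
model holds direct references to its child declarations rather than looking
them up by name.  ElementDecl and DocumentScope are illustrative names only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: illustrative types, not the Xerces2 grammar model.
final class MergeSketch {
    static final class ElementDecl {
        final String namespaceURI;
        final String localName;
        final String origin; // which schema document supplied this declaration
        // A content model refers to its child declarations directly, not by name.
        final List<ElementDecl> contentModel = new ArrayList<>();

        ElementDecl(String namespaceURI, String localName, String origin) {
            this.namespaceURI = namespaceURI;
            this.localName = localName;
            this.origin = origin;
        }
    }

    /** The document-scope map of namespace-scoped element declarations. */
    static final class DocumentScope {
        private final Map<String, ElementDecl> globals = new HashMap<>();

        /** Adds each new declaration; on a clash the earlier one stays. Returns how many were added. */
        int merge(List<ElementDecl> declarations) {
            int added = 0;
            for (ElementDecl d : declarations) {
                if (globals.putIfAbsent(key(d.namespaceURI, d.localName), d) == null) {
                    added++;
                }
            }
            return added;
        }

        ElementDecl global(String namespaceURI, String localName) {
            return globals.get(key(namespaceURI, localName));
        }

        private static String key(String namespaceURI, String localName) {
            return "{" + namespaceURI + "}" + localName;
        }
    }
}
```

If C is merged before C', a clashing C' declaration never reaches the global
map, yet a B element whose content model references it still sees the C'
version.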

> 
> 4. The parser doesn't interact with a grammar pool directly. 
> It's the job of the GrammarResolver to get grammars from the 
> pool, and give them to the parser. So the caching detail is 
> separated from the validation process of the parser, and is 
> under full control of the application.

Definitely my favored approach.

> 
> 5. Xerces2 will provide a default implementation of 
> GrammarResolver, which interacts with a default 
> implementation of grammar pool: a. The pool is shared across 
> the application; b. The pool is thread safe; c. Put every 
> grammar in the pool; d. Always try to get grammars from the 
> pool first.

At least one, but I would try to make GrammarResolver intuitive enough that
providing your own isn't a huge deal.

> 
> Is this default behavior satisfying? Of course, people can 
> always contribute other general-purpose 
> GrammarResolver/GrammarPool implementations.
> 
> Did I miss anything?
> 
> Cheers,
> Sandy Gao
> Software Developer, IBM Canada
> (1-416) 448-3255
> [EMAIL PROTECTED]
> 
> //***************************************
> public interface GrammarResolver // extends EntityResolver ??
> {
>   // we are trying to make this GrammarResolver
>   // work for all kinds of grammars, so we have
>   // a parameter "grammarType" for each of the
>   // methods: "schema", "dtd:internal",
>   // "dtd:external", etc.
> 
>   // retrieve the initial known set of grammars
>   // this method is called before the validation
>   // starts. the application can provide an
>   // initial set of grammars that's available
>   // to the current validation attempt.
>   // REVISIT: do we make a copy of the returned
>   // hashtable, or do we add grammars directly
>   // into this returned hashtable? If we choose
>   // the latter, the following
>   // returnFinalGrammarSet() is not necessary.
>   public Hashtable getInitialGrammarSet(String grammarType);

I think loading grammars by exception, that is, only when one is actually
needed, is cleaner, and this method is unnecessary.  Say you have 100 grammars
in your pool: do you want to copy all 100 into the document if you are only
going to use three?

> 
>   // return the final set of grammars
>   // this method is called after the validation
>   // finishes. the application can then choose
>   // to cache some of the returned grammars.
>   public void returnFinalGrammarSet(String grammarType,
>                                     Hashtable grammars);

Again, you have the option to participate in every schema resolution event,
so if you want the whole list, you have the chance to assemble it yourself.
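
For example, an application that wants the final set could wrap its resolver
and note what it hands out.  This is a sketch with invented names
(RecordingResolver, used) and a one-argument resolve method chosen for
brevity, not the signature being proposed.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: Grammar and Resolver are placeholders, not proposed API.
final class RecordingResolverSketch {
    interface Grammar {}

    interface Resolver {
        /** Returns a grammar for the namespace, or null if none is available. */
        Grammar resolve(String namespaceURI);
    }

    /** Wraps another resolver and remembers what it handed to the parser. */
    static final class RecordingResolver implements Resolver {
        private final Resolver delegate;
        private final Map<String, Grammar> handedOut = new LinkedHashMap<>();

        RecordingResolver(Resolver delegate) {
            this.delegate = delegate;
        }

        @Override
        public Grammar resolve(String namespaceURI) {
            Grammar grammar = delegate.resolve(namespaceURI);
            if (grammar != null) {
                handedOut.put(namespaceURI, grammar);
            }
            return grammar;
        }

        /** The grammars used so far, in the order they were first requested. */
        Map<String, Grammar> used() {
            return Collections.unmodifiableMap(handedOut);
        }
    }
}
```

With a pool of 100 grammars behind the delegate, a document that touches three
namespaces leaves exactly three entries in used(), and nothing was copied up
front.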

> 
>   // ask the application to provide a grammar for
>   // the given namespace.
>   public Grammar resolveGrammar(String grammarType,
>                                 String grammarKey);

Is grammar type a requirement or a hint?  If Grammar is an abstract interface,
does it matter whether I return a Grammar that represents an XML Schema or a
Grammar that represents a RELAX NG grammar?

Is grammarKey the namespace URI or something else?

> 
>   // ask the application to provide a grammar for
>   // a certain node: the qname of the node plus the
>   // node type: element or attribute.
>   // I don't like this method, but I can see it's
>   // useful in some cases.
>   public Grammar resolveGrammar(String grammarType,
>                                 QName nodeName,
>                                 int nodeType);

QName can't just be "someprefix:sometag", since you have to have some
mechanism to get the namespace URI.  I would have done one resolveGrammar call,
something like:

    public Grammar resolveGrammar(String namespaceURI,
                                  String publicId,
                                  String schemaLocationHintOrSystemID,
                                  String absolutizedSchemaLocationHint,
                                  String localName,
                                  short nodeType);

And if a null Grammar is returned, then the parser does not attempt to load
the schemaLocationHint.  The simplest implementation would just ignore all
parameters but the namespaceURI.
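
As a sketch of that simplest implementation: the interface below just restates
the signature I suggested above, Grammar is a placeholder, and
NamespaceOnlyResolver and register are names invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the signature follows the suggestion above; nothing here is Xerces2 API.
final class SimplestResolverSketch {
    interface Grammar {}

    interface GrammarResolver {
        Grammar resolveGrammar(String namespaceURI,
                               String publicId,
                               String schemaLocationHintOrSystemID,
                               String absolutizedSchemaLocationHint,
                               String localName,
                               short nodeType);
    }

    /** Looks grammars up by namespace URI and ignores every other argument. */
    static final class NamespaceOnlyResolver implements GrammarResolver {
        private final Map<String, Grammar> byNamespace = new HashMap<>();

        void register(String namespaceURI, Grammar grammar) {
            byNamespace.put(namespaceURI, grammar);
        }

        @Override
        public Grammar resolveGrammar(String namespaceURI,
                                      String publicId,
                                      String schemaLocationHintOrSystemID,
                                      String absolutizedSchemaLocationHint,
                                      String localName,
                                      short nodeType) {
            // Returning null for an unknown namespace means "no grammar, and
            // do not go and fetch the location hint".
            return byNamespace.get(namespaceURI);
        }
    }
}
```

An application that never wants the parser to fetch anything from the network
would register the grammars it trusts and let everything else come back null.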
> 
>   // the grammar or one of the grammars it depends
>   // on (directly or indirectly) conflicts with a
>   // grammar known to the parser.
>   // the application is asked to provide another
>   // grammar for the same target namespace.
>   // the application can choose to return null,
>   // which has the same effect as returning null
>   // from resolveGrammar().
>   public Grammar grammarConflict(String grammarType,
>                                  Grammar grammar);

If you do not let a schema add a duplicate element to the document-scope
element and type maps (or, alternatively, you always check the first-loaded
Grammar in preference to secondary Grammars), then conflicts aren't a problem.


> 
>   // REVISIT:
>   // the following two methods are used to override
>   // schemaLocation hints. They don't really need to
>   // be here, especially the one for include/redefine.
>   // the application can override the locations by
>   // the old-fashioned EntityResolver.
> 
>   // ask the application to override import
>   // schemaLocation.
>   public XMLInputSource resolveGrammarLocation(String grammarType,
>                                                String grammarKey,
>                                                String hint);
>   // ask the application to override include/redefine
>   // schemaLocation.
>   public XMLInputSource resolveGrammarLocation(String grammarType,
>                                                String hint);
> }
> 
