RE: [Xerces2] How do we want Grammar Caching

Arnold, Curt 21 Aug 2001 22:25:44 -0000

> > However, since A can import B and B import A and neither namespace can be
> > completely defined without having some elements from the other namespace
> > present, I don't believe that you can cleanly separate an arbitrary schema
> > into isolated namespace grammars.
> 
> True. But we can't ignore the possibility that A imports B 
> and the application wants to cache only B. So we can't put 
> declarations from different namespaces in one big grammar. 
> They have to be separated by namespace, and in different 
> Grammar objects. And in each of such objects, there is a list 
> of other grammars that it depends on. This serves both
> purposes:
> - Inter-dependent grammars are always together;
> - Independent grammar can be moved around (cached and 
> assigned to other document instance).
>


I think that if namespaces are intertwined then trying to separate them is very 
complex and it just seems to cause a lot of edge cases.

> > The GrammarResolver needs a mechanism to say that I don't have a grammar
> > but I don't want you to use schemaLocation either (since it might be a
> > DoS attack).
> 
> The application can override the location hint, by providing 
> a new location. But I didn't think about the case that the 
> application doesn't want to resolve this grammar at all. 
> Should we provide such flexibility? And how?

I think it is essential.  I'd suggest returning null from resolveGrammar and 
having the parser take that as a definitive answer.
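A minimal sketch of that rule, for illustration only. The Grammar and GrammarResolver interfaces are cut down to a single namespace parameter, and GrammarLookup, hintsFollowed and loadFromHint are hypothetical names, not anything in Xerces2. The point is only that once a resolver is installed, a null return stops the lookup and the schemaLocation hint is never dereferenced.

```java
// Hypothetical stand-in for the Grammar type under discussion.
interface Grammar {
    String getTargetNamespace();
}

// Hypothetical resolver interface; only the namespace parameter is modeled.
interface GrammarResolver {
    // Returning null means "no grammar, and do not go looking for one".
    Grammar resolveGrammar(String namespaceURI);
}

// Hypothetical parser-side lookup that treats the resolver's answer as final.
class GrammarLookup {
    private final GrammarResolver resolver;
    private int hintsFollowed = 0;

    GrammarLookup(GrammarResolver resolver) {
        this.resolver = resolver;
    }

    Grammar find(String namespaceURI, String schemaLocationHint) {
        if (resolver != null) {
            // The resolver's answer, including null, is definitive.
            return resolver.resolveGrammar(namespaceURI);
        }
        // Only a parser with no resolver installed follows the hint.
        hintsFollowed++;
        return loadFromHint(namespaceURI, schemaLocationHint);
    }

    int hintsFollowed() {
        return hintsFollowed;
    }

    // Placeholder for real schema loading; it fetches nothing.
    private Grammar loadFromHint(String namespaceURI, String hint) {
        return () -> namespaceURI;
    }
}
```

A server-side application would install a resolver that returns null for anything it does not recognize; a desktop editing tool could simply leave the resolver unset.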

> 
> > So if you have namespace A which imports HTML Strict and namespace B
> > which imports HTML Frameset, all the schema definitions could coincide
> > without conflict.  If A imports C and B imports C', then the
> > complementary C' definitions could be added to the global element and
> > type map and inconsistent C' definitions could still be accessed through
> > the content model of a B element.
> 
> If the intersection of C and C' (global-declared components) 
> is empty, then it's solvable: we can combine the declarations 
> in the two grammars into one grammar (at least conceptually). But
> - checking whether the intersection is empty needs time (maybe a lot);
> - if the intersection is empty, why not use a new C to 
> include both C and C' at the first place?

The A and B grammars may have been loaded independently with different 
respective xsi:schemaLocations, and the only time the conflict is detected is 
when the two namespaces are used together.  You can prevent the conflict only 
if you are involved when the schemas are loaded.

> But if the intersection is not empty, it would definitely 
> cause trouble, because sometimes content model is not enough, 
> and an instance can refer to some global declaration directly 
> (wildcards and xsi:type).

I think the trouble can be avoided (or at least resolved within the leeway 
provided by the schema spec by treating one of the conflicting definitions as 
dominant).

> So I'd like to see the parser has a 
> "one-grammar-per-namespace" rule for each instance.

To avoid ever getting two distinct grammars for a namespace into a document, 
you really have to extend this to say that a grammar resolver has only one 
grammar for every namespace that is imported by any of its known grammars.
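As a rough illustration of what the extended rule would require, here is a sketch of a pool that refuses a grammar when anything it imports, directly or transitively, clashes with a grammar already held for the same namespace. Grammar, SingleGrammarPool and register are hypothetical names for this sketch, and identity comparison stands in for whatever "same grammar" would really mean.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical grammar: a namespace plus the grammars it imports.
class Grammar {
    final String namespace;
    final List<Grammar> imports = new ArrayList<>();

    Grammar(String namespace) {
        this.namespace = namespace;
    }
}

// Hypothetical pool enforcing one grammar per namespace, imports included.
class SingleGrammarPool {
    private final Map<String, Grammar> byNamespace = new HashMap<>();

    // Adds the grammar and everything it imports, or throws and leaves
    // the pool unchanged.
    void register(Grammar grammar) {
        Map<String, Grammar> pending = new HashMap<>();
        collect(grammar, pending, new IdentityHashMap<>());
        byNamespace.putAll(pending);
    }

    private void collect(Grammar g, Map<String, Grammar> pending,
                         Map<Grammar, Boolean> seen) {
        if (seen.put(g, Boolean.TRUE) != null) {
            return; // already visited; this also ends mutual-import cycles
        }
        Grammar known = byNamespace.containsKey(g.namespace)
                ? byNamespace.get(g.namespace)
                : pending.get(g.namespace);
        if (known != null && known != g) {
            throw new IllegalStateException(
                    "Two distinct grammars for namespace " + g.namespace);
        }
        pending.put(g.namespace, g);
        for (Grammar imported : g.imports) {
            collect(imported, pending, seen);
        }
    }

    Grammar get(String namespace) {
        return byNamespace.get(namespace);
    }
}
```

With A importing C and B importing C', registering A succeeds and registering B is rejected, which is exactly the situation where the conflict would otherwise only surface when both namespaces appear in one instance.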


> > I think loading grammar by exception is cleaner and this method is
> > unnecessary.  Say you have 100 grammars in your pool: do you want to
> > copy all 100 into the document if you are only going to use three?
> 
> By "exception", do you mean through "resolveGrammar()"?

Yes 

> I can see this type of application: it knows that all the 
> instances it parses use the same 5 grammars. So it wants to 
> set these 5 grammars on the parser at once, and doesn't care 
> about any further grammar resolution issues. That is, it 
> implements "getInitialGrammarSet()", and ignores all other methods.

If it has a HashMap then the body of resolveGrammar is simply

Grammar resolveGrammar(String nsURI...) {
        return map.get(nsURI);
}

Seems simpler just to implement this one method.
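Spelled out as compilable Java, that might look like the sketch below. The Grammar and GrammarResolver interfaces are hypothetical stand-ins, and the parameters elided above are left out entirely, so only the namespace URI is modeled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Grammar type under discussion.
interface Grammar {
    String getTargetNamespace();
}

// Hypothetical resolver interface; only the namespace parameter is modeled.
interface GrammarResolver {
    Grammar resolveGrammar(String namespaceURI);
}

// An application that knows its handful of grammars up front.
class MapGrammarResolver implements GrammarResolver {
    private final Map<String, Grammar> map = new HashMap<>();

    // Registers a grammar under its own target namespace.
    void put(Grammar grammar) {
        map.put(grammar.getTargetNamespace(), grammar);
    }

    @Override
    public Grammar resolveGrammar(String namespaceURI) {
        // Unknown namespaces yield null.
        return map.get(namespaceURI);
    }
}
```

Only the grammars a document actually uses are ever handed to the parser, no matter how many are in the map.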

> Well, it's not always true. For example, if a grammar is 
> resolved from a disk file, then it's only stored in a list 
> local to the parser. "returnFinalGrammarSet()" is the only 
> chance to send it back to the application.

resolveGrammar should be in control if a grammar is loaded from a disk file.  
We have probably been missing a parameter like GrammarResolutionContext that 
provides services to resolveGrammar.  So a
resolveGrammar call could look like:

Grammar resolveGrammar(String nsURI,..., GrammarResolutionContext context) {
        Grammar grammar = context.loadSchema(schemaLocationHint);
        myset.add(grammar);
        return grammar;
}

If you do something like:

Grammar resolveGrammar(String nsURI,..., GrammarResolutionContext context) {
        Grammar grammar = context.loadDTD(publicID, systemID);
        myset.add(grammar);
        return grammar;
}

Then the call to context.loadDTD might trigger the traditional call to 
entityResolver.
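To make the shape of this concrete, here is a small sketch in which the context is an interface supplied by the parser and the resolver keeps every grammar it asked the context to load. GrammarResolutionContext is only a proposal in this thread, so its methods here are assumptions, and RetainingResolver and retainedCount are hypothetical names. Keying the retained grammars by namespace is also an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Grammar type under discussion.
interface Grammar {
}

// Hypothetical services the parser offers while a grammar is being resolved.
interface GrammarResolutionContext {
    Grammar loadSchema(String schemaLocationHint);

    Grammar loadDTD(String publicId, String systemId);
}

// A resolver that stays in control of loading and keeps what was loaded.
class RetainingResolver {
    private final Map<String, Grammar> retained = new HashMap<>();

    Grammar resolveGrammar(String nsURI, String schemaLocationHint,
                           GrammarResolutionContext context) {
        Grammar grammar = retained.get(nsURI);
        if (grammar == null) {
            // The parser does the parsing; the resolver decides whether to ask.
            grammar = context.loadSchema(schemaLocationHint);
            if (grammar != null) {
                retained.put(nsURI, grammar);
            }
        }
        return grammar;
    }

    int retainedCount() {
        return retained.size();
    }
}
```

Because the application sees every grammar at the moment it is loaded, nothing needs to be handed back at the end of the parse.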

> >   // ask the application to provide a grammar for
> >   // the given namespace.
> >   public Grammar resolveGrammar(String grammarType,
> >                                 String grammarKey);
> 
> > Is grammar type a requirement or a hint?
> 
> In Xerces2, there are different validators and grammar 
> parsers for different kinds of grammars (DTD, schema, ...), 
> so we always know which type of grammar we are expecting, 
> hence it's always possible to provide such grammarType parameter.
> 
> Yes it does. Since different type of grammars use different 
> validators, when the parser asks for a schema grammar, it has 
> to get a schema grammar.

Instead of type being a String, would a Class be more appropriate?  The 
DTDValidator is really expecting an object that implements DTDGrammar and not 
just any instance of Grammar.
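A sketch of what a Class-typed parameter would buy, under the assumption that DTD and schema grammars are distinguishable subtypes of Grammar. The three interfaces below are hypothetical placeholders named after the types mentioned here, and TypedGrammarResolver and TypedLookup are invented for the example.

```java
// Hypothetical placeholder types, named after the ones mentioned above.
interface Grammar {
}

interface DTDGrammar extends Grammar {
}

interface SchemaGrammar extends Grammar {
}

// Hypothetical resolver whose grammarType is a Class rather than a String.
interface TypedGrammarResolver {
    Grammar resolveGrammar(Class<? extends Grammar> grammarType,
                           String grammarKey);
}

// What a validator could do with the answer.
class TypedLookup {
    static <T extends Grammar> T resolve(TypedGrammarResolver resolver,
                                         Class<T> grammarType,
                                         String grammarKey) {
        Grammar grammar = resolver.resolveGrammar(grammarType, grammarKey);
        if (grammar == null) {
            return null;
        }
        if (!grammarType.isInstance(grammar)) {
            // The resolver handed back the wrong kind of grammar.
            throw new ClassCastException("Expected " + grammarType.getName()
                    + " but got " + grammar.getClass().getName());
        }
        return grammarType.cast(grammar);
    }
}
```

The validator gets back exactly the type it can use, and a mismatch is caught at the boundary instead of deep inside validation.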

> 
> > Is GrammarKey the Namespace URI or something else?
> 
> The key has different meaning for different type of grammar. 
> For schema, it's the namespace; for external DTD, it's a file 
> location; for internal DTD, it's the root element.

For external DTD you may have two distinct keys, publicID and systemID.  For 
schema, either the namespace or the schemaLocation hint (which could share the 
same parameter as systemId) might be used to resolve the grammar.

Do you want the resolver to be able to override an internal subset?

> > Qname can't just be "someprefix:sometag" since you have to have some
> > mechanism to get the namespace URI.
> 
> In Xerces, QName reflects all the information: prefix, 
> resolved namespace uri, localpart, rawname.

Okay.

> 
> > I would have done one resolveGrammar call something like:
> >
> >   public Grammar resolveGrammar(String namespaceURI,
> >                                 String publicId,
> >                                 String schemaLocationHintOrSystemID,
> >                                 String absolutizedSchemaLocationHint,
> >                                 String localName,
> >                                 short nodeType);
> 
> My concerns about this method:
> 
> 1. The need for grammarType. Many applications need to do 
> both DTD and schema validation.

I understand the binding between validator and grammar now.  So I'd add a 
Class grammarType parameter.


> 2. The method is only for namespace-aware and namespace-keyed 
> grammars. How about DTD? How about a future grammar that 
> doesn't use namespace as key?

If namespaceAware() = false or there is no namespace declared for the document 
element, namespaceURI would be null.  If there is a default namespace, a DTD 
resolver could just ignore it and resolve
the DTD based on publicID or systemID.

There really seems to be a multiplicity of things that any particular resolver 
might want to use to resolve the grammar.

A DTD resolver could use the namespace, public id, system id or element name to 
determine the appropriate grammar.

A Schema resolver could use the namespace (or null), system id (aka the 
in-scope schemaLocation hint) and element or attribute name.
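One way to picture a single call serving both kinds of resolver is to treat the arguments as one bundle from which each resolver picks the key it cares about. GrammarRequest and KeySelection are hypothetical names, and the preference orders shown (public id before system id for DTDs, namespace before hint for schemas) are assumptions made for the sketch, not something settled in this thread.

```java
// Hypothetical bundle of the values a resolver might be handed.
final class GrammarRequest {
    final String namespaceURI; // null if not namespace-aware or none declared
    final String publicId;     // DTD public id, if any
    final String systemId;     // DTD system id or in-scope schemaLocation hint
    final String localName;    // element or attribute name

    GrammarRequest(String namespaceURI, String publicId,
                   String systemId, String localName) {
        this.namespaceURI = namespaceURI;
        this.publicId = publicId;
        this.systemId = systemId;
        this.localName = localName;
    }
}

// Each kind of resolver reads only the fields it wants.
final class KeySelection {
    private KeySelection() {
    }

    // A DTD resolver ignoring the namespace: public id first, then system id.
    static String dtdKey(GrammarRequest request) {
        return request.publicId != null ? request.publicId : request.systemId;
    }

    // A schema resolver: namespace first, falling back to the location hint.
    static String schemaKey(GrammarRequest request) {
        return request.namespaceURI != null
                ? request.namespaceURI
                : request.systemId;
    }
}
```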


> 3. The method mixes the two cases:
>   a. <... xsi:schemaLocation="namespaceURI, locationHint"/>
>   b. <nsprefix:localpart .../>
> You can argue that if localName==null, then it's the first 
> case. But an interface is meant to be clean.

I think these are still clean.  An xsi:schemaLocation for a namespace would not 
trigger a call to resolveGrammar, only the use of an unrecognized namespace.  
So if I had:

<a:foo xmlns:b="http://www.example.org/bar" 
xsi:schemaLocation="http://www.example.org/bar bar.xsd">
        <a:description/>
        <a:whatever/>
        <b:bar/>
</a:foo>

resolveGrammar would be triggered when processing <b:bar>, however the in-scope 
schema location hint would be a useful piece of information.
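The behavior described can be imitated with a plain SAX handler, as a sketch only. ResolveTrigger and scan are hypothetical names, the handler merely records when a resolveGrammar call would be made and with which hint, and hints are simply remembered from the point they are seen rather than scoped to the element that carried them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Records where a resolveGrammar call would be triggered (hypothetical).
class ResolveTrigger extends DefaultHandler {
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    private final Set<String> known;
    private final Map<String, String> hints = new HashMap<>();
    final List<String> calls = new ArrayList<>();

    ResolveTrigger(Set<String> knownNamespaces) {
        this.known = new HashSet<>(knownNamespaces);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Seeing xsi:schemaLocation only records the hints; it triggers nothing.
        String pairs = atts.getValue(XSI, "schemaLocation");
        if (pairs != null) {
            String[] tokens = pairs.trim().split("\\s+");
            for (int i = 0; i + 1 < tokens.length; i += 2) {
                hints.put(tokens[i], tokens[i + 1]);
            }
        }
        // The first element in an unrecognized namespace is the trigger.
        if (!known.contains(uri)) {
            calls.add(uri + " -> " + hints.get(uri));
            known.add(uri);
        }
    }

    static List<String> scan(String xml, Set<String> knownNamespaces) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            ResolveTrigger handler = new ResolveTrigger(knownNamespaces);
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.calls;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Run over the document above (with the a and xsi prefixes declared, and a's namespace already known to the parser), this records a single call, made at <b:bar>, carrying bar.xsd as the hint.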

> 4. You are expecting GrammarResolver to actually parse a 
> grammar document. Consider the case of mutually-importing: 
> SchemaHandler is trying to solve grammar A, and A imports B, 
> so SchemaHandler calls GrammarResolver for B. GrammarResolver 
> then decides to parse B, and call SchemaHandler to do so. 
> Then SchemaHandler sees that B imports A, and ask for A from 
> the GrammarResolver. An infinite loop! My idea is that 
> GrammarResolver never tries to parse a grammar document. When 
> a method of GrammarResolver is called, we are already in a 
> validator or a grammar parser, which knows how to parse 
> grammars. So there is no need for GrammarResolver to care 
> about it. GrammarResolver is responsible for
> - return a grammar if that grammar is already parsed and cached
> - override a grammar location by returning an InputSource
> And we definitely need two methods for the two cases. This is 
> why I have resolveGrammar and resolveGrammarLocation. Having 
> a location hint as a parameter of resolveGrammar would lead 
> the application to think that it's responsible of grammar parsing.

That seems to be the place where the previously mentioned 
GrammarResolutionContext would come in.  I think the GrammarResolver should be 
in full control, but it can depend on services provided by the
current parse context.
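The mutual-import loop raised above does not have to follow from letting the resolver ask for a load, provided the context remembers which grammars it has already started. The sketch below is an assumption-laden illustration: ResolutionContext is hypothetical, an import table stands in for real schema parsing, and loads are keyed by namespace rather than by location hint.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical grammar: a namespace plus the grammars it imports.
class Grammar {
    final String namespace;
    final List<Grammar> imports = new ArrayList<>();

    Grammar(String namespace) {
        this.namespace = namespace;
    }
}

// Hypothetical context that refuses to start the same grammar twice.
class ResolutionContext {
    // Stand-in for parsing: namespace -> namespaces its schema imports.
    private final Map<String, List<String>> importTable;
    private final Map<String, Grammar> started = new HashMap<>();
    int parses = 0;

    ResolutionContext(Map<String, List<String>> importTable) {
        this.importTable = importTable;
    }

    Grammar loadSchema(String namespace) {
        Grammar existing = started.get(namespace);
        if (existing != null) {
            return existing; // finished, or still being built further up
        }
        Grammar grammar = new Grammar(namespace);
        started.put(namespace, grammar); // register before following imports
        parses++;
        List<String> imported =
                importTable.getOrDefault(namespace, Collections.emptyList());
        for (String importedNamespace : imported) {
            grammar.imports.add(loadSchema(importedNamespace));
        }
        return grammar;
    }
}
```

When B's import of A is reached while A is still being built, the context hands back the partially built A instead of parsing it again, so each schema is parsed once and the recursion ends.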

> > And if Grammar = null is returned, then the parser does not attempt to
> > load the schemaLocationHint.  The simplest implementation would just
> > ignore all parameters but the namespaceURI.
> 
> Returning null only means that a grammar for this grammarKey 
> is not cached. The parser then needs to resolve it from the 
> location, and the application can override that location.

Again, resolving based on xsi:schemaLocation or a DTD's system ID should 
typically be inhibited except in desktop document development.


> Well, we certainly can deal with this case. But the 
> application might get some surprising and unexplainable errors 
> (as I mentioned, wildcards and xsi:type). And I don't see why 
> an application would want an instance to be validated against 
> two different grammars with the same namespace at the same time.

Partitioning namespaces between schemas seems to be a pretty useful thing and 
by the time you realize that you are using namespaces that imported different 
partitions of the same namespace, it may be
too late.  If you ever resolve using xsi:schemaLocation, you have this 
possibility.

> This "grammarConflict()" method has other uses:
> - the application returns a grammar of one type (DTD)  while 
> the parser is asking for another type (schema).
> - the application returns a grammar for one namespace while 
> the parser is asking for another namespace.

Those seem like cases where validation should throw an exception (or act as 
if GrammarResolver had returned null).

> 
> What's your opinion of whether to derive GrammarResolver from 
> EntityResolver, or whether to remove resolveGrammarLocation() 
> methods and use EntityResolver instead?
> 

I think they should be distinct.  The current EntityResolver would come into 
play when a GrammarResolver called context.loadDTD(publicID,systemID) or 
something similar.
