RE: [Xerces2] How do we want Grammar Caching

sandygao 22 Aug 2001 22:44:50 -0000

Hi Curt. I'm so glad that there are other people who have thought about
grammar caching as much as (or more than) I did. :-)


> I think that if namespaces are intertwined then trying to
> separate them is very complex and it just seems to cause a
> lot of edge cases.

In fact, this is what we did in Xerces1: storing components from different
namespaces in different Grammar objects. We didn't have too much trouble
with it, so I don't think this is a problem.

Of course, storing everything in one big schema object might save us some
programming time, and might have better performance. But it eliminates the
possibility of caching just one grammar (of one namespace). Then if both A
and B import C, you will end up parsing C twice. But remember that the
reason we want grammar caching is that grammar parsing is expensive.

And if we decide to put everything in one big grammar, it would be really
painful if we ever want to separate them.
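
A minimal sketch of the per-namespace layout, to make the "parse C only once"
point concrete. Everything here is hypothetical (the class names, the
importTable map that stands in for real schema documents, the parseCount
counter); it is not Xerces code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Xerces code: one Grammar object per namespace.
class NamespaceGrammarCacheSketch {

    // Holds the components of exactly one namespace plus the grammars it imports.
    static final class Grammar {
        final String namespace;
        final List<Grammar> imports = new ArrayList<>();

        Grammar(String namespace) {
            this.namespace = namespace;
        }
    }

    private final Map<String, Grammar> cache = new HashMap<>();
    private final Map<String, List<String>> importTable;
    int parseCount = 0; // number of times the expensive "parse" step ran

    // importTable stands in for schema documents: namespace -> imported namespaces.
    NamespaceGrammarCacheSketch(Map<String, List<String>> importTable) {
        this.importTable = importTable;
    }

    Grammar load(String namespace) {
        Grammar cached = cache.get(namespace);
        if (cached != null) {
            return cached; // parsed earlier, reuse it
        }
        parseCount++;
        Grammar grammar = new Grammar(namespace);
        cache.put(namespace, grammar); // registered before following imports
        List<String> imported =
                importTable.getOrDefault(namespace, Collections.<String>emptyList());
        for (String ns : imported) {
            grammar.imports.add(load(ns));
        }
        return grammar;
    }
}
```

With A and B both importing C, loading A and then B runs the parse step three
times (A, B, C) rather than four, and both grammars point at the same C object.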

> > The application can override the location hint, by providing
> > a new location. But I didn't think about the case that the
> > application doesn't want to resolve this grammar at all.
> > Should we provide such flexibility? And how?
>
> I think it is essential.  I'd suggest returning null from
> resolveGrammar and having the parser take that as a definitive
> answer.

Well, there might be some really simple applications that don't care about
grammar caching at all, so they don't even implement GrammarResolver, or
they want to ignore the "resolveGrammar" call. So I don't think we can treat
"returning null" as a "definitive answer".

Again, let's consider the case of mutual-importing. When SchemaHandler (the
schema parser/compiler) sees that A imports B, it will ask GrammarResolver
for B. Now GrammarResolver has to return a grammar (since returning null
means the application doesn't want a grammar for namespace B), so
GrammarResolver has to call SchemaHandler to resolve B (maybe via what you
called GrammarResolutionContext). So this is inevitably an infinite loop.

To properly solve this mutual-importing case, we have a new schema
compiling design. This is why we have a new SchemaHandler. (Please refer to
some postings from Neil Graham titled "[Xerces-2]: schema parsing design;
discussion starter [long]" and "[Xerces2] design motivations").

With the new design, in the case of mutual-importing, schema documents A
and B will be compiled at the same time. So what GrammarResolver should
really do is return null from resolveGrammar, and leave the schema
compiling job to the SchemaHandler.

So the application needs another way to tell the parser that it doesn't
want to resolve a grammar for a specific namespace, instead of returning
null.
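
One possible shape for that "other way", sketched under loud assumptions: the
Outcome enum, the Result wrapper and the nextStep method are all invented for
illustration and are not part of any proposed Xerces2 interface. The idea is
only that "not cached" and "do not resolve" become two distinct answers.

```java
// Hypothetical sketch: none of these names come from Xerces.
class ResolutionOutcomeSketch {

    static final class Grammar {
        final String namespace;

        Grammar(String namespace) {
            this.namespace = namespace;
        }
    }

    enum Outcome { CACHED, NOT_CACHED, DO_NOT_RESOLVE }

    static final class Result {
        final Outcome outcome;
        final Grammar grammar; // non-null only for CACHED

        private Result(Outcome outcome, Grammar grammar) {
            this.outcome = outcome;
            this.grammar = grammar;
        }

        static Result cached(Grammar grammar) {
            return new Result(Outcome.CACHED, grammar);
        }

        static Result notCached() {
            return new Result(Outcome.NOT_CACHED, null);
        }

        static Result doNotResolve() {
            return new Result(Outcome.DO_NOT_RESOLVE, null);
        }
    }

    interface GrammarResolver {
        Result resolveGrammar(String namespace);
    }

    // What the parser would do next. A missing resolver behaves like NOT_CACHED.
    static String nextStep(GrammarResolver resolver, String namespace) {
        Result result = (resolver == null)
                ? Result.notCached()
                : resolver.resolveGrammar(namespace);
        switch (result.outcome) {
            case CACHED:
                return "use cached grammar";
            case DO_NOT_RESOLVE:
                return "validate without a grammar for this namespace";
            default:
                return "let SchemaHandler compile from the location hint";
        }
    }
}
```

A simple application that implements nothing still gets the default path,
while a cautious one can refuse resolution explicitly.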

> > If the intersection of C and C' (global-declared components)
> > is empty, then it's solvable: we can combine the declarations
> > in the two grammars into one grammar (at least conceptually). But
> > - checking whether the intersection is empty needs time (maybe a lot);
> > - if the intersection is empty, why not use a new C to
> > include both C and C' at the first place?
>
> The A and B grammars may have been loaded independently with
> different respected xsi:schemaLocations and the only time the
> conflict is detected is when the two namespaces are used together.
> You can prevent only the conflict if you are involved when the schemas
> are loaded.

Eventually, all grammars that are available to an instance are stored in the
local grammar set. To the instance, there is no concept of importing. All
grammars are at the same level. Then C and C' conflict.

To solve it, when I see B imports C', I'll check in the local grammar set
to see whether a grammar for that namespace is there. In this case, C is
already in the local set, so we'll use that for B, and don't parse C'.
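
The lookup described above can be sketched roughly as follows. The class and
method names (LocalGrammarSetSketch, resolveImport) and the file names used in
the usage note are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a per-instance local grammar set.
class LocalGrammarSetSketch {

    static final class Grammar {
        final String namespace;
        final String location; // where this grammar was actually parsed from

        Grammar(String namespace, String location) {
            this.namespace = namespace;
            this.location = location;
        }
    }

    private final Map<String, Grammar> byNamespace = new LinkedHashMap<>();
    int parseCount = 0; // counts simulated parses

    // Called when an <import> for the namespace is seen.
    Grammar resolveImport(String namespace, String locationHint) {
        Grammar existing = byNamespace.get(namespace);
        if (existing != null) {
            return existing; // a grammar for this namespace wins; the hint is ignored
        }
        parseCount++;
        Grammar parsed = new Grammar(namespace, locationHint);
        byNamespace.put(namespace, parsed);
        return parsed;
    }
}
```

If A's import arrives first with hint "C.xsd" and B's import follows with
"Cprime.xsd" for the same namespace, the second call returns the grammar
parsed from "C.xsd" and nothing is parsed again.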

> > But if the intersection is not empty, it would definitely
> > cause trouble, because sometimes content model is not enough,
> > and an instance can refer to some global declaration directly
> > (wildcards and xsi:type).
>
> I think the trouble can be avoided (or at least resolved within
> the leeway provided by the schema spec by treating one of the
> conflicting definitions as dominant).

Then consider this case: D imports A and B, while A imports C and B imports
C'. The schema spec clearly states that all these schema documents form a
big "schema", and it's an error if there are duplicate declarations in this
"schema". So choosing one as dominant is not what the spec intended.

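To illustrate why "pick a dominant one" and "report an error" differ, here is
a deliberately simplified sketch. It treats every schema document as a set of
global declaration names and treats any name collision as a conflict; both are
assumptions made for the example, and the names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: merge global declarations and reject duplicates.
class DuplicateDeclarationSketch {

    // Each set holds the global declaration names of one schema document
    // for the same namespace (for example C and C').
    static Set<String> merge(List<Set<String>> documents) {
        Set<String> all = new LinkedHashSet<>();
        for (Set<String> document : documents) {
            for (String name : document) {
                if (!all.add(name)) {
                    // Simplification: any repeated name is reported as an error.
                    throw new IllegalStateException("duplicate declaration: " + name);
                }
            }
        }
        return all;
    }
}
```

Disjoint C and C' merge cleanly; overlapping ones fail instead of silently
preferring one definition.
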
> > So I'd like to see the parser has a
> > "one-grammar-per-namespace" rule for each instance.
>
> To avoid ever getting two distinct grammars for a namespace
> into a document, you really have to extend this to say that a
> grammar resolver only has one grammar for every namespace that
> is imported by any of its known grammars.

Right. That's in fact what I really meant. The reason I didn't mention
"import" is that, to the instance, all grammars are at the same level, and
there is no such concept as "import".

> > I can see this type of applications: it knows that all the
> > instances it parses use the same 5 grammars. So it wants to
> > set these 5 grammars to the parser at once, and don't care
> > about any further grammar resolution issues. That is, it
> > implements "getInitialGrammarSet()", and ignore all other methods.
>
> If it has a HashMap then the body of resolveGrammar is simply
>
> Grammar resolveGrammar(String nsURI...) {
>          return map.get(nsURI);
> }
>
> Seems simpler just to implement this one method.

It's also simple to have

Grammar[]/Map/Hashtable getInitialGrammarSet() {
    return grammars;
}

Then each application can choose whichever way suits it best.
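
A sketch of how both styles could coexist, assuming an interface with default
implementations so that an application overrides only the method it cares
about. The interface shape, the Map return type and the parser-side helpers
are assumptions, not a settled design.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: two ways for an application to supply grammars.
class ResolverStylesSketch {

    static final class Grammar {
        final String namespace;

        Grammar(String namespace) {
            this.namespace = namespace;
        }
    }

    interface GrammarResolver {
        // Style 1: hand over everything up front.
        default Map<String, Grammar> getInitialGrammarSet() {
            return Collections.emptyMap();
        }

        // Style 2: answer on demand; null means "not cached".
        default Grammar resolveGrammar(String namespace) {
            return null;
        }
    }

    // Parser side: seed the local set once per instance.
    static Map<String, Grammar> newLocalSet(GrammarResolver resolver) {
        return new HashMap<>(resolver.getInitialGrammarSet());
    }

    // Parser side: look in the local set first, then ask the resolver.
    static Grammar find(GrammarResolver resolver, Map<String, Grammar> localSet, String ns) {
        Grammar grammar = localSet.get(ns);
        if (grammar == null) {
            grammar = resolver.resolveGrammar(ns);
            if (grammar != null) {
                localSet.put(ns, grammar);
            }
        }
        return grammar;
    }
}
```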

> resolveGrammar should be in control if a grammar is loaded
> from a disk file.  We have probably been missing a parameter
> like GrammarResolutionContext that provides services to
> resolveGrammar.  So a resolveGrammar call could look like:

As I said, GrammarResolver shouldn't be responsible for (or involved in)
grammar compiling. It only returns a cached grammar, or overrides a grammar
location. Otherwise mutual-importing will cause an infinite loop.
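
To show why keeping compilation inside one component avoids the loop, here is
a small worklist sketch. It is not the actual SchemaHandler design; the method
name compileTogether and the imports map (namespace to imported namespaces)
are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: compile mutually importing schemas in one pass.
class MutualImportSketch {

    // Returns the namespaces in the order they were picked up for compilation.
    static List<String> compileTogether(String root, Map<String, List<String>> imports) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(root);
        while (!pending.isEmpty()) {
            String ns = pending.poll();
            if (!seen.add(ns)) {
                continue; // already scheduled once, so no re-entry and no loop
            }
            for (String imported : imports.getOrDefault(ns, Collections.<String>emptyList())) {
                if (!seen.contains(imported)) {
                    pending.add(imported);
                }
            }
        }
        return new ArrayList<>(seen);
    }
}
```

With A importing B and B importing A, starting from A visits A and then B and
terminates, because the compiler itself tracks what it has already seen.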

> Instead of type being a String, would a Class be more
> appropriate?

I didn't think about it too much. What's the benefit of having grammarType
as a class? What fields/methods do you see in such a class?

> For external DTD you may have two distinct keys, publicID
> and systemID.  For schema, either the namespace and
> schemaLocation hint (which could share the same parameter
> as systemId) might be used to resolve the grammar.

Right. So having grammarKey as a String might not be a good idea. A
GrammarKey interface would be more proper. And we can have DTDGrammarKey
and SchemaGrammarKey to implement such interface.
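
A rough sketch of what such keys might look like. Only the three type names
come from this thread; the grammarType() method, the fields, and the choice to
make keys usable in a HashMap via equals/hashCode are assumptions.

```java
import java.util.Objects;

// Hypothetical sketch of a GrammarKey hierarchy.
class GrammarKeySketch {

    interface GrammarKey {
        String grammarType(); // assumed discriminator, e.g. "schema" or "dtd"
    }

    // A schema grammar is keyed by its target namespace.
    static final class SchemaGrammarKey implements GrammarKey {
        final String namespace;

        SchemaGrammarKey(String namespace) {
            this.namespace = namespace;
        }

        @Override
        public String grammarType() {
            return "schema";
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof SchemaGrammarKey
                    && Objects.equals(namespace, ((SchemaGrammarKey) o).namespace);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(namespace);
        }
    }

    // An external DTD is keyed by public and system identifiers.
    static final class DTDGrammarKey implements GrammarKey {
        final String publicId;
        final String systemId;

        DTDGrammarKey(String publicId, String systemId) {
            this.publicId = publicId;
            this.systemId = systemId;
        }

        @Override
        public String grammarType() {
            return "dtd";
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof DTDGrammarKey)) {
                return false;
            }
            DTDGrammarKey other = (DTDGrammarKey) o;
            return Objects.equals(publicId, other.publicId)
                    && Objects.equals(systemId, other.systemId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(publicId, systemId);
        }
    }
}
```

A future grammar type would add its own key class without touching the
resolver's method signature.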

> Do you want the resolver to be able to override an internal
> subset?

Well, I don't. :-) But it might be a requirement from some application. One
never knows.

> If namespaceAware() = false or there is no namespace declared
> for the document element, namespaceURI would be null.  If there
> is a default namespace, a DTD resolver could just ignore it and
> resolve the DTD based on publicID or systemID.
>
> There really seems to be a multiplicity of things that any
> particular resolver might want to use to resolve the grammar.

What if in the future, there is a new grammar type, which uses
"something-weird" as its key? So I guess a GrammarKey interface would be
better than trying to enumerate all possible ways of keying grammars.

> An xsi:schemaLocation for a namespace would not trigger a call
> to resolveGrammar,

Then will this trigger resolveGrammar?

 <import namespace="somens" schemaLocation="somelocation"/>

If so, there are still two cases where we call resolveGrammar. And if not,
it's possible that the grammar for "somens" is cached and we are not
benefitting from grammar caching.

> Again, resolving based on xsi:schemaLocation or a DTD's
> system ID should typically be inhibited except in desktop
> document development.

What matters is not how many applications need it, but whether there is
any application that needs it. If there is one, we should
provide a way for it. In fact, I've seen many postings about how to
override schema/DTD locations. And it's not hard to support it, since we
already have an EntityResolver.
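
For reference, a location override through the standard SAX EntityResolver
interface can be as small as the sketch below. The class name, the override
table and the URLs in the usage note are made up for the example; whether a
given parser consults the EntityResolver for schema documents as well as DTDs
is an assumption here, not something this sketch guarantees.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Sketch: redirect selected system identifiers to application-chosen locations.
class LocationOverrideResolver implements EntityResolver {

    private final Map<String, String> overrides = new HashMap<>();

    // Register a replacement location for one system identifier.
    void override(String systemId, String replacement) {
        overrides.put(systemId, replacement);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String replacement = overrides.get(systemId);
        // Returning null asks the parser to fall back to its default handling.
        return replacement == null ? null : new InputSource(replacement);
    }
}
```

For example, mapping "http://www.example.org/bar.xsd" to
"file:///schemas/bar.xsd" makes the resolver return an InputSource whose
system identifier is the local file, and any other identifier is left alone.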

Thanks,
Sandy Gao
Software Developer, IBM Canada
(1-416) 448-3255
[EMAIL PROTECTED]


From:    "Arnold, Curt" <[EMAIL PROTECTED]otech.com>
To:      "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
Subject: RE: [Xerces2] How do we want Grammar Caching
Date:    08/21/2001 06:23 PM
Please respond to xerces-j-user

> > However, since A can import B and B import A and neither namespace
> > can be completely defined without having some elements from the
> > other namespace present, I don't believe that you can cleanly
> > separate an arbitrary schema into isolated namespace grammars.
>
> True. But we can't ignore the possibility that A imports B
> and the application wants to cache only B. So we can't put
> declarations from different namespaces in one big grammar.
> They have to be separated by namespace, and in different
> Grammar objects. And in each of such objects, there is a list
> of other grammars that it depends on. This serves both
> purposes:
> - Inter-dependent grammars are always together;
> - Independent grammar can be moved around (cached and
> assigned to other document instance).
>

I think that if namespaces are intertwined then trying to separate them is
very complex and it just seems to cause a lot of edge cases.

> > The GrammarResolver needs a mechanism to say that I don't have a
> > grammar but I don't want you to use schemaLocation either (since it
> > might be DoS attack).
>
> The application can override the location hint, by providing
> a new location. But I didn't think about the case that the
> application doesn't want to resolve this grammar at all.
> Should we provide such flexibility? And how?

I think it is essential.  I'd suggest returning null from resolveGrammar
and having the parser take that as a definitive answer.

>
> > So if you have namespace A which imports HTML Strict and namespace B
> > which imports HTML Frameset, all the schema definitions could
> > coincide without conflict.  If A imports C and B imports C',
> > then the complementary C' definitions could be added to the
> > global element and type map and inconsistent C' definitions
> > could still be accessed through the content model of a B element.
>
> If the intersection of C and C' (global-declared components)
> is empty, then it's solvable: we can combine the declarations
> in the two grammars into one grammar (at least conceptually). But
> - checking whether the intersection is empty needs time (maybe a lot);
> - if the intersection is empty, why not use a new C to
> include both C and C' at the first place?

The A and B grammars may have been loaded independently with different
respected xsi:schemaLocations and the only time the conflict is detected is
when the two namespaces are used together.  You can only prevent the
conflict if you are involved when the schemas are loaded.

> But if the intersection is not empty, it would definitely
> cause trouble, because sometimes content model is not enough,
> and an instance can refer to some global declaration directly
> (wildcards and xsi:type).

I think the trouble can be avoided (or at least resolved within the leeway
provided by the schema spec by treating one of the conflicting definitions
as dominant).

> So I'd like to see the parser has a
> "one-grammar-per-namespace" rule for each instance.

To avoid ever getting two distinct grammars for a namespace into a
document, you really have to extend this to say that a grammar resolver
only has one grammar for every namespace that is imported by
any of its known grammars.


> > I think loading grammar by exception is cleaner and this method is
> > unnecessary.  Say if you have 100 grammars in your pool, do
> > you want to copy all 100 into the document if you are only
> > going to use three?
>
> By "exception", do you mean through "resolveGrammar()"?

Yes

> I can see this type of applications: it knows that all the
> instances it parses use the same 5 grammars. So it wants to
> set these 5 grammars to the parser at once, and don't care
> about any further grammar resolution issues. That is, it
> implements "getInitialGrammarSet()", and ignore all other methods.

If it has a HashMap then the body of resolveGrammar is simply

Grammar resolveGrammar(String nsURI...) {
           return map.get(nsURI);
}

Seems simpler just to implement this one method.

> Well, it's not always true. For example, if a grammar is
> resolved from a disk file, then it's only stored in a list
> local to the parser. "returnFinalGrammarSet()" is the only
> chance to send it back to the application.

resolveGrammar should be in control if a grammar is loaded from a disk
file.  We have probably been missing a parameter like
GrammarResolutionContext that provides services to resolveGrammar.  So a
resolveGrammar call could look like:

Grammar resolveGrammar(String nsURI,..., GrammarResolutionContext context)
{
           Grammar grammar = context.loadSchema(schemaLocationHint);
           myset.add(grammar);
           return grammar;
}

If you do something like:

Grammar resolveGrammar(String nsURI,..., GrammarResolutionContext context)
{
           Grammar grammar = context.loadDTD(publicID, systemID);
           myset.add(grammar);
           return grammar;
}

Then the call to context.loadDTD might trigger the traditional call to
entityResolver.

> >   // ask the application to provide a grammar for
> >   // the given namespace.
> >   public Grammar resolveGrammar(String grammarType,
> >                                 String grammarKey);
>
> > Is grammar type a requirement or a hint?
>
> In Xerces2, there are different validators and grammar
> parsers for different kinds of grammars (DTD, schema, ...),
> so we always know which type of grammar we are expecting,
> hence it's always possible to provide such grammarType parameter.
>
> Yes it does. Since different type of grammars use different
> validators, when the parser asks for a schema grammar, it has
> to get a schema grammar.

Instead of type being a String, would a Class be more appropriate?  Since
the DTDValidator is really expecting an object that implements DTDGrammar
and not just any instance of Grammar.
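
One benefit of a Class parameter is that the lookup can be type-checked. A
minimal sketch, assuming hypothetical DTDGrammar and SchemaGrammar classes and
a plain String key (none of this is an agreed signature):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the requested grammar type is a Class, not a String.
class TypedGrammarLookupSketch {

    interface Grammar {}

    static final class DTDGrammar implements Grammar {}

    static final class SchemaGrammar implements Grammar {}

    private final Map<String, Grammar> cache = new HashMap<>();

    void put(String grammarKey, Grammar grammar) {
        cache.put(grammarKey, grammar);
    }

    // Returns the cached grammar only if it is of the requested type.
    <T extends Grammar> T resolveGrammar(Class<T> grammarType, String grammarKey) {
        Grammar grammar = cache.get(grammarKey);
        // isInstance(null) is false, so a cache miss also yields null.
        return grammarType.isInstance(grammar) ? grammarType.cast(grammar) : null;
    }
}
```

A validator asking for DTDGrammar.class then never receives a schema grammar,
and the caller needs no cast.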

>
> > Is GrammarKey the Namespace URI or something else?
>
> The key has different meaning for different type of grammar.
> For schema, it's the namespace; for external DTD, it's a file
> location; for internal DTD, it's the root element.

For external DTD you may have two distinct keys, publicID and systemID.
For schema, either the namespace and schemaLocation hint (which could share
the same parameter as systemId) might be used to
resolve the grammar.

Do you want the resolver to be able to override an internal subset?

> > Qname can't just be "someprefix:sometag" since you have to have some
> > mechanism to get the namespace URI.
>
> In Xerces, QName reflects all the information: prefix,
> resolved namespace uri, localpart, rawname.

Okay.

>
> > I would have done one resolveGrammar call something like:
> >
> >   public Grammar resolveGrammar(String namespaceURI,
> >                                 String publicId,
> >                                 String schemaLocationHintOrSystemID,
> >                                 String absolutizedSchemaLocationHint,
> >                                 String localName,
> >                                 short nodeType);
>
> My concerns about this method:
>
> 1. The need for grammarType. Many applications need to do
> both DTD and schema validation.

I understand the binding between validator and grammar now.  So I'd add
a Class grammarType parameter.


> 2. The method is only for namespace-aware and namespace-keyed
> grammars. How about DTD? How about a future grammar that
> doesn't use namespace as key?

If namespaceAware() = false or there is no namespace declared for the
document element, namespaceURI would be null.  If there is a default
namespace, a DTD resolver could just ignore it and resolve
the DTD based on publicID or systemID.

There really seems to be a multiplicity of things that any particular
resolver might want to use to resolve the grammar.

A DTD resolver could use the namespace, public id, system id or element
name to determine the appropriate grammar.

A Schema resolver could use the namespace, system id (aka in scope
schemaLocation hint), namespace (or null) and element or attribute name.


> 3. The method mixes the two cases:
>   a. <... xsi:schemaLocation="namespaceURI, locationHint"/>
>   b. <nsprefix:localpart .../>
> You can argue that if localName==null, then it's the first
> case. But an interface is meant to be clean.

I think these are still clean.  An xsi:schemaLocation for a namespace would
not trigger a call to resolveGrammar, only the use of an unrecognized
namespace.  So if I had:

<a:foo xmlns:b="http://www.example.org/bar"
       xsi:schemaLocation="http://www.example.org/bar bar.xsd">
           <a:description/>
           <a:whatever/>
           <b:bar/>
</a:foo>

resolveGrammar would be triggered when processing <b:bar>, however the
in-scope schema location hint would be a useful piece of information.
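
To make the "in-scope hint" concrete: an xsi:schemaLocation value is a
whitespace-separated list of namespace and location pairs, so the parser could
keep a small map and pass the matching hint along when an unrecognized
namespace shows up. The class and method names below are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: turn an xsi:schemaLocation value into namespace -> hint.
class SchemaLocationHintsSketch {

    static Map<String, String> parse(String schemaLocation) {
        Map<String, String> hints = new LinkedHashMap<>();
        String trimmed = schemaLocation.trim();
        if (trimmed.isEmpty()) {
            return hints;
        }
        String[] tokens = trimmed.split("\\s+");
        // Tokens come in (namespace, location) pairs; a trailing odd token is ignored.
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            hints.put(tokens[i], tokens[i + 1]);
        }
        return hints;
    }
}
```

When <b:bar> is reached, looking up "http://www.example.org/bar" in the map
built from the attribute above yields "bar.xsd" as the hint to hand over.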

> 4. You are expecting GrammarResolver to actually parse a
> grammar document. Consider the case of mutually-importing:
> SchemaHandler is trying to solve grammar A, and A imports B,
> so SchemaHandler calls GrammarResolver for B. GrammarResolver
> then decides to parse B, and call SchemaHandler to do so.
> Then SchemaHandler sees that B imports A, and ask for A from
> the GrammarResolver. An infinite loop! My idea is that
> GrammarResolver never tries to parse a grammar document. When
> a method of GrammarResolver is called, we are already in a
> validator or a grammar parser, which knows how to parse
> grammars. So there is no need for GrammarResolver to care
> about it. GrammarResolver is responsible for
> - return a grammar if that grammar is already parsed and cached
> - override a grammar location by returning an InputSource
> And we definitely need two methods for the two cases. This is
> why I have resolveGrammar and resolveGrammarLocation. Having
> a location hint as a parameter of resolveGrammar would lead
> the application to think that it's responsible of grammar parsing.

That seems to be the place where the previously mentioned
GrammarResolutionContext would come in.  I think the GrammarResolver should
be in full control, but it can depend on services provided by the
current parse context.

> > And if Grammar = null is returned, then the parser does not
> > attempt to load the schemaLocationHint.  The simplest implementation
> > would just ignore all parameters but the namespaceURI.
>
> Returning null only means that a grammar for this grammarKey
> is not cached. The parser then needs to resolve it from the
> location, and the application can override that location.

Again, resolving based on xsi:schemaLocation or a DTD's system ID should
typically be inhibited except in desktop document development.


> Well, we certainly can deal with this case. But the
> application might get some surprising and unexplainable errors
> (as I mentioned, wildcards and xsi:type). And I don't see why
> an application would want an instance to be validated against
> two different grammars with the same namespace at the same time.

Partitioning namespaces between schemas seems to be a pretty useful thing
and by the time you realize that you are using namespaces that imported
different partitions of the same namespace, it may be
too late.  If you ever resolve using xsi:schemaLocation, you have this
possibility.

> This "grammarConflict()" method has other uses:
> - the application returns a grammar of one type (DTD)  while
> the parser is asking for another type (schema).
> - the application returns a grammar for one namespace while
> the parser is asking for another namespace.

Those seem like cases where validation should throw an exception (or act
as if GrammarResolver returned null).
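
A sketch of the "throw an exception" option, assuming a hypothetical check the
parser runs on whatever the resolver hands back. The Grammar fields and the
checkResolved name are invented for the example.

```java
// Hypothetical sketch: reject a returned grammar that does not match the request.
class GrammarConflictSketch {

    static final class Grammar {
        final String type;      // e.g. "dtd" or "schema"
        final String namespace; // target namespace of the grammar

        Grammar(String type, String namespace) {
            this.type = type;
            this.namespace = namespace;
        }
    }

    static Grammar checkResolved(Grammar returned, String expectedType, String expectedNamespace) {
        if (returned == null) {
            return null; // nothing was supplied, so there is nothing to check
        }
        if (!expectedType.equals(returned.type)) {
            throw new IllegalArgumentException(
                    "expected a " + expectedType + " grammar but got " + returned.type);
        }
        if (!expectedNamespace.equals(returned.namespace)) {
            throw new IllegalArgumentException(
                    "expected namespace " + expectedNamespace + " but got " + returned.namespace);
        }
        return returned;
    }
}
```

Swapping the two throw statements for "return null" would give the other
behaviour mentioned above, where a mismatch is treated as no answer.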

>
> What's your opinion of whether to derive GrammarResolver from
> EntityResolver, or whether to remove resolveGrammarLocation()
> methods and use EntityResolver instead?
>

I think they should be distinct.  The current EntityResolver would come
into play when a GrammarResolver called context.loadDTD(publicID,systemID)
or something similar.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


