RE: [Xerces2] How do we want Grammar Caching

sandygao 21 Aug 2001 16:53:28 -0000

Hi Curt. Thanks for your interesting reply.

First of all, I want to clarify one thing: the "one-grammar-per-namespace"
rule only applies to one xml instance. That is, if a grammar G is resolved
for a certain namespace N for a certain instance I, then everything in I
with namespace N should be validated using G. This rule has nothing to do
with how the application manages its grammars (and the grammar pool). It can
choose to have 100 grammars for one namespace. The parser won't care.


> However, since A can import B and B import A and neither namespace can be
completely defined without having some elements from the other namespace
present, I don't believe that you can cleanly separate an arbitrary schema
into isolated namespace grammars.

True. But we can't ignore the possibility that A imports B and the
application wants to cache only B. So we can't put declarations from
different namespaces in one big grammar. They have to be separated by
namespace, and in different Grammar objects. Each such object holds a list
of the other grammars that it depends on. This serves two purposes:
- Inter-dependent grammars are always kept together;
- Independent grammars can be moved around (cached and assigned to other
document instances).
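
As a rough illustration of the "list of other grammars" idea (not Xerces2 code; the class NsGrammar and its members are hypothetical names), each per-namespace grammar could carry its direct imports, and the set that must travel together would be the transitive closure of that list:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a per-namespace grammar object; not the Xerces2 class.
final class NsGrammar {
    final String targetNamespace;
    // Grammars this one imports directly (may form cycles, e.g. A <-> B).
    final List<NsGrammar> imports = new ArrayList<>();

    NsGrammar(String targetNamespace) {
        this.targetNamespace = targetNamespace;
    }

    // Returns this grammar plus everything it depends on, directly or
    // indirectly. The visited set makes mutual imports terminate.
    Set<NsGrammar> closure() {
        Set<NsGrammar> seen = new LinkedHashSet<>();
        collect(this, seen);
        return seen;
    }

    private static void collect(NsGrammar g, Set<NsGrammar> seen) {
        if (!seen.add(g)) {
            return; // already visited
        }
        for (NsGrammar dep : g.imports) {
            collect(dep, seen);
        }
    }
}

final class GrammarClosureDemo {
    public static void main(String[] args) {
        NsGrammar a = new NsGrammar("urn:a");
        NsGrammar b = new NsGrammar("urn:b");
        NsGrammar c = new NsGrammar("urn:c");
        a.imports.add(b);
        b.imports.add(a); // A and B import each other
        System.out.println(a.closure().size()); // A and B are cached together
        System.out.println(c.closure().size()); // C is independent
    }
}
```

Caching a grammar would then mean caching its whole closure, while an independent grammar such as C can be handed to another document on its own.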

> The parser shouldn't make this decision.  For instance, if you have two
partial schemas for different aspects of a namespace, for example HTML
frameset and HTML strict.  If one parse instance detects due to the
presence of the <frame> element that the frameset partial schema is
appropriate, it should not have any effect on another parse that has
detected the <body> element.

Agreed. This is why "one-grammar-per-namespace" rule is only for one xml
instance. You can have as many as you want in your application. And even if
you want to access two grammars of the same namespace in one instance, you
can always have a 10-line grammar to include the two subsets.

> The parser should know which grammars have been resolved for the
current document, but shouldn't have any memory between documents.  So the
second time within one document that a particular namespace is requested,
it should use the resource resolved from the first time.

Agreed again. The parser has a local list of grammars that are available to
the current parse (the current document). The "one-grammar-per-namespace"
rule is in fact applied to this list. And the list is released when
parsing finishes (so no memory between documents).

> So the first time that a namespace is encountered, a GrammarResolver call
is made.  Any subsequent use of that namespace should not result in a
GrammarResolver call,

Right, because the grammar is stored in the local list.
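
A minimal sketch of such a per-document list, assuming a hypothetical DocumentGrammars class and a resolver modelled as a plain function from namespace to grammar object (neither is from Xerces2). It also remembers a null answer, which is an extra assumption made here so that a failing namespace is not asked about repeatedly within one document:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-document grammar list; names are illustrative only.
final class DocumentGrammars {
    // One entry per namespace: the "one-grammar-per-namespace" rule.
    private final Map<String, Object> byNamespace = new HashMap<>();
    // Stand-in for the application callback (GrammarResolver in this thread).
    private final Function<String, Object> resolver;
    int resolverCalls = 0;

    DocumentGrammars(Function<String, Object> resolver) {
        this.resolver = resolver;
    }

    Object grammarFor(String namespace) {
        if (byNamespace.containsKey(namespace)) {
            return byNamespace.get(namespace); // no second resolver call
        }
        resolverCalls++;
        Object grammar = resolver.apply(namespace);
        // Assumption: a null result is remembered too, for this document only.
        byNamespace.put(namespace, grammar);
        return grammar;
    }

    // Called when parsing finishes: no memory between documents.
    void endDocument() {
        byNamespace.clear();
    }
}

final class DocumentGrammarsDemo {
    public static void main(String[] args) {
        DocumentGrammars doc = new DocumentGrammars(ns -> new Object());
        doc.grammarFor("urn:n");
        doc.grammarFor("urn:n");
        System.out.println(doc.resolverCalls); // one call for two lookups
    }
}
```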

> The GrammarResolver needs a mechanism to say that I don't have a grammar
but I don't want you to use schemaLocation either (since it might be DoS
attack).

The application can override the location hint, by providing a new
location. But I didn't think about the case that the application doesn't
want to resolve this grammar at all. Should we provide such flexibility?
And how?

> So if you have namespace A which imports HTML Strict and namespace B
which imports HTML Frameset, all the schema definitions could coincide
without conflict.  If A imports C and B imports C', then the complementary
C' definitions could be added to the global element and type map and
inconsistent C' definitions could still be accessed through the content
model of a B element.

If the intersection of C and C' (global-declared components) is empty, then
it's solvable: we can combine the declarations in the two grammars into one
grammar (at least conceptually). But
- checking whether the intersection is empty needs time (maybe a lot);
- if the intersection is empty, why not use a new C that includes both C and
C' in the first place?

But if the intersection is not empty, it would definitely cause trouble,
because sometimes the content model is not enough: an instance can refer
to some global declaration directly (wildcards and xsi:type).

So I'd like the parser to have a "one-grammar-per-namespace" rule for
each instance.

> //***************************************
> public interface GrammarResolver // extends EntityResolver ??
> {
>   // we are trying to make this GrammarResolver
>   // work for all kinds of grammars, so we have
>   // a parameter "grammarType" for each of the
>   // methods: "schema", "dtd:internal",
>   // "dtd:external", etc.

For the following two methods, we can pass an array of Grammar objects
instead of a hashtable, which would be easier to understand. (The key of a
given grammar can be obtained from the grammar itself.)

>   // retrieve the initial known set of grammars
>   // this method is called before the validation
>   // starts. the application can provide an
>   // initial set of grammars that's available
>   // to the current validation attempt.
>   // REVISIT: do we make a copy of the returned
>   // hashtable, or do we add grammars directly
>   // into this returned hashtable? If we choose
>   // the latter, the following
>   // returnFinalGrammarSet() is not necessary.
>   public Hashtable getInitialGrammarSet(String grammarType);

> I think loading grammar by exception is cleaner and this method is
unnecessary.  Say if you have 100 grammars in your pool, do you want to
copy all 100 into the document if  you are only going to use three.

By "exception", do you mean through "resolveGrammar()"?

I can see this type of application: it knows that all the instances it
parses use the same 5 grammars. So it wants to hand these 5 grammars to the
parser at once, and doesn't care about any further grammar resolution
issues. That is, it implements "getInitialGrammarSet()", and ignores all
other methods.

For other types of application (having many grammars cached, but having no
idea of which ones will be used by a certain instance), you can always
ignore "getInitialGrammarSet()", and implement "resolveGrammar()".
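
The two application styles could look roughly like this. It is a sketch only: the Grammar and GrammarResolver interfaces below are cut-down, hypothetical versions of the proposal (using a Grammar[] instead of a Hashtable, as suggested above), and FixedSetResolver, PoolResolver and SimpleGrammar are invented names. For brevity the pool ignores grammarType.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical versions of the types discussed in this thread.
interface Grammar {
    String key();
}

interface GrammarResolver {
    Grammar[] getInitialGrammarSet(String grammarType);
    Grammar resolveGrammar(String grammarType, String grammarKey);
}

final class SimpleGrammar implements Grammar {
    private final String key;
    SimpleGrammar(String key) { this.key = key; }
    public String key() { return key; }
}

// Style 1: the application knows its grammars up front and hands them
// over at once; it never answers individual resolution requests.
final class FixedSetResolver implements GrammarResolver {
    private final Grammar[] grammars;
    FixedSetResolver(Grammar... grammars) { this.grammars = grammars.clone(); }
    public Grammar[] getInitialGrammarSet(String grammarType) {
        return grammars.clone();
    }
    public Grammar resolveGrammar(String grammarType, String grammarKey) {
        return null; // nothing beyond the initial set
    }
}

// Style 2: many cached grammars, looked up only when the parser asks.
final class PoolResolver implements GrammarResolver {
    private final Map<String, Grammar> pool = new HashMap<>();
    void cache(Grammar g) { pool.put(g.key(), g); }
    public Grammar[] getInitialGrammarSet(String grammarType) {
        return new Grammar[0]; // do not copy the whole pool into the document
    }
    public Grammar resolveGrammar(String grammarType, String grammarKey) {
        return pool.get(grammarKey); // null when not cached
    }
}
```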

>
>   // return the final set of grammars
>   // this method is called after the validation
>   // finishes. the application can then choose
>   // to cache some of the returned grammars.
>   public void returnFinalGrammarSet(String grammarType,
>                                     Hashtable grammars);

> Again, I think you had the option to participate in every schema
resolution event and if you wanted the whole list, you had a chance to
assemble it yourself.

Well, it's not always true. For example, if a grammar is resolved from a
disk file, then it's only stored in a list local to the parser.
"returnFinalGrammarSet()" is the only chance to send it back to the
application.

Another solution is to replace this method with another one:
"returnOneResolvedGrammar()", which is used to return one grammar after
it's resolved. But different applications might expect different results
from such a method. For example, if A imports B, then do we call this method
just once for A, or do we call it twice, for A and B? How about the case
where A and B import each other?

>
>   // ask the application to provide a grammar for
>   // the given namespace.
>   public Grammar resolveGrammar(String grammarType,
>                                 String grammarKey);

> Is grammar type a requirement or a hint?

In Xerces2, there are different validators and grammar parsers for
different kinds of grammars (DTD, schema, ...), so we always know which
type of grammar we are expecting, hence it's always possible to provide
such grammarType parameter.

> If Grammar is an abstract interface, does it matter if I return a
Grammar that represents an XML Schema or a Grammar that represents a Relax
NG Grammar?

Yes it does. Since different types of grammars use different validators,
when the parser asks for a schema grammar, it has to get a schema grammar.

> Is GrammarKey the Namespace URI or something else?

The key has a different meaning for each type of grammar. For schema,
it's the namespace; for external DTD, it's a file location; for internal
DTD, it's the root element.

>
>   // ask the application to provide a grammar for
>   // a certain node: the qname of the node plus the
>   // node type: element or attribute.
>   // I don't like this method, but I can see it's
>   // useful in some cases.
>   public Grammar resolveGrammar(String grammarType,
>                                 QName nodeName,
>                                 int nodeType);

> Qname can't just be "someprefix:sometag" since you have to have some
mechanism to get the namespace URI.

In Xerces, QName reflects all the information: prefix, resolved namespace
uri, localpart, rawname.

> I would have done one resolveGrammar call something like:
>
>   public Grammar resolveGrammar(String namespaceURI,
>                                 String publicId,
>                                 String schemaLocationHintOrSystemID,
>                                 String absolutizedSchemaLocationHint,
>                                 String localName,
>                                 short nodeType);

My concerns about this method:

1. The need for grammarType. Many applications need to do both DTD and
schema validation.

2. The method is only for namespace-aware and namespace-keyed grammars. How
about DTD? How about a future grammar that doesn't use namespace as key?

3. The method mixes the two cases:
  a. <... xsi:schemaLocation="namespaceURI locationHint"/>
  b. <nsprefix:localpart .../>
You can argue that if localName==null, then it's the first case. But an
interface is meant to be clean.

4. You are expecting GrammarResolver to actually parse a grammar document.
Consider the case of mutual imports: SchemaHandler is trying to resolve
grammar A, and A imports B, so SchemaHandler calls GrammarResolver for B.
GrammarResolver then decides to parse B, and calls SchemaHandler to do so.
Then SchemaHandler sees that B imports A, and asks the GrammarResolver for
A. An infinite loop!
My idea is that GrammarResolver never tries to parse a grammar document.
When a method of GrammarResolver is called, we are already in a validator
or a grammar parser, which knows how to parse grammars. So there is no need
for GrammarResolver to care about it. GrammarResolver is responsible for
- returning a grammar if that grammar is already parsed and cached
- overriding a grammar location by returning an InputSource
And we definitely need two methods for the two cases. This is why I have
resolveGrammar and resolveGrammarLocation. Having a location hint as a
parameter of resolveGrammar would lead the application to think that it's
responsible for grammar parsing.
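
To make the mutual-import point concrete, here is a small sketch in which the loader owns all parsing and registers a grammar in its local list before following that grammar's imports, so a mutual import finds the entry and stops. SchemaLoaderSketch and its fields are hypothetical, "parsing" is simulated by a map from namespace to imported namespaces, and registering before following imports is one possible way to break the cycle, not something the thread has settled on.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a schema loader that owns parsing, so the
// resolver never has to parse anything itself.
final class SchemaLoaderSketch {
    // Stand-in for schema documents: namespace -> namespaces it imports.
    private final Map<String, List<String>> documents;
    // Grammars known to the current parse, keyed by namespace.
    final Map<String, List<String>> loaded = new HashMap<>();
    int parseCount = 0;

    SchemaLoaderSketch(Map<String, List<String>> documents) {
        this.documents = documents;
    }

    void load(String namespace) {
        if (loaded.containsKey(namespace)) {
            return; // already known, or currently being loaded
        }
        List<String> imports =
            documents.getOrDefault(namespace, Collections.<String>emptyList());
        // Register first, then follow imports: when B imports A back,
        // A is already in the local list and is not parsed again.
        loaded.put(namespace, imports);
        parseCount++;
        for (String imported : imports) {
            load(imported);
        }
    }
}

final class SchemaLoaderDemo {
    public static void main(String[] args) {
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("urn:a", Collections.singletonList("urn:b"));
        docs.put("urn:b", Collections.singletonList("urn:a"));
        SchemaLoaderSketch loader = new SchemaLoaderSketch(docs);
        loader.load("urn:a");
        System.out.println(loader.parseCount); // each document parsed once
    }
}
```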

> And if Grammar = null is returned, then the parser does not attempt to
load the schemaLocationHint.  The simplest implementation would just ignore
all parameters but the namespaceURI.

Returning null only means that a grammar for this grammarKey is not cached.
The parser then needs to resolve it from the location, and the application
can override that location.
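
The order of the two callbacks described here might be sketched as follows. The names LocationAwareResolver, StubResolver and ResolutionFlow are hypothetical, locations are plain strings rather than XMLInputSource objects to keep the example self-contained, and treating a null override as "use the original hint" is an assumption.

```java
// Hypothetical, simplified resolver with the two responsibilities
// discussed in this thread: return a cached grammar, or override a location.
interface LocationAwareResolver {
    Object resolveGrammar(String grammarType, String grammarKey);
    String resolveGrammarLocation(String grammarType, String grammarKey, String hint);
}

// Trivial implementation used only for illustration.
final class StubResolver implements LocationAwareResolver {
    private final Object cached;        // null: nothing cached
    private final String overrideWith;  // null: keep the hint

    StubResolver(Object cached, String overrideWith) {
        this.cached = cached;
        this.overrideWith = overrideWith;
    }
    public Object resolveGrammar(String grammarType, String grammarKey) {
        return cached;
    }
    public String resolveGrammarLocation(String grammarType, String grammarKey, String hint) {
        return overrideWith;
    }
}

final class ResolutionFlow {
    // Returns "cached" when the application supplied a grammar, otherwise
    // "parse:<location>" with the location the parser would load from.
    static String decide(LocationAwareResolver r, String type, String key, String hint) {
        if (r.resolveGrammar(type, key) != null) {
            return "cached";
        }
        // null from resolveGrammar only means "not cached": the parser
        // still resolves the grammar itself, from a possibly overridden location.
        String override = r.resolveGrammarLocation(type, key, hint);
        return "parse:" + (override != null ? override : hint);
    }
}
```

With this shape, an application that caches nothing and overrides nothing leaves the parser loading from the original hint.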

>
>   // the grammar or one of the grammars it depends
>   // on (directly or indirectly) conflicts with a
>   // grammar known to the parser.
>   // the application is asked to provide another
>   // grammar for the same target namespace.
>   // the application can choose to return null,
>   // which has the same effect as returning null
>   // from resolveGrammar().
>   public Grammar grammarConflict(String grammarType,
>                                  Grammar grammar);

> If you do not let a schema add a duplicate element to the
document-scope element and type maps (or alternatively you always check the
first loaded Grammar in preference to secondary Grammars), then conflicts
aren't a problem.

Well, we certainly can deal with this case. But the application might get
some surprising and unexplainable errors (as I mentioned, wildcards and
xsi:type). And I don't see why an application would want an instance to be
validated against two different grammars with the same namespace at the
same time.

This "grammarConflict()" method has other uses:
- the application returns a grammar of one type (DTD)  while the parser is
asking for another type (schema).
- the application returns a grammar for one namespace while the parser is
asking for another namespace.
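
The checks that would lead the parser to call "grammarConflict()" could be sketched like this. TypedGrammar and ConflictCheck are hypothetical names, and the sketch only looks at the returned grammar itself; walking the grammars it depends on is omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical minimal grammar descriptor; not a Xerces2 class.
final class TypedGrammar {
    final String type;      // e.g. "schema" or "dtd:external"
    final String namespace; // null stands for no-namespace

    TypedGrammar(String type, String namespace) {
        this.type = type;
        this.namespace = namespace;
    }
}

final class ConflictCheck {
    // True when the grammar returned by the application cannot be used as is.
    static boolean needsConflictCallback(String requestedType,
                                         String requestedNamespace,
                                         TypedGrammar returned,
                                         Map<String, TypedGrammar> knownByNamespace) {
        if (!Objects.equals(requestedType, returned.type)) {
            return true; // e.g. a DTD returned when a schema was requested
        }
        if (!Objects.equals(requestedNamespace, returned.namespace)) {
            return true; // grammar for the wrong namespace
        }
        // One-grammar-per-namespace: a different grammar is already known.
        TypedGrammar known = knownByNamespace.get(returned.namespace);
        return known != null && known != returned;
    }
}

final class ConflictCheckDemo {
    public static void main(String[] args) {
        Map<String, TypedGrammar> known = new HashMap<>();
        TypedGrammar dtd = new TypedGrammar("dtd:external", "urn:a");
        System.out.println(
            ConflictCheck.needsConflictCallback("schema", "urn:a", dtd, known));
    }
}
```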

>
>   // REVISIT:
>   // the following two methods are used to override
>   // schemaLocation hints. They don't really need to
>   // be here, especially the one for include/redefine.
>   // the application can override the locations by
>   // the old-fashioned EntityResolver.
>
>   // ask the application to override import
>   // schemaLocation.
>   public XMLInputSource resolveGrammarLocation(String grammarType,
>                                                String grammarKey,
>                                                String hint);
>   // ask the application to override include/redefine
>   // schemaLocation.
>   public XMLInputSource resolveGrammarLocation(String grammarType,
>                                                String hint);
> }
>

What's your opinion of whether to derive GrammarResolver from
EntityResolver, or whether to remove resolveGrammarLocation() methods and
use EntityResolver instead?

Cheers,
Sandy Gao
Software Developer, IBM Canada
(1-416) 448-3255
[EMAIL PROTECTED]



                                                                                
                                      
"Arnold, Curt" <[EMAIL PROTECTED]otech.com>
08/20/2001 06:20 PM
Please respond to xerces-j-user

To:      "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
cc:
Subject: RE: [Xerces2] How do we want Grammar Caching
                                      
                                                                                
                                      
                                                                                
                                      



> So my take is that the output should be a set (array) of
> "grammar"s. They serve the same purpose as a big "schema",
> but it's easy to take just one of them and reuse it.

However, since A can import B and B import A and neither namespace can be
completely defined without having some elements from the other namespace
present, I don't believe that you can cleanly
separate an arbitrary schema into isolated namespace grammars.

>
> 2. One/many schema grammar(s) for a namespace (or no-namespace)
>
> While applications want extreme flexibility from the parser,
> should we extend such flexibility to a degree that for a
> given namespace, there could be more than one corresponding
> grammar? The spec doesn't forbid this, but it allows
> (and to some degree, encourages) the parser to have a
> one-grammar-per-namespace rule.

The parser shouldn't make this decision.  For instance, if you have two
partial schemas for different aspects of a namespace, for example HTML
frameset and HTML strict.  If one parse instance detects
due to the presence of the <frame> element that the frameset partial schema
is appropriate, it should not have any effect on another parse that has
detected the <body> element.
>
> For example, a grammar for namespace N is already known to
> the parser, then it sees an element from N. Does the parser
> use that known grammar, or does it ask the application for
> another one? Another example: A imports B and C, then both B
> and C import D. Does the parser ask the application for D
> twice (which might result in two different D's, and note that
> it would be an error if the two D's have element declarations
> of the same name)?

The parser should know which grammars have been resolved for the
current document, but shouldn't have any memory between documents.  So the
second time within one document that a particular
namespace is requested, it should use the resource resolved from the first
time.

You can have two different element declarations of the same name within
one schema document.  You just can't have two with namespace wide scope.
If you set a precedence, that if there are two
namespace-scoped definitions, the one in the first namespace loaded wins,
then there is no need for an error.


> Based on these two points, I'll try to describe what I have
> seen from the requirements, and how they can be implemented.
> But it doesn't mean that this will be the implementation in
> Xerces2. The final design depends on how necessary it is, how
> difficult it is, and how it conforms to / conflicts with
> other standards. In fact, lots of ideas below are from the
> responses of you guys (thanks :-)). I'm just trying to put
> them together.
>
> 1. Schema compiling
>
> If a schema document include/redefine another schema
> document, the application will be given an opportunity to
> override the schemaLocation hint.
>
> If a schema document import a namespace (or no-namespace),
> a. if a grammar for such namespace is already known to the
> parser, then that grammar is used;

Again, I feel the parser should have no schema memory between documents
and that any memory should be in the implementation of the GrammarResolver.
So the first time that a namespace is encountered, a GrammarResolver call
is made.  Any subsequent use of that namespace should not result in a
GrammarResolver call; you definitely want to avoid the case of doing 5000
attempts to resolve a bad schema location during one parse.

> b. otherwise, the application will be given an opportunity to
> provide the grammar object for that namespace;

This should be the typical case.

> c. if a grammar is not provided from (b), then the application
> can override the schemaLocation hint.

The GrammarResolver needs a mechanism to say that I don't have a grammar
but I don't want you to use schemaLocation either (since it might be DoS
attack).


>
> 2. Instance validation against schema
>
> For an element/attribute from a certain namespace,
> a. if a grammar for such namespace is already known to the
> parser, then that grammar is used; b. otherwise, the
> application will be given an opportunity to provide the
> grammar object for that namespace.
>
> For an xsi:schemaLocation/noNamespaceSchemaLocation
> attribute, a. if a grammar for such namespace is already
> known to the parser, then that grammar is used; b. otherwise,
> the application will be given an opportunity to provide the
> grammar object for that namespace; c. if a grammar is not
> provided from (b), then the application can override the
> schemaLocation hint.
>
> 3. DTD compiling
> 4. Instance validation against DTD
>
> I don't know DTD as much as folks do to come up with a good
> algorithm. So I need help from you guys again. We might need
> to consider the following questions. a. How to cache internal
> DTD subset vs. external DTD subset; b. What's the minimum
> unit of DTD grammar to cache? An internal subset? An external
> subset (corresponding to a .DTD file)? Or a group of
> internal/external subsets? c. Any other questions that block
> us from a clean and specific design.
>
>
> Having nailed down the algorithm for schema validation, we
> can start to
> *imagine* some implementation details. (Note: this is just
> "imagination" instead of "decision"; and we need to fill the
> blanks for DTD.)
>
> To the parser, there is a channel through which the parser
> communicates with the application. Naturally, we can have an
> interface with callback methods to serve as such channel (a
> name GrammarResolver seems to be acceptable). I've attached
> an interface prototype of GrammarResolver at the end of the
> message. Some points that are worth pointing out:
>
> 1. According to our earlier discussion, this interface works
> for Schema. There are still open issues:
>  - How does it support caching DTD grammars?
>  - How does it support caching other types of grammars?
>  - Is this interface sufficient for caching any kind of grammar?
>
> 2. Note that the two "resolveGrammarLocation" methods at the
> end seem overlapping with "resolveEntity" method of
> "EntityResolver": they are all used to override a location.
> And in Xerces1, we did use EntityResolver to override schema
> location hints. So it causes confusion when both
> GrammarResolver and EntityResolver are specified: which one
> to use, or which one takes higher priority? To solve this, we
> can drop the two "resolveGrammarLocation" methods, and use
> the EntityResolver all the time, but we are risking losing
> grammar type and, possibly, namespace information. We
> currently lean more to another solution: derive
> GrammarResolver from EntityResolver. So that you only need to
> set one resolver for such purpose, and we can clearly state
> which method(s) (the one from EntityResolver or
> GrammarResolver) takes precedence. Any comments on either of
> the two approaches?
>
> 3. There is still a problem for the design of schema caching.
> Assume grammar A (that is, a grammar with target namespace A)
> is known to the parser, then the parser asks the application
> for grammar B. The application returns grammar B, which
> imports a different grammar A. Now the two A's conflict (we
> assumed a one-grammar-per-namespace rule). To avoid such a
> conflict, in the GrammarResolver interface, we ask the
> application to provide a different grammar B in this case
> (method grammarConflict()).

There is nothing inherently evil about having two distinct schemas for one
namespace; you just don't allow subsequently loaded schemas for the same
namespace to replace any previously loaded namespace-scoped elements or
types.

So if you have namespace A which imports HTML Strict and namespace B which
imports HTML Frameset, all the schema definitions could coincide without
conflict.  If A imports C and B imports C', then the
complementary C' definitions could be added to the global element and type
map and inconsistent C' definitions could still be accessed through the
content model of a B element.

>
> 4. The parser doesn't interact with a grammar pool directly.
> It's the job of the GrammarResolver to get grammars from the
> pool, and give them to the parser. So the caching detail is
> separated from the validation process of the parser, and is
> under full control of the application.

Definitely my favored approach.

>
> 5. Xerces2 will provide a default implementation of
> GrammarResolver, which interacts with a default
> implementation of grammar pool: a. The pool is shared across
> the application; b. The pool is thread safe; c. Put every
> grammar in the pool; d. Always try to get grammars from the
> pool first.

At least one, but I would try to make GrammarResolver intuitive enough that
providing your own isn't a huge deal.

>
> Is this default behavior satisfying? Of course, people can
> always contribute other general-purpose
> GrammarResolver/GrammarPool implementations.
>
> Did I miss anything?
>
> Cheers,
> Sandy Gao
> Software Developer, IBM Canada
> (1-416) 448-3255
> [EMAIL PROTECTED]
>
> //***************************************
> public interface GrammarResolver // extends EntityResolver ??
> {
>   // we are trying to make this GrammarResolver
>   // work for all kinds of grammars, so we have
>   // a parameter "grammarType" for each of the
>   // methods: "schema", "dtd:internal",
>   // "dtd:external", etc.
>
>   // retrieve the initial known set of grammars
>   // this method is called before the validation
>   // starts. the application can provide an
>   // initial set of grammars that's available
>   // to the current validation attempt.
>   // REVISIT: do we make a copy of the returned
>   // hashtable, or do we add grammars directly
>   // into this returned hashtable? If we choose
>   // the latter, the following
>   // returnFinalGrammarSet() is not necessary.
>   public Hashtable getInitialGrammarSet(String grammarType);

I think loading grammar by exception is cleaner and this method is
unnecessary.  Say if you have 100 grammars in your pool, do you want to
copy all 100 into the document if  you are only going to use
three.

>
>   // return the final set of grammars
>   // this method is called after the validation
>   // finishes. the application can then choose
>   // to cache some of the returned grammars.
>   public void returnFinalGrammarSet(String grammarType,
>                                     Hashtable grammars);

Again, I think you had the option to participate in every schema resolution
event and if you wanted the whole list, you had a chance to assemble it
yourself.

>
>   // ask the application to provide a grammar for
>   // the given namespace.
>   public Grammar resolveGrammar(String grammarType,
>                                 String grammarKey);

Is grammar type a requirement or a hint?  If Grammar is an abstract
interface, does it matter if I return a Grammar that represents an XML
Schema or a Grammar that represents a Relax NG Grammar?

Is GrammarKey the Namespace URI or something else?

>
>   // ask the application to provide a grammar for
>   // a certain node: the qname of the node plus the
>   // node type: element or attribute.
>   // I don't like this method, but I can see it's
>   // useful in some cases.
>   public Grammar resolveGrammar(String grammarType,
>                                 QName nodeName,
>                                 int nodeType);

Qname can't just be "someprefix:sometag" since you have to have some
mechanism to get the namespace URI.  I would have done one resolveGrammar
call something like:

    public Grammar resolveGrammar(String namespaceURI,
                                  String publicId,
                                  String schemaLocationHintOrSystemID,
                                  String absolutizedSchemaLocationHint,
                                  String localName,
                                  short nodeType);

And if Grammar = null is returned, then the parser does not attempt to load
the schemaLocationHint.  The simplest implementation would just ignore all
parameters but the namespaceURI.
>
>   // the grammar or one of the grammars it depends
>   // on (directly or indirectly) conflicts with a
>   // grammar known to the parser.
>   // the application is asked to provide another
>   // grammar for the same target namespace.
>   // the application can choose to return null,
>   // which has the same effect as returning null
>   // from resolveGrammar().
>   public Grammar grammarConflict(String grammarType,
>                                  Grammar grammar);

If you do not let a schema add a duplicate element to the
document-scope element and type maps (or alternatively you always check the
first loaded Grammar in preference to secondary Grammars),
then conflicts aren't a problem.


>
>   // REVISIT:
>   // the following two methods are used to override
>   // schemaLocation hints. They don't really need to
>   // be here, especially the one for include/redefine.
>   // the application can override the locations by
>   // the old-fashioned EntityResolver.
>
>   // ask the application to override import
>   // schemaLocation.
>   public XMLInputSource resolveGrammarLocation(String grammarType,
>                                                String grammarKey,
>                                                String hint);
>   // ask the application to override include/redefine
>   // schemaLocation.
>   public XMLInputSource resolveGrammarLocation(String grammarType,
>                                                String hint);
> }
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


