RE: [Xerces2] How do we want Grammar Caching

Arnold, Curt Thu, 09 Aug 2001 14:07:01 -0700
Sandy Gao wrote (in >)

>People are interested in grammar caching, and are asking 
>questions about it. But without knowing what people really want, 
>we couldn't make any decision on how to provide such functionality, 
>and couldn't answer any of the questions. So, instead, allow 
>me to ask this question first: how do you expect Xerces2 
>to provide grammar caching?

My personal take is that it should not be necessary to create any new classes or any 
special pools for caching grammars.  If the validation event routines are done 
appropriately, it will be trivial for the calling application to cache grammars any 
way it wants using the standard map implementations.

> Assuming we have a grammar pool in the system somewhere, and 
> it contains a set of grammars that are already parsed (a set 
> of objects of Grammar class), let's consider the following 
> scenarios. I'll use schema for the examples, and use "grammar 
> A" as a short for "a schema grammar of target namespace A".

A few definitions for my discussion then:

Schema document: An XML file (or equivalent) that adheres to the schema for schemas.  
Defines elements, attributes, complexTypes, etc, in a target namespace but can also 
import other namespaces.

Schema resource: An information set that corresponds to a schema document but also 
includes the schema definitions from the imported namespaces (and the namespaces 
imported by the imported schemas, etc).  So a schema resource can contain definitions 
from multiple namespaces.

Grammar: An abstract base class for a schema resource and a DTD.  This class should be 
stateless (after construction) and thread-safe.

Grammar Validator: A class that is used in the validation of a document that 
references one grammar.

Schema Location hint: The URL corresponding to a namespace from an xsi:schemaLocation 
attribute.  This hint may be willfully ignored by the application since it might be a 
denial of service hack.  It is typically only useful in schema development or IDE 
environments.

Current Grammar: If the definition for an element (or attribute) has been found in a 
particular schema resource and there are multiple definitions for a child element or 
attribute, the current grammar
would be the preferred resolution.

Grammar Map: During parsing, the document validator would maintain a map between 
namespaces and grammar validators.  This map would initially be empty.

There isn't a one-to-one correspondence between namespaces and schema resources (and 
grammars as I have defined them).  One schema resource could define multiple 
namespaces, and the same namespace could be imported independently in multiple schemas 
(and, depending on how it was resolved, could have conflicting definitions).

> 1. When we validate an instance, and see an element from 
> namespace A. To validate such element, we need to find 
> grammar A. How do we find it? (Assume grammar A is not 
> already know to such instance). 

Assuming that nothing has been preinitialized, when the validator encounters the first 
element, it will have no current grammar and an empty grammar map.  It should then 
call an event handler asking
the application to provide an appropriate grammar.  The appropriate grammar may depend 
on the element tag name or attribute name in addition to the namespace (for example, 
html:frame may be in one
grammar and html:html in another).  The instance document may have provided a hint 
with an xsi:schemaLocation that the application may choose to ignore.  The processor 
should provide any hint as an
absolute URI to the event handler.  So the callback should be something like:

Grammar locateGrammar(String nsURI, String name, int nodeType, String nsLocationHint, 
Map grammarMap);

The callback could return null to indicate that no schema resource could be located, 
return an appropriate schema from an arbitrary source, or load a grammar from the 
location hint.  If the event handler returns null, the parser should NOT try to locate 
a schema resource by other means.  (Returning null would not necessarily cause the 
validation to fail, for example if the element was matched by 
<any processContents="lax"/>.)
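
A minimal sketch of such a callback in Java, under some assumptions: the interface 
name GrammarResolver is borrowed from later in this message, Grammar is reduced to a 
placeholder class, and TrustedOnlyResolver and its register method are hypothetical 
names made up for illustration.  It shows one possible policy, ignoring the location 
hint and returning null for anything the application has not registered itself.

```java
import java.util.HashMap;
import java.util.Map;

// Placeholder for a parsed grammar; the real class is not defined in this thread.
final class Grammar {
    final String targetNamespace;
    Grammar(String targetNamespace) { this.targetNamespace = targetNamespace; }
}

// Callback interface using the signature proposed above (generics added for clarity).
interface GrammarResolver {
    Grammar locateGrammar(String nsURI, String name, int nodeType,
                          String nsLocationHint, Map<String, Grammar> grammarMap);
}

// Hypothetical resolver that only serves grammars the application registered.
final class TrustedOnlyResolver implements GrammarResolver {
    // HashMap accepts a null key, so a grammar without a target namespace fits too.
    private final Map<String, Grammar> trusted = new HashMap<>();

    void register(Grammar grammar) {
        trusted.put(grammar.targetNamespace, grammar);
    }

    @Override
    public Grammar locateGrammar(String nsURI, String name, int nodeType,
                                 String nsLocationHint, Map<String, Grammar> grammarMap) {
        // nsLocationHint is deliberately ignored: the instance document does not
        // get to choose which schema it is validated against.
        // A null result tells the parser that no grammar is available.
        return trusted.get(nsURI);
    }
}
```

A resolver that does honor the hint would have the same shape; only the body of 
locateGrammar would differ.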

Once that grammar is located, entries for the namespaces provided by the grammar would 
be added to the GrammarMap and it would become the current grammar.

When the validator needs to locate the schema definition for a child element or 
attribute, it should check the element's definition (for locally defined child 
elements or attributes), then the current grammar, then the grammar map, and only then 
generate a locateGrammar event.
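
The lookup order could be sketched roughly as follows.  This is an illustration under 
assumptions, not a proposed implementation: Grammar is a stand-in with hypothetical 
namespaces() and declares() methods, GrammarLookup is a made-up name, and the first 
step (locally defined children of the parent's type) is left out because it needs the 
element declaration itself.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Stand-in for the Grammar described above: immutable after construction.
final class Grammar {
    private final Set<String> namespaces;
    private final Set<String> declared; // keys of the form "{nsURI}name"

    Grammar(Set<String> namespaces, Set<String> declared) {
        this.namespaces = new HashSet<>(namespaces);
        this.declared = new HashSet<>(declared);
    }

    Set<String> namespaces() { return Collections.unmodifiableSet(namespaces); }

    boolean declares(String nsURI, String name) {
        return declared.contains("{" + nsURI + "}" + name);
    }
}

interface GrammarResolver {
    Grammar locateGrammar(String nsURI, String name, int nodeType,
                          String nsLocationHint, Map<String, Grammar> grammarMap);
}

// Hypothetical per-document lookup state held by the validator.
final class GrammarLookup {
    private final Map<String, Grammar> grammarMap = new LinkedHashMap<>();
    private final GrammarResolver resolver;
    private Grammar current; // null until the first grammar is located

    GrammarLookup(GrammarResolver resolver) { this.resolver = resolver; }

    Grammar find(String nsURI, String name, int nodeType, String hint) {
        // 1. The current grammar is preferred.
        if (current != null && current.declares(nsURI, name)) return current;
        // 2. Then the grammars already in scope for this document.
        Grammar mapped = grammarMap.get(nsURI);
        if (mapped != null && mapped.declares(nsURI, name)) return mapped;
        // 3. Finally ask the application, handing it a read-only view of the map.
        Grammar located = resolver.locateGrammar(nsURI, name, nodeType, hint,
                Collections.unmodifiableMap(grammarMap));
        if (located != null) {
            // Register every namespace the grammar provides; make it current.
            for (String ns : located.namespaces()) grammarMap.putIfAbsent(ns, located);
            current = located;
        }
        return located; // may be null: the parser must not look elsewhere
    }
}
```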

The map parameter's use is described in the next section.  It should be read-only 
(which could be enforced by a thin wrapper around the Map interface that throws 
exceptions on add attempts).
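
In Java, the standard library already provides such a wrapper in 
Collections.unmodifiableMap.  The sketch below (the class and method names are 
hypothetical) shows the behavior: the callback can read what the validator has in 
scope, while any attempt to modify the view throws UnsupportedOperationException.

```java
import java.util.Collections;
import java.util.Map;

final class ReadOnlyGrammarMap {
    // Returns a view, not a copy: later additions made by the validator to its
    // own map are visible through the view, but the view itself rejects writes.
    static <G> Map<String, G> readOnlyView(Map<String, G> validatorMap) {
        return Collections.unmodifiableMap(validatorMap);
    }
}
```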

> 2. When we are parsing grammar A, and it imports grammar B, 
> we need to find grammar B. How do we find it? (Assume grammar 
> B is not already know to such instance). The same choices as 
> those for (1).
> 
> This is slightly different from the one above. Some 
> application might choose different approach for the two 
> cases. One never knows.

Same basic approach: the application should be called with the namespace and an 
absolutized URL for any hint in the document.  However, if the schema is being loaded 
as part of a locateGrammar event, it should be able to import a grammar already in 
scope of the document (hence the Map grammarMap parameter).

<ns1:foo xmlns:ns1="http://www.example.org/ns1" 
xsi:schemaLocation="http://www.example.org/ns1 http://www.example.org/ns1.xsd 
http://www.example.org/ns2 http://www.example.org/ns2.xsd">
        <ns2:bar.../>
</ns1:foo>

For example, the schema resource loaded for ns1 when processing the foo element should 
be available for use (or could be ignored) in resolving schema imports.

> 3. After the parser parses a grammar, how will this grammar 
> be put into the grammar pool? a. The parser put the grammar 
> into the pool automatically. b. The parser just return the 
> grammar to the application, and it's up to the application to 
> decide whether to put such grammar into the pool.

The validator would have a map that is only used for the duration of the validation.  
The application could maintain its own list of grammars in the implementation of the 
grammar resolver if it so
desires.

> 
> Again, which approach is preferred?
> 
> 4. How many grammar pools should there be? And how 
> complicated should the grammar pool(s) be? a. One pool for 
> each application, and it can be as simple as a hashtable. b. 
> One pool for each application, and it must be thread safe. 
> That is, the grammar pool must be able to handle the case 
> where two or more threads try to get/put grammars (possibly 
> of the same namespace) at the same time. c. One pool for each 
> thread, and it can be as simple as a hashtable. d. Dynamic 
> numbers of grammar pools. The application can create as many 
> as it wants, and tell the parser which one to use at a 
> certain occasion.

There could be concrete implementations of the GrammarResolver interface that 
implement each of these behaviors, but the parser should not have to be exposed to it.
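
As a rough illustration of keeping the pooling policy out of the parser, here is a 
sketch of a caching decorator.  CachingResolver is a hypothetical name, Grammar is a 
placeholder, and caching by namespace alone is a simplification.  The pooling behavior 
comes entirely from the Map handed to the constructor: a plain HashMap gives a simple 
single-threaded pool, a ConcurrentHashMap gives a pool that can be shared between 
threads, and the application can create as many instances as it likes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Placeholder for a parsed grammar, assumed immutable and thread-safe.
final class Grammar {
    final String targetNamespace;
    Grammar(String targetNamespace) { this.targetNamespace = targetNamespace; }
}

interface GrammarResolver {
    Grammar locateGrammar(String nsURI, String name, int nodeType,
                          String nsLocationHint, Map<String, Grammar> grammarMap);
}

// Hypothetical decorator: answers from the store, otherwise asks the delegate.
final class CachingResolver implements GrammarResolver {
    private final GrammarResolver delegate;
    private final Map<String, Grammar> store;

    CachingResolver(GrammarResolver delegate, Map<String, Grammar> store) {
        this.delegate = delegate;
        this.store = store;
    }

    @Override
    public Grammar locateGrammar(String nsURI, String name, int nodeType,
                                 String nsLocationHint, Map<String, Grammar> grammarMap) {
        // ConcurrentHashMap rejects null keys, so map "no namespace" to "".
        String key = (nsURI == null) ? "" : nsURI;
        Grammar cached = store.get(key);
        if (cached != null) return cached;
        Grammar loaded = delegate.locateGrammar(nsURI, name, nodeType,
                nsLocationHint, grammarMap);
        if (loaded == null) return null;
        // If another thread stored a grammar first, use that one instead.
        Grammar earlier = store.putIfAbsent(key, loaded);
        return (earlier != null) ? earlier : loaded;
    }

    // Convenience factory for a pool shared between threads.
    static CachingResolver shared(GrammarResolver delegate) {
        return new CachingResolver(delegate, new ConcurrentHashMap<>());
    }
}
```

With a shared store, two threads may still both load the same grammar once; the second 
result is simply discarded.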

> 
> I can come up with two extreme solutions here. Any approach 
> in between could be what's in Xerces2.
> 
> [1] A clean design with less flexibility
> 
> Xerces provides a Grammar pool, which is shared across the 
> application. This grammar pool is thread-safe. The parser 
> gets/puts grammars into/from the grammar pool automatically. 
> It's like we choose "a a a b" for the above four questions. 
> This should be sufficient for many user cases, but the 
> applications won't be able to control how the grammar pool is 
> accessed.
> 
> [2] A flexible design
> 
> Extreme flexibility means we don't assume anything, hence we 
> couldn't implement the grammar pool (because any one 
> implementation might not fulfill some specific case). So we 
> leave the implementation of the grammar pool to the 
> application, and the application can implement it in any way it
> wants: one or more pools, thread-safe or not. Each time an 
> instance document is parsed (or a standalone grammar is 
> parsed), a list of grammars will be returned to (or accessed 
> by) the application. The application can then decide which 
> ones to cache. This is like we choose "b b b d" for the four 
> questions.

[2] is my choice, but we could provide concrete implementations of GrammarResolver for 
common use cases.  That is:

parser.setGrammarResolver(new SchemaLocationHintResolver());

> So please consider: Is [1] enough for our lives? Do we need 
> the flexibility of [2]. Which point between [1] and [2] is 
> most comfortable for us?
> 
> There are other questions about the grammar pool:
> - How do we access grammars in the grammar pool. For schemas, 
> it might be
> easier: we can use the target namespace. How about DTDs and 
> schemas without target namespace?

With a null namespaceURI argument, locateGrammar could still try to choose the 
appropriate grammar based on the element name.  Maybe the grammar map can't be a map 
after all, but a list, so that you could have multiple grammars for the null 
namespace.  Then you would have a priority based on position.  Since this map/list 
would be small, brute force location shouldn't be much more expensive than key lookup.
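
A sketch of the list variant, with hypothetical names throughout (GrammarList, and a 
placeholder Grammar that only knows its target namespace and the element names it 
declares).  Position in the list is the priority, and the lookup is a plain linear 
scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Placeholder grammar: a target namespace plus the names it declares.
final class Grammar {
    private final String targetNamespace; // null means "no target namespace"
    private final Set<String> names;

    Grammar(String targetNamespace, String... names) {
        this.targetNamespace = targetNamespace;
        this.names = new HashSet<>(Arrays.asList(names));
    }

    boolean declares(String nsURI, String name) {
        // Objects.equals treats two nulls as equal, which covers the null namespace.
        return Objects.equals(targetNamespace, nsURI) && names.contains(name);
    }
}

// Hypothetical replacement for the grammar map: an ordered list of grammars.
final class GrammarList {
    private final List<Grammar> inLoadOrder = new ArrayList<>();

    void add(Grammar grammar) { inLoadOrder.add(grammar); }

    // Brute-force scan; the earliest grammar that declares the name wins.
    Grammar find(String nsURI, String name) {
        for (Grammar g : inLoadOrder) {
            if (g.declares(nsURI, name)) return g;
        }
        return null;
    }
}
```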

> - How do we deal with conflicting of grammars (for example, 
> two schema grammars with the same target namespace)?

If the same namespace is imported into two active schemas, the definition in the 
schema that was used to validate the parent element (the current schema) should take 
precedence.  Then I'd let each schema resource get a shot in order of loading.  The 
Schema WG decided to duck this issue since the whole schema resolution issue was 
ambiguous, so they decided that parsers could choose to avoid conflicts.

> 
> But I guess we can answer them after we nail down what's 
> really needed for grammar caching.
> 
> I was trying to prepare a note to describe our thoughts about 
> how we were going to support grammar caching, and some 
> design/implement detail we could think of. But I found it 
> really difficult to say anything before we know what is 
> really desired. And DOM3 is trying to provide its way to do 
> grammar caching, which makes things even worse.
> 
> Anyway, no decision has been made about any aspect of grammar 
> caching. So make a wish! :-)
> 
> Cheers,
> Sandy Gao
> Software Developer, IBM Canada
> (1-416) 448-3255
> [EMAIL PROTECTED]
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
