Hi. Thank you all for the interesting replies. Here are my two concerns:
1. The output of schema compiling
From the schema spec, a schema document can include/import other schema
documents, and the combination of all these schema documents is the final
"schema" that an XML instance is validated against. So it seems reasonable
that the output of schema compiling is such a big "schema". But remember that
the spec only cares about validation, not grammar caching.
As Neil Graham pointed out (and as the spec intended), a grammar (a set of
declarations bound to a certain namespace) is identified by a namespace.
Such a "grammar" should be the minimum unit to be built and cached.
If we put all declarations from different namespaces in one big "schema"
object, then there is no (easy) way to just cache the declarations from a
certain namespace. It's reasonable to expect that if A imports C and B
imports C, then C should be reused instead of being parsed again. And I did
see requirements for such reuse.
So my take is that the output should be a set (array) of "grammar"s. They
serve the same purpose as a big "schema", but it's easy to take just one of
them and reuse it.
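
To make the reuse argument concrete, here is a minimal sketch of a compiler
that returns a list of per-namespace grammars instead of one big "schema".
Everything in it is an assumption for illustration only: the Grammar and
Compiler classes, the parseCount counter, and the map of namespaces to
imports that stands in for real schema documents. It is not Xerces code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GrammarSetSketch {
    /** Hypothetical stand-in for a grammar: the declarations of one namespace. */
    static final class Grammar {
        final String targetNamespace;
        final List<String> imports;

        Grammar(String targetNamespace, List<String> imports) {
            this.targetNamespace = targetNamespace;
            this.imports = imports;
        }
    }

    /** Hypothetical compiler whose output is a set of grammars, one per namespace. */
    static final class Compiler {
        // namespace -> imported namespaces; stands in for real schema documents
        private final Map<String, List<String>> documents;
        private final Map<String, Grammar> cache = new LinkedHashMap<>();
        int parseCount = 0;

        Compiler(Map<String, List<String>> documents) {
            this.documents = documents;
        }

        /** Compiles a namespace and everything it imports, reusing cached grammars. */
        List<Grammar> compile(String namespace) {
            List<Grammar> result = new ArrayList<>();
            collect(namespace, result);
            return result;
        }

        private void collect(String namespace, List<Grammar> out) {
            Grammar grammar = cache.get(namespace);
            if (grammar == null) {
                parseCount++; // only a cache miss costs a parse
                grammar = new Grammar(namespace,
                        documents.getOrDefault(namespace, List.of()));
                cache.put(namespace, grammar);
            }
            if (out.contains(grammar)) {
                return; // already in this result; also guards against import cycles
            }
            out.add(grammar);
            for (String imported : grammar.imports) {
                collect(imported, out);
            }
        }
    }

    public static void main(String[] args) {
        Compiler compiler = new Compiler(Map.of(
                "A", List.of("C"),
                "B", List.of("C"),
                "C", List.of()));
        List<Grammar> a = compiler.compile("A");
        List<Grammar> b = compiler.compile("B");
        System.out.println("same C instance: " + (a.get(1) == b.get(1)));
        System.out.println("parses: " + compiler.parseCount);
    }
}
```

With a single big "schema" object per compile, the C declarations would be
buried inside two unrelated objects; with a list, C is an object of its own
that both results can share.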
2. One/many schema grammar(s) for a namespace (or no-namespace)
While applications want extreme flexibility from the parser, should we
extend such flexibility to the degree that, for a given namespace, there
could be more than one corresponding grammar? Although the spec doesn't
forbid this, it allows (and to some degree encourages) the parser to have a
one-grammar-per-namespace rule.
For example, suppose a grammar for namespace N is already known to the
parser, and then it sees an element from N. Does the parser use that known
grammar, or does
it ask the application for another one? Another example: A imports B and C,
then both B and C import D. Does the parser ask the application for D twice
(which might result in two different D's, and note that it would be an
error if the two D's have element declarations of the same name)?
Of course, the parser can manage to handle all such cases, but it makes the
parser unnecessarily complicated. If you really need the two D's, you can
always have a simple schema to include both of them.
This is just from the parser's point of view. The application can have any
number of grammars for a certain namespace, and manage them in any way it
wants. But only one of all such grammars can be seen by a parser instance.
Based on these two points, I'll try to describe what I have seen from the
requirements, and how they can be implemented. But it doesn't mean that
this will be the implementation in Xerces2. The final design depends on how
necessary it is, how difficult it is, and how it conforms to / conflicts
with other standards. In fact, lots of ideas below are from the responses
of you guys (thanks :-)). I'm just trying to put them together.
1. Schema compiling
If a schema document includes/redefines another schema document, the
application will be given an opportunity to override the schemaLocation
hint.
If a schema document imports a namespace (or no-namespace),
a. if a grammar for that namespace is already known to the parser, then
that grammar is used;
b. otherwise, the application will be given an opportunity to provide the
grammar object for that namespace;
c. if a grammar is not provided from (b), then the application can override
the schemaLocation hint.
2. Instance validation against schema
For an element/attribute from a certain namespace,
a. if a grammar for that namespace is already known to the parser, then
that grammar is used;
b. otherwise, the application will be given an opportunity to provide the
grammar object for that namespace.
For an xsi:schemaLocation/noNamespaceSchemaLocation attribute,
a. if a grammar for that namespace is already known to the parser, then
that grammar is used;
b. otherwise, the application will be given an opportunity to provide the
grammar object for that namespace;
c. if a grammar is not provided from (b), then the application can override
the schemaLocation hint.
3. DTD compiling
4. Instance validation against DTD
I don't know DTDs well enough to come up with a good algorithm, so I
need help from you guys again. We might need to consider the following
questions.
a. How to cache internal DTD subset vs. external DTD subset;
b. What's the minimum unit of DTD grammar to cache? An internal subset? An
external subset (corresponding to a .DTD file)? Or a group of
internal/external subsets?
c. Any other questions that block us from a clean and specific design.
Having nailed down the algorithm for schema validation, we can start to
*imagine* some implementation details. (Note: this is just *imagination*,
not a *decision*; and we still need to fill in the blanks for DTD.)
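
As a starting point for that imagination, the a/b/c lookup order described
above for schema imports and for instance validation could be sketched as
follows. This is only a sketch under assumptions: the Grammar class, the
simplified two-method Resolver callback, and the "loaded from" string that
stands in for actually parsing a schema document are all made up for
illustration. Passing a null hint models the element/attribute case, which
has no step (c).

```java
import java.util.HashMap;
import java.util.Map;

final class GrammarLookupSketch {
    /** Hypothetical stand-in for a compiled grammar. */
    static final class Grammar {
        final String namespace;
        final String origin;

        Grammar(String namespace, String origin) {
            this.namespace = namespace;
            this.origin = origin;
        }
    }

    /** Hypothetical, simplified application callback. */
    interface Resolver {
        /** Step (b): may return null if the application has no grammar. */
        Grammar resolveGrammar(String namespace);

        /** Step (c): may return null to keep the original hint. */
        String resolveGrammarLocation(String namespace, String hint);
    }

    // One grammar per namespace; HashMap accepts a null key for no-namespace.
    private final Map<String, Grammar> known = new HashMap<>();
    private final Resolver resolver;

    GrammarLookupSketch(Resolver resolver) {
        this.resolver = resolver;
    }

    Grammar lookup(String namespace, String hint) {
        Grammar grammar = known.get(namespace);
        if (grammar != null) {
            return grammar; // (a) already known: never ask again
        }
        grammar = resolver.resolveGrammar(namespace); // (b) ask the application
        if (grammar == null && hint != null) {
            // (c) let the application override the schemaLocation hint
            String override = resolver.resolveGrammarLocation(namespace, hint);
            String location = (override != null) ? override : hint;
            grammar = new Grammar(namespace, "loaded from " + location);
        }
        if (grammar != null) {
            known.put(namespace, grammar);
        }
        return grammar;
    }

    public static void main(String[] args) {
        Resolver resolver = new Resolver() {
            public Grammar resolveGrammar(String namespace) {
                return "urn:b".equals(namespace)
                        ? new Grammar(namespace, "application") : null;
            }

            public String resolveGrammarLocation(String namespace, String hint) {
                return "cache/" + hint;
            }
        };
        GrammarLookupSketch parser = new GrammarLookupSketch(resolver);
        System.out.println(parser.lookup("urn:b", "b.xsd").origin);
        System.out.println(parser.lookup("urn:d", "d.xsd").origin);
        // Second request for the same namespace is answered from the known set.
        System.out.println(parser.lookup("urn:d", "other.xsd").origin);
    }
}
```

In the "A imports B and C, both import D" example, the second import of D
stops at step (a), so the application is asked for D only once.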
The parser needs a channel through which it communicates with the
application. Naturally, we can have an interface with callback methods
serve as that channel (the name GrammarResolver seems to be acceptable).
I've attached an interface prototype of GrammarResolver at the end of the
message. Some points that are worth pointing out:
1. According to our earlier discussion, this interface works for Schema.
There are still open issues:
- How does it support caching DTD grammars?
- How does it support caching other types of grammars?
- Is this interface sufficient for caching any kind of grammar?
2. Note that the two "resolveGrammarLocation" methods at the end seem to
overlap with the "resolveEntity" method of "EntityResolver": they are all
used to override a location. And in Xerces1, we did use EntityResolver to
override schema location hints. So it causes confusion when both
GrammarResolver and EntityResolver are specified: which one to use, or
which one takes higher priority? To solve this, we can drop the two
"resolveGrammarLocation" methods, and use the EntityResolver all the time,
but we risk losing the grammar type and, possibly, the namespace
information. We currently lean more toward another solution: derive
GrammarResolver from EntityResolver, so that you only need to set one
resolver for this purpose, and we can clearly state which method(s) (the
one from EntityResolver or GrammarResolver) takes precedence. Any comments
on either of the two approaches?
3. There is still a problem for the design of schema caching. Assume
grammar A (that is, a grammar with target namespace A) is known to the
parser, then the parser asks the application for grammar B. The application
returns grammar B, which imports a different grammar A. Now the two A's
conflict (we assumed a one-grammar-per-namespace rule). To avoid such a
conflict, in the GrammarResolver interface, we ask the application to
provide a different grammar B in this case (method grammarConflict()).
4. The parser doesn't interact with a grammar pool directly. It's the job
of the GrammarResolver to get grammars from the pool, and give them to the
parser. So the caching detail is separated from the validation process of
the parser, and is under full control of the application.
5. Xerces2 will provide a default implementation of GrammarResolver, which
interacts with a default implementation of grammar pool:
a. The pool is shared across the application;
b. The pool is thread safe;
c. Put every grammar in the pool;
d. Always try to get grammars from the pool first.
Is this default behavior satisfactory? Of course, people can always
contribute other general-purpose GrammarResolver/GrammarPool
implementations.
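
For discussion, the default behavior in (a) to (d) could look roughly like
the sketch below. The class names GrammarPool and PoolingResolver, the
simplified Grammar class, and the choice of ConcurrentHashMap are my
assumptions for illustration; the resolver here has only the two methods
needed to show the idea, not the full prototype.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class DefaultPoolSketch {
    /** Hypothetical stand-in for a compiled grammar. */
    static final class Grammar {
        final String namespace; // null means no-namespace

        Grammar(String namespace) {
            this.namespace = namespace;
        }
    }

    /** (a) one shared instance, (b) thread safe through ConcurrentHashMap. */
    static final class GrammarPool {
        private static final GrammarPool SHARED = new GrammarPool();
        private final ConcurrentMap<String, Grammar> grammars =
                new ConcurrentHashMap<>();

        static GrammarPool shared() {
            return SHARED;
        }

        // ConcurrentHashMap rejects null keys, so no-namespace maps to "".
        private static String key(String grammarType, String namespace) {
            return grammarType + '\u0000' + (namespace == null ? "" : namespace);
        }

        Grammar get(String grammarType, String namespace) {
            return grammars.get(key(grammarType, namespace));
        }

        /** Keeps the first grammar stored for a key and returns the pooled one. */
        Grammar putIfAbsent(String grammarType, Grammar grammar) {
            Grammar previous =
                    grammars.putIfAbsent(key(grammarType, grammar.namespace), grammar);
            return previous != null ? previous : grammar;
        }
    }

    /** Hypothetical default resolver sitting between the parser and the pool. */
    static final class PoolingResolver {
        private final GrammarPool pool;

        PoolingResolver(GrammarPool pool) {
            this.pool = pool;
        }

        /** (d) always try the pool first; null means "not cached". */
        Grammar resolveGrammar(String grammarType, String namespace) {
            return pool.get(grammarType, namespace);
        }

        /** (c) put every grammar in the pool once validation finishes. */
        void returnFinalGrammarSet(String grammarType, Iterable<Grammar> grammars) {
            for (Grammar grammar : grammars) {
                pool.putIfAbsent(grammarType, grammar);
            }
        }
    }

    public static void main(String[] args) {
        PoolingResolver resolver = new PoolingResolver(GrammarPool.shared());
        Grammar c = new Grammar("urn:c");
        System.out.println(resolver.resolveGrammar("schema", "urn:c") == null);
        resolver.returnFinalGrammarSet("schema", List.of(c));
        System.out.println(resolver.resolveGrammar("schema", "urn:c") == c);
    }
}
```

Using putIfAbsent means the first grammar cached for a namespace wins, which
matches the one-grammar-per-namespace rule; whether a later grammar should
ever replace a pooled one is still an open question.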
Did I miss anything?
Cheers,
Sandy Gao
Software Developer, IBM Canada
(1-416) 448-3255
[EMAIL PROTECTED]
//***************************************
import java.util.Hashtable;

public interface GrammarResolver // extends EntityResolver ??
{
    // we are trying to make this GrammarResolver
    // work for all kinds of grammars, so we have
    // a parameter "grammarType" for each of the
    // methods: "schema", "dtd:internal",
    // "dtd:external", etc.

    // retrieve the initial known set of grammars.
    // this method is called before the validation
    // starts. the application can provide an
    // initial set of grammars that's available
    // to the current validation attempt.
    // REVISIT: do we make a copy of the returned
    //          hashtable, or do we add grammars directly
    //          into this returned hashtable? If we choose
    //          the latter, the following
    //          returnFinalGrammarSet() is not necessary.
    public Hashtable getInitialGrammarSet(String grammarType);

    // return the final set of grammars.
    // this method is called after the validation
    // finishes. the application can then choose
    // to cache some of the returned grammars.
    public void returnFinalGrammarSet(String grammarType,
                                      Hashtable grammars);

    // ask the application to provide a grammar for
    // the given namespace.
    public Grammar resolveGrammar(String grammarType,
                                  String grammarKey);

    // ask the application to provide a grammar for
    // a certain node: the qname of the node plus the
    // node type: element or attribute.
    // I don't like this method, but I can see it's
    // useful in some cases.
    public Grammar resolveGrammar(String grammarType,
                                  QName nodeName,
                                  int nodeType);

    // the grammar or one of the grammars it depends
    // on (directly or indirectly) conflicts with a
    // grammar known to the parser.
    // the application is asked to provide another
    // grammar for the same target namespace.
    // the application can choose to return null,
    // which has the same effect as returning null
    // from resolveGrammar().
    public Grammar grammarConflict(String grammarType,
                                   Grammar grammar);

    // REVISIT:
    // the following two methods are used to override
    // schemaLocation hints. They don't really need to
    // be here, especially the one for include/redefine.
    // the application can override the locations by
    // the old-fashioned EntityResolver.

    // ask the application to override import
    // schemaLocation.
    public XMLInputSource resolveGrammarLocation(String grammarType,
                                                 String grammarKey,
                                                 String hint);

    // ask the application to override include/redefine
    // schemaLocation.
    public XMLInputSource resolveGrammarLocation(String grammarType,
                                                 String hint);
}
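
To illustrate when grammarConflict() would be called, here is a rough sketch
of the check the parser could run on a grammar handed back by the
application: walk the grammar and everything it imports, and report a
conflict if any namespace is already bound to a different grammar object.
The Grammar class with its explicit import list, the ConflictHandler
callback, and the "ask only once" policy are assumptions made for this
example, not part of the prototype above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class GrammarConflictSketch {
    /** Hypothetical grammar that knows the grammars it imports. */
    static final class Grammar {
        final String namespace;
        final List<Grammar> imports = new ArrayList<>();

        Grammar(String namespace, Grammar... imports) {
            this.namespace = namespace;
            Collections.addAll(this.imports, imports);
        }
    }

    /** Hypothetical, simplified form of the grammarConflict() callback. */
    interface ConflictHandler {
        Grammar grammarConflict(Grammar rejected);
    }

    /** True if the candidate, or anything it depends on, clashes with a known grammar. */
    static boolean conflicts(Map<String, Grammar> known, Grammar candidate) {
        Deque<Grammar> pending = new ArrayDeque<>();
        Set<Grammar> seen =
                Collections.newSetFromMap(new IdentityHashMap<Grammar, Boolean>());
        pending.push(candidate);
        while (!pending.isEmpty()) {
            Grammar grammar = pending.pop();
            if (!seen.add(grammar)) {
                continue; // already checked; also guards against import cycles
            }
            Grammar existing = known.get(grammar.namespace);
            if (existing != null && existing != grammar) {
                return true; // same namespace, different grammar object
            }
            for (Grammar imported : grammar.imports) {
                pending.push(imported);
            }
        }
        return false;
    }

    /** Asks the application once for a replacement; null means "no grammar". */
    static Grammar accept(Map<String, Grammar> known, Grammar candidate,
                          ConflictHandler handler) {
        if (candidate != null && conflicts(known, candidate)) {
            candidate = handler.grammarConflict(candidate);
            if (candidate != null && conflicts(known, candidate)) {
                return null; // assumption: give up after one retry
            }
        }
        return candidate;
    }

    public static void main(String[] args) {
        Grammar knownA = new Grammar("urn:a");
        Map<String, Grammar> known = new HashMap<>();
        known.put(knownA.namespace, knownA);

        Grammar badB = new Grammar("urn:b", new Grammar("urn:a"));
        Grammar goodB = new Grammar("urn:b", knownA);
        Grammar accepted = accept(known, badB, rejected -> goodB);
        System.out.println("accepted replacement: " + (accepted == goodB));
    }
}
```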