comments from IBM Xerces-J developers on JAXP 1.3

Neil Graham Thu, 15 Jan 2004 20:30:29 -0800



Hello,

The following comments represent a consensus of the developers from IBM involved with 
Xerces-J.

Version.java, AbstractVersion.java, VersionImpl.java:

The design of the means of querying version information about JAXP causes
us considerable concern.  As currently specified, we're worried that the
information returned by these classes may not correspond with the objects
actually returned by the various JAXP factories, because of the differences
between the factory mechanisms employed here and those in the other
factories.  Other problems with the design include the fact that,
often--and in the reference implementation in particular--different
products implement the transform and parser portions of the API.  Hence,
it's far from clear what a Version.getImplementationTitle should return in
the current framework:  should it be Xerces or Xalan for the reference
implementation?

We propose that a getVersion method should be included on each of the
objects manufacturable by JAXP (e.g., DocumentBuilder and SAXParser).  This
way, each component of the underlying implementation could specify
independently what vendor implemented it and what version of JAXP it
implements.  This would allow the elimination of both VersionImpl and
AbstractVersion, cleaning up the API and making the currently somewhat
inscrutable versioning mechanism easy for all to understand.
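
A minimal sketch of the shape we have in mind, in Java.  The names
ComponentVersion, SimpleVersion, SketchParser and SketchTransformer are
hypothetical and invented purely for illustration; nothing like them
exists in the current draft, and the version strings are placeholders.

```java
// Hypothetical sketch only: none of these types are part of JAXP.
class VersionSketch {
    // Formats what a single manufactured object reports about itself.
    static String describe(ComponentVersion v) {
        return v.getImplementationTitle() + " (JAXP " + v.getSpecificationVersion() + ")";
    }

    public static void main(String[] args) {
        // Each object answers for itself, so parser and transformer can differ.
        System.out.println(describe(new SketchParser().getVersion()));
        System.out.println(describe(new SketchTransformer().getVersion()));
    }
}

// Assumed minimal version interface, exposed by every manufactured object.
interface ComponentVersion {
    String getImplementationTitle();   // product that implements this object
    String getSpecificationVersion();  // JAXP level this object implements
}

final class SimpleVersion implements ComponentVersion {
    private final String title;
    private final String specVersion;

    SimpleVersion(String title, String specVersion) {
        this.title = title;
        this.specVersion = specVersion;
    }

    public String getImplementationTitle() { return title; }
    public String getSpecificationVersion() { return specVersion; }
}

// Stand-in for a parser-side object such as a DocumentBuilder or SAXParser.
class SketchParser {
    ComponentVersion getVersion() { return new SimpleVersion("Xerces", "1.3"); }
}

// Stand-in for a transform-side object.
class SketchTransformer {
    ComponentVersion getVersion() { return new SimpleVersion("Xalan", "1.3"); }
}
```

With this shape, the question of whether the reference implementation
is "Xerces or Xalan" never arises, because no single object has to
answer for both.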

We also think that not all the methods on the current Version class are
necessary.  For instance, we aren't aware of any compelling reason to
preserve getExtensionName.  getImplementationVendorId() seems even more
difficult to justify.

SecureProcessing:

We believe that this class is not useful.  It's clear that there are
certain constructs in XML that may cause problems for certain parser and/or
transformer implementations.  But it is also clear that this set will vary
from processor to processor, and hence, it simply isn't possible to provide
a generic, customizable means for applications to set all parameters
relating to security in all implementations.  Therefore, we don't see any
value in cluttering the API with an additional class; rather, we believe
that, in any class where certain implementations may exhibit security
problems for certain XML constructs, a feature should be provided to force
the implementation to process those constructs in a secure, if
non-conformant, manner.  The precise semantics of this feature will
necessarily have to vary from implementation to implementation; but this is
inescapable given the purpose at issue.

The problems that can result are manifested in the current state of the
code:
Entity expansion limit and methods which handle it are underspecified in
the spec.  In Xerces we define it as "the number of entity expansions that
the parser should permit in the document".  JAXP provides no definition.
Its meaning shouldn't be left to the imagination.
The class's default constructor sets 'reasonable' processing values.  What
is "reasonable" will surely vary with the implementation--hence, the spec
itself cannot make any claims to provide this in a default class that's
part of the API and knows nothing of any implementation details; nor can
any application that wishes to remain implementation-agnostic meaningfully
set this field.  The entity expansion limit in the spec is 100.  Xerces' is
currently 100,000.  Doubtless there are many reasonable documents with more
than 100 entity references which don't consume an unreasonable amount of
resources in the vast majority of implementations.  Feedback from other
implementors leads us to believe that the maxOccurs exploit for an element
in a schema is really not inherent to the processing of schema, but is
instead the product of implementation choices.  Hence, the spec should not
provide any means to limit this that all implementations are then required
to support.
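
To give a feel for the entity-count point above, here is a small Java
sketch that builds a document with 150 references to a one-character
internal entity and parses it with whatever DocumentBuilder the
platform supplies.  The class and method names are ours and purely
illustrative; we assume the platform parser accepts a DOCTYPE and
permits more than 150 expansions by default, which is the case for the
implementations we know of but is not guaranteed by any spec.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityCountSketch {
    // Builds a tiny document whose body is `refs` references to one short entity.
    static String buildDocument(int refs) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE doc [<!ENTITY e \"x\">]><doc>");
        for (int i = 0; i < refs; i++) {
            sb.append("&e;");
        }
        sb.append("</doc>");
        return sb.toString();
    }

    // Parses the document and returns the length of the expanded text content.
    static int expandedLength(int refs) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(buildDocument(refs))));
        return doc.getDocumentElement().getTextContent().length();
    }

    public static void main(String[] args) throws Exception {
        // 150 references, each expanding to a single character.
        System.out.println(expandedLength(150));
    }
}
```

A document like this is well under a kilobyte once expanded, yet a
limit of 100 would presumably cause it to be rejected.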

In any event, it should also be pointed out that in the description of
javax.xml.SecureProcessing, occurrence and occurrences are misspelled in
several places as occurance and occurances, and maxOccurs is incorrectly
referred to as maxOccur.

XMLConstants:  XML_DTD_NS_URI:  change description to say what DOM level
3 does in a similar context (see
http://www.w3.org/TR/2003/CR-DOM-Level-3-Core-20031107/core.html#parameter-schema-type),
 rather than speaking of arbitrary values.

XMLUtils:

Overall, we think this class is not useful and should be eliminated.  The
need to deal with both XML 1.0 and 1.1 in this class makes it unwieldy;
but, why are NCNames called out specifically?  Is it not just as reasonable
(or, we think, no less likely) that a user would be interested in Names,
NameStartChars etc.?

javax.xml.validation:

Our feeling is that this specification needs to confine itself to APIs,
and that it should leave implementation details entirely up to the
implementors.  Even if it appears to the authors that all implementations
will need to do the same thing in particular places, this should still be
left to implementations to encourage the maximum amount of innovation.

With this in mind, we believe AbstractSchema--and the package-protected
classes that it relies upon--should be removed.  By the same token, it
seems to us that the existing factory mechanism is far too complex, and
that the model used in the XPath package is much more appropriate.  That
is, SchemaFactoryFinder should be made private and SchemaFactoryLoader
should be removed.

We also noticed that newInstance methods are not consistently labelled
final:  While XPathFactory.newInstance() is labeled final,
SchemaFactory.newInstance() is not.

Turning to the TypeInfoProvider class:  getElementTypeInfo() states that
"the caller can keep references to the returned TypeInfo longer than the
callback scope".  We believe that this puts the performance onus the wrong
way round:  The primary use for this interface, we feel, is in the SAX
world; the tradition in SAX is that applications should copy the
information they wish, since all objects passed to the application are
owned by the parser.  Applications wishing to generate a DOM from these
callbacks will need to pay the cost of building the DOM, and we think that
obliging them to copy information from the TypeInfo objects so the parser
may reuse them is a comparatively small price to pay.
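
A rough Java sketch of the copy-what-you-need style we are describing.
It assumes the ValidatorHandler, TypeInfoProvider and SchemaFactory
signatures as we understand them from javax.xml.validation, and the
class name, schema and helper method are made up for the example.  The
application copies the type name and namespace into its own strings
inside the callback and keeps no reference to the TypeInfo object.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.w3c.dom.TypeInfo;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class TypeInfoCopySketch {
    // Illustrative schema: a single string-typed root element.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='root' type='xs:string'/></xs:schema>";

    // Validates `xml` and returns one "localName:{typeNamespace}typeName" entry per element.
    static List<String> collectTypes(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        final ValidatorHandler vh = schema.newValidatorHandler();
        final List<String> seen = new ArrayList<String>();

        vh.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                TypeInfo info = vh.getTypeInfoProvider().getElementTypeInfo();
                // Copy the strings now; do not hold on to `info` past this callback.
                String type = (info == null)
                        ? "unknown"
                        : "{" + info.getTypeNamespace() + "}" + info.getTypeName();
                seen.add(local + ":" + type);
            }
        });

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(vh);  // parser events flow through the validator
        reader.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectTypes("<root>hello</root>"));
    }
}
```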

The DOM level 3 spec also defines isId() on Attr nodes; javadocs for the
isIdAttribute() method should refer to this information.  Also, the javadoc
should probably state that isId can only be understood with reference to
whatever grammar specification was used in the production of the given
Schema.

It seems many places in this package say "should" or "should be null" when
"must" would be preferable.
For example, SchemaFactory's protected constructor says that "derived
classes *should* create SchemaFactory objects that have [a] null
ErrorHandler and null LSResourceResolver"; clearly it would be best for
interoperability if this were changed to "must".
The same holds for Validator and ValidatorHandler.

We have a strong sense that users will be surprised by the draconian
error handling behaviour when no error handling implementation is attached
to a Validator, since this behaviour differs from that specified in the
rest of the API. This should be called out emphatically in the spec.
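
The difference can be seen in a few lines of Java.  This is only a
sketch: the class name, schema and instance document are invented for
the example, and we assume that a Validator accepts a StreamSource.
Without an ErrorHandler the first validity error surfaces as an
exception; with one, errors are delivered to the handler and
validate() returns normally.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DraconianValidatorSketch {
    // Illustrative schema and an instance that violates it.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='n' type='xs:int'/></xs:schema>";
    static final String INVALID = "<n>not-a-number</n>";

    static Validator newValidator() throws SAXException {
        return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator();
    }

    // No ErrorHandler registered: returns true if validate() threw.
    static boolean throwsWithoutHandler() throws IOException, SAXException {
        Validator validator = newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(INVALID)));
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    // ErrorHandler registered: returns how many errors it was handed.
    static int errorsWithHandler() throws IOException, SAXException {
        final List<SAXParseException> errors = new ArrayList<SAXParseException>();
        Validator validator = newValidator();
        validator.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { errors.add(e); }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        validator.validate(new StreamSource(new StringReader(INVALID)));
        return errors.size();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("throws without handler: " + throwsWithoutHandler());
        System.out.println("errors with handler:    " + errorsWithHandler());
    }
}
```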

3.3.10.1:  should specify what happens if we end up having two
schemas with same targetNamespace.  Similarly, the behaviour to be expected
for this edge-case should be spelled out for the JAXP 1.2 properties.
Additionally, whether the ordering of the schema documents themselves is
significant should be specified.  That is, if schema A with target
namespace tnsA imports schema B with target namespace tnsB, does an
implementation's behaviour have to be identical if they are specified as
{A,B} and {B,A}?  Currently in Xerces, the latter will cause the preparsed
B to be used; the former will generate a request for B to be resolved.
Also, reference to 4.2 of the schema spec should be dropped since
that section does not discuss this kind of processor- (or processor API-)
defined behaviour.

The same section has text to the effect that if no error handler is
registered, no error will be reported to the error handler and an exception
gets thrown.  This circularity should be cleaned up.

Near the beginning of the ValidatorHandler javadoc, it is stated that
"[s]imilarly, the user-specified callback will receive non-null strings for
all three parameters".  It is ambiguous whether the "user-specified
callback" is the ContentHandler registered on the ValidatorHandler, or
whether it's the callback that is made to the ValidatorHandler.

It seems that the ValidatorHandler's behaviour when the namespace-prefixes
feature is set to false is underspecified.  Will namespace attributes be
removed if they're
present?  If they are removed, then the proscription that a
"ValidatorHandler may not remove attributes that were present in the input"
should be modified so that namespace attributes are specifically not
counted as attributes.  Are they added when necessary and the
start/endPrefixMapping callbacks are still not issued?

Namespace attributes, if sent, would be redundant; it seems that the input
a ValidatorHandler can expect should be clarified.  Our suggestion is that
applications calling ValidatorHandlers should pass namespace information
in start/endPrefixMapping methods.

8.3.7:  mentions supporting DTDs; "must" should not be used in this
context, since DTD support is not required by JAXP 1.3 in the validation
API.

ValidatorHandler.isValidSoFar():  We believe that an application should be
able to determine this trivially--especially with the very strong incentive
provided for the registration of custom error reporters.  It does not make
sense to oblige parsers to always compute something that only a small
number of applications will need, when it would be trivial for the
applications to perform the computation.
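
By way of illustration, an application-side error handler along the
following lines would do.  ValidityTracker is a hypothetical name of
our own; the only API it relies on is the standard SAX ErrorHandler
interface.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical application class: remembers whether any error has been reported.
class ValidityTracker implements ErrorHandler {
    private boolean validSoFar = true;

    public void warning(SAXParseException e) {
        // Warnings do not affect validity in this sketch.
    }

    public void error(SAXParseException e) {
        validSoFar = false;
    }

    public void fatalError(SAXParseException e) throws SAXParseException {
        validSoFar = false;
        throw e;  // fatal errors still stop processing
    }

    public boolean isValidSoFar() {
        return validSoFar;
    }

    public static void main(String[] args) {
        ValidityTracker tracker = new ValidityTracker();
        System.out.println(tracker.isValidSoFar());
        tracker.error(new SAXParseException("example validity error", null));
        System.out.println(tracker.isValidSoFar());
    }
}
```

An application that cares about validity registers such a handler and
consults it whenever it likes; one that does not pays nothing.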

ignorableWhitespace (8.3.9), 4th bullet reads to the effect that if certain
characters are determined to be ignorable, then the ignorableWhitespace
callback should be invoked.  "Should" should be replaced by "must", and the
fact that this is only meaningful for DTDs should be clearly stated;
perhaps a link to the Infoset's [element content whitespace] property
should be provided.  Also, this need not be justified with reference to
DocumentBuilders.

javax.xml.parsers:

If setSchema is used, the reader's attention should be emphatically called
to the fact that errors will generate exceptions if no error handler is
registered.

The setSchema() methods for both DocumentBuilderFactory and
SAXParserFactory state that "[w]hen a Schema is non-null, a parser will use
a validator created from it to validate documents before it passes
information on to the application".  We are strongly of the view that this
unacceptably restricts implementors' freedom:  This should be reworded to
read "[w]hen a Schema is non-null, a parser will behave as if it used a
validator created from that Schema to validate documents before it passes
information on to the application".  This will allow an implementor to use
custom, possibly dramatically more efficient, behaviour if the Schema
object registered with the Factory is compatible with the parser objects at
a level lower than JAXP.

It is not clear to us why it should not be possible to register DOM level 3
error reporters/entity resolvers with a DocumentBuilder.  If this were
done, of course, how the new DOM L3 entity resolvers/error reporters would
interact with old ones would need to be specified.  Further, it seems to us
as if it might be useful for a DocumentBuilderFactory to be able to return
LSParsers and LSSerializers; since the ability to set Schemas on
DocumentBuilderFactories is not duplicated in the DOM, it seems likely that
this functionality would be highly useful to the community.

javax.xml.datatype:

1.3.15:  adding two durations:  normalized is undefined.  It appears to
have a connection with the normalizeWith(Calendar) method, but this should
be made clear.  Normalize is also not mentioned in subtract; this is at
least inconsistent.

Since normalizeWith appears to return a normalized duration, there should
be a method (isNormalized?) that tells an application whether the Duration
object has been normalized with respect to some Calendar.
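
A short Java sketch of what we mean.  The looksNormalized helper is a
hypothetical application-side stand-in for the proposed isNormalized;
it rests on our assumption that "normalized" means the year and month
fields have been folded into days, which is exactly the point the spec
leaves unstated.  We also assume DatatypeFactory is the entry point for
creating Duration objects.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.Duration;

class DurationNormalizeSketch {
    // Hypothetical stand-in for isNormalized(): assumes "normalized" means
    // no year or month component remains.
    static boolean looksNormalized(Duration d) {
        return d.getYears() == 0 && d.getMonths() == 0;
    }

    static Duration oneMonth() throws Exception {
        return DatatypeFactory.newInstance().newDuration("P1M");
    }

    // Normalizes one month against 15 January 2004 (UTC, to avoid DST effects).
    static Duration normalizedOneMonth() throws Exception {
        Calendar start = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        start.clear();
        start.set(2004, Calendar.JANUARY, 15);
        return oneMonth().normalizeWith(start);
    }

    public static void main(String[] args) throws Exception {
        Duration before = oneMonth();
        Duration after = normalizedOneMonth();
        System.out.println(before + " normalized? " + looksNormalized(before));
        // 15 January to 15 February 2004 spans 31 days.
        System.out.println(after + " normalized? " + looksNormalized(after)
                + ", days = " + after.getDays());
    }
}
```

Note that the helper cannot tell a duration that was normalized from
one that simply never had year or month fields, nor which Calendar was
used; only a method on Duration itself could answer that.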

General:

The javax.xml.(parsers|transform).FactoryFinders are at very considerable
variance with what's been in xml-commons for a long time; since this
represents the product of much debugging from experience in the field,
should the reference implementation not leverage the Apache code?  It
would also seem
prudent for all the factories to leverage a common mechanism for performing
actions like finding classLoaders (i.e., what order to consult Context vs.
system vs. bootstrap classloaders etc.)
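
As a sketch of what a shared mechanism could look like, in Java:  the
LoaderOrderSketch class and its findClass method are hypothetical, and
the lookup order shown (context, then system, then bootstrap) is just
one plausible choice, not a description of what xml-commons or the
reference implementation actually does.

```java
// Hypothetical shared helper; the lookup order here is an assumption.
class LoaderOrderSketch {
    static Class<?> findClass(String name) throws ClassNotFoundException {
        // 1. The thread's context class loader, if one is set.
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        if (context != null) {
            try {
                return Class.forName(name, false, context);
            } catch (ClassNotFoundException ignored) {
                // fall through to the next loader
            }
        }
        // 2. The system (application) class loader.
        ClassLoader system = ClassLoader.getSystemClassLoader();
        if (system != null) {
            try {
                return Class.forName(name, false, system);
            } catch (ClassNotFoundException ignored) {
                // fall through to the bootstrap loader
            }
        }
        // 3. A null loader asks Class.forName to use the bootstrap loader.
        return Class.forName(name, false, null);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(findClass("java.lang.String").getName());
    }
}
```

Whatever order is chosen, having every factory call one helper would
at least guarantee that they all agree.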

Cheers!
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  [EMAIL PROTECTED]


