Tom Bradford wrote:
[...]
> There are a few things that need to be addressed in future revisions
> of Xindice.  I'll run through them very quickly, and then I'd like to
> hear people's feedback.
> 
[...] 
> Schema support
> -----------------------
> We need to support schemas in an abstracted fashion.  If we can
> architect a content model API that would allow the system to validate
> and operate against a content model without needing to know that the
> content model is based on XML Schemas or Relax NG, that would be ideal.

Why in Xindice? There are several places where validation can occur:
1. upon storing in the database; 2. following an XUpdate; and 3. upon
retrieving content from the database. In all three cases, the DOM nodes
to be validated are already available to the developer outside of 
Xindice and can be validated using existing validation tools and techniques.
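For example, validating a DOM tree before it goes into (or after it comes out of) the database takes only a few lines with standard tooling. Here's a sketch using the JAXP validation API (javax.xml.validation; Xerces does the actual work underneath) with a tiny illustrative inline schema -- the class and schema are mine, nothing here is Xindice-specific:

```java
import java.io.StringReader;

import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ValidateOutsideXindice {

    // Illustrative no-namespace schema: <price> must be a decimal.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='price' type='xs:decimal'/>"
      + "</xs:schema>";

    /** Validate a DOM tree; return null if valid, else the first error message. */
    static String validate(Document doc) throws Exception {
        SchemaFactory sf =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(new StreamSource(new StringReader(XSD)));
        Validator v = schema.newValidator();
        try {
            v.validate(new DOMSource(doc));
            return null;                 // valid
        } catch (SAXException e) {
            return e.getMessage();       // first validity error
        }
    }

    /** Parse a string into a namespace-aware DOM (no validation yet). */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder()
                  .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate(parse("<price>12.50</price>")));  // prints null
        System.out.println(validate(parse("<price>cheap</price>")));  // an error message
    }
}
```

The same validate() call works regardless of whether the DOM came from a file, an XUpdate result, or a Xindice retrieval -- which is exactly the point.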

Not that I'm going to fight the issue, but I'm rather against including
support for schema validation within Xindice, as this is an application-
level issue (as I've described in previous messages). There are many 
different types of schema validation and different validation needs, e.g.,
different levels of strictness or different content validation at various 
places within a processing regimen. Validation is a complicated issue that
doesn't have a one-size-fits-all solution.

There is a plethora of validation options out there, and I don't see how
one API could serve the variety of schema languages and structure- and
content-validation needs within a reasonable scope of effort. You'd 
be tackling the same issues that the W3C Schema WG tackled, with the 
"data heads" and "document heads" needs on the table. 

Perhaps I'm just being daft, but I've never followed the reasoning on
why anyone would *need* to include further validation functionality
*within* Xindice. It only seems to add redundant complexity to the 
package. Those who know my history in the markup field know I'm a
big advocate of validation, but this is one place I wouldn't support
its inclusion, code-wise. If anyone is unclear as to how to provide
Java-based XML validation within their application, I'd be happy to
suggest several books and available software packages.

In passing I should mention that Sun has released binaries and source 
code for a Multi-Schema XML Validator, which we demoed at XML One. This
tool can work on the command line, can be integrated into applications, 
and can even act as a SAX pipe validator.

   http://www.sun.com/software/xml/developers/multischema/
 
It supports RELAX NG, RELAX Namespace, RELAX Core, TREX, XML DTDs, and a
subset of XML Schema Part 1. Of course, DTD and XML Schema validation is 
built into Xerces 2, which is already part of the Xindice distribution. 
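To show the Xerces route concretely, here's a sketch of schema validation switched on through the standard JAXP 1.2 schemaLanguage/schemaSource parser properties, with a SAX error handler counting validity errors as the stream goes by -- the property URIs are the standard JAXP ones, but the class and inline schema are just for illustration:

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxPipeValidation {

    // Standard JAXP 1.2 property URIs, supported by Xerces 2.
    static final String SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Illustrative no-namespace schema: <price> must be a decimal.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='price' type='xs:decimal'/>"
      + "</xs:schema>";

    /** Parse with schema validation on; return the number of validity errors. */
    static int countErrors(String xml) throws Exception {
        SAXParserFactory f = SAXParserFactory.newInstance();
        f.setNamespaceAware(true);
        f.setValidating(true);
        SAXParser p = f.newSAXParser();
        p.setProperty(SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        p.setProperty(SCHEMA_SOURCE, new InputSource(new StringReader(XSD)));

        final int[] errors = {0};
        p.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors[0]++;   // recoverable (validity) errors land here
            }
        });
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countErrors("<price>10.00</price>"));
        System.out.println(countErrors("<price>cheap</price>"));
    }
}
```

Because the handler rides along with the normal SAX event stream, the same arrangement can sit in front of, or behind, whatever is feeding the database.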

> Context-sensitive indexing
> ------------------------------------
> XML Schemas introduces the idea of contextually-dependent typing.  What
> this means is that for any particular schema, that schema may use the
> same element name in more than one scope, and assign to that element
> name a completely different primitive type for each scope.  So in one
> scope, it may be an int, while in another it may be a string, or even a
> complex structure.
> 
> Xindice's indexing system was originally designed when DTDs were the only
> standard way of representing an XML schema, and in DTDs, an element name
> is globally unique.  So we need to rearchitect the indexing system to
> support the ability for attaching a particular index to a schema
> context.  I have some vague ideas of how to do this, but I'd like to get
> a user's perspective on how you'd like to see this made available.

I don't see how this could be done reasonably without hooking deeply
into the XML Schema support code that's in Xerces in order to be
certain that the same context was arrived at by both an application
and Xindice. That is, I believe this would be necessary unless one
believes that Xerces' XML Schema support will provide the same context
under all circumstances as other XML Schema tools. I'm skeptical about 
that. At least we'd be bug-for-bug compatible with other Xerces-based
applications.
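For concreteness, the contextual typing Tom is describing looks like this in XML Schema -- the same element name bound to different types in different scopes (a minimal illustrative schema, not from any real application):

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <!-- in this scope, "id" is an integer -->
        <xs:element name="id" type="xs:int"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <!-- in this scope, the same element name "id" is a string -->
        <xs:element name="id" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

An index on "id" only makes sense once you know which of these two contexts a given node sits in, and working that out is precisely the schema-processor coupling I'm skeptical about.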

> Large Documents and Document Versioning
> ------------------------------------------------------------
> Xindice needs to be capable of supporting massive documents in a
> scalable fashion and with acceptable performance.  Currently, the
> document representation architecture is based on a tokenized, lazy DOM
> where the bytestream images that feed the DOM are stored and retrieved
> in a paged filing system.  Every document is treated as an atomic unit.
> This has some serious limitations when it comes to massive documents.
> 
> In order to support very large documents, the tokenization system needs
> to be replaced and geared more toward the simplified representation of
> document structure rather than an equal balance of structure and
> content.  Also, the Filer interfaces need to support the notion of
> streaming, and even more importantly, the ability to support random
> access streaming.
> 
> Also, the tokenization system needs to support versioning in one way or
> another.  For small documents, complete document revision links are
> permissible, but for massive documents, there's no way that versioning
> of that nature is acceptable.  So, the tokenization system needs to
> understand the notion of versioned linking.
> 
> The DTSM stuff that I started working on will help with the massive
> document problem, but we'd need to introduce the versioning concept into
> the specification as well.

I'm likely to be tackling something akin to this in the next few months,
trying to hook up javacvs (the netbeans.org version, not the sourceforge
one which is under GPL) to Xindice. I don't have much of a need for large
document support, but the approach I'd take would be perhaps useful in
that regard. Basically, content would be checked into javacvs prior to being
stored in Xindice, so most revision control issues are handled outside
of the database. I would not be attempting node-based revision control 
support (i.e., as Tom said above, support within the tokenization system),
which would be very valuable but outside the scope of effort I'm willing 
to take on. If someone is willing to do the node-based RCS within Xindice,
I'm quite happy to step aside.
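To sketch the division of labor I have in mind -- and I stress this is purely hypothetical, neither Repository nor Collection below is the real javacvs or Xindice API, just in-memory stand-ins showing the ordering:

```java
import java.util.HashMap;
import java.util.Map;

/*
 * Hypothetical sketch of "check in first, then store": the version
 * history lives outside the database, and Xindice only ever holds the
 * head revision (tagged with its revision number).
 */
public class VersionedStoreSketch {

    /** Stand-in for a javacvs-backed repository (hypothetical). */
    static class Repository {
        private final Map<String, Integer> revisions = new HashMap<>();

        /** Check in content for a path; return the new revision number. */
        int checkIn(String path, String content) {
            // A real repository would store the content/diff; we only
            // track the revision counter here.
            return revisions.merge(path, 1, Integer::sum);
        }
    }

    /** Stand-in for a Xindice collection (hypothetical). */
    static class Collection {
        private final Map<String, String> documents = new HashMap<>();

        void storeDocument(String key, String xml) {
            documents.put(key, xml);
        }

        String getDocument(String key) {
            return documents.get(key);
        }
    }

    public static void main(String[] args) {
        Repository cvs = new Repository();
        Collection col = new Collection();

        // 1. Version the content outside the database...
        int rev = cvs.checkIn("/docs/doc.xml", "<doc>v1</doc>");
        // 2. ...then store only the head revision in Xindice.
        col.storeDocument("doc.xml", "<doc rev='" + rev + "'>v1</doc>");

        // A later update repeats the cycle: check in, then overwrite the head.
        rev = cvs.checkIn("/docs/doc.xml", "<doc>v2</doc>");
        col.storeDocument("doc.xml", "<doc rev='" + rev + "'>v2</doc>");

        System.out.println(col.getDocument("doc.xml"));
    }
}
```

The database stays oblivious to history; rolling back just means checking out an old revision and storing it again.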

Murray

...........................................................................
Murray Altheim, Staff Engineer          <mailto:murray.altheim@sun.com>
Java and XML Software
Sun Microsystems, 1601 Willow Rd., MS UMPK17-102, Menlo Park, CA 94025

       Ernst Martin comments in 1949, "A certain degree of noise in 
       writing is required for confidence. Without such noise, the 
       writer would not know whether the type was actually printing 
       or not, so he would lose control."
