Tom Bradford wrote:
[...]
> There are a few things that need to be addressed in future revisions
> of Xindice. I'll run through them very quickly, and then I'd like to
> hear people's feedback.
> [...]
> Schema support
> -----------------------
> We need to support schemas in an abstracted fashion. If we can
> architect a content model API that would allow the system to validate
> and operate against a content model without needing to know that the
> content model is based on XML Schemas or Relax NG, that would be ideal.
Why in Xindice? There are several places where validation can occur:
1. upon storing in the database; 2. following an XUpdate; and 3. upon
retrieving content from the database. In all three cases, the DOM nodes
to be validated are already available to the developer outside of
Xindice and can be validated using existing validation tools and
techniques.

Not that I'm going to fight the issue, but I'm rather against including
support for schema validation within Xindice, as this is an
application-level issue (as I've described in previous messages). There
are many different types of schema validation, and different validation
needs, e.g., different levels of strictness or different content
validation at various places within a processing regimen. Validation is
a complicated issue that doesn't have a one-size-fits-all solution.
There is a plethora of validation options out there, and I don't see
that one API could serve the variety of schema languages and structure
and content validation needs within a reasonable scope of effort. You'd
be tackling the same issues that the W3C Schema WG tackled, with the
"data heads" and "document heads" needs on the table.

Perhaps I'm just being daft, but I've never followed the reasoning on
why anyone would *need* to include further validation functionality
*within* Xindice. It only seems to add redundant complexity to the
package. Those who know my history in the markup field know I'm a big
advocate of validation, but this is one place I wouldn't support its
inclusion, code-wise. If anyone is unclear as to how to provide
Java-based XML validation within their application, I'd be happy to
suggest several books and available software packages.

In passing I should mention that Sun has released binaries and source
code for a Multi-Schema XML Validator, which we demoed at XML One. This
tool can work on the command line, can be integrated into applications,
and can even act as a SAX pipe validator.

  http://www.sun.com/software/xml/developers/multischema/

It supports RELAX NG, RELAX Namespace, RELAX Core, TREX, XML DTDs, and
a subset of XML Schema Part 1. Of course, DTD and XML Schema validation
is built into Xerces 2, which is already part of the Xindice
distribution.
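To make the application-level approach concrete, here's a rough sketch
of case 1 (validating upon storing): run the document through a
validating Xerces parse in your own code, and only hand the resulting
DOM to Xindice through the XML:DB API once it passes. The file name and
collection URI are purely illustrative, and you should check the driver
class name against your own Xindice release.

  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.xml.sax.SAXParseException;
  import org.xml.sax.helpers.DefaultHandler;
  import org.xmldb.api.DatabaseManager;
  import org.xmldb.api.base.Collection;
  import org.xmldb.api.base.Database;
  import org.xmldb.api.modules.XMLResource;

  public class ValidateThenStore {
    public static void main(String[] args) throws Exception {

      // 1. Validate at the application level with Xerces (DTD and/or
      //    W3C XML Schema, whichever the document declares), treating
      //    any validity error as fatal.
      DOMParser parser = new DOMParser();
      parser.setFeature("http://xml.org/sax/features/validation", true);
      parser.setFeature("http://apache.org/xml/features/validation/schema", true);
      parser.setErrorHandler(new DefaultHandler() {
        public void error(SAXParseException e) throws SAXParseException {
          throw e;
        }
      });
      parser.parse("addresses.xml");               // illustrative file name
      Document doc = parser.getDocument();

      // 2. Only if the parse succeeded do we hand the DOM to Xindice
      //    through the XML:DB API.
      Class c = Class.forName("org.apache.xindice.client.xmldb.DatabaseImpl");
      DatabaseManager.registerDatabase((Database) c.newInstance());
      Collection col =
          DatabaseManager.getCollection("xmldb:xindice:///db/addressbook");
      XMLResource res = (XMLResource)
          col.createResource("addresses.xml", XMLResource.RESOURCE_TYPE);
      res.setContentAsDOM(doc);
      col.storeResource(res);
      col.close();
    }
  }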
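And if the schema in question is RELAX NG or TREX rather than a DTD or
W3C schema, MSV can be driven in much the same way through the JARV
verifier interfaces it ships with. I'm writing this from memory, so
treat the class and method names below as approximate and check them
against the MSV distribution; the file names are again just examples.

  import org.iso_relax.verifier.Schema;
  import org.iso_relax.verifier.Verifier;
  import org.iso_relax.verifier.VerifierFactory;

  public class MsvCheck {
    public static void main(String[] args) throws Exception {
      // MSV's JARV factory detects the schema language on its own
      // (RELAX NG, RELAX Core, TREX, DTDs, a subset of W3C XML Schema).
      VerifierFactory factory = new com.sun.msv.verifier.jarv.TheFactoryImpl();
      Schema schema = factory.compileSchema("addresses.rng");
      Verifier verifier = schema.newVerifier();
      if (verifier.verify("addresses.xml")) {
        System.out.println("valid; safe to hand to Xindice");
      } else {
        System.out.println("invalid; reject before it reaches the database");
      }
    }
  }

Either way the point is the same: the check happens in the application,
before or after the database, with whatever tool suits the schema
language, and Xindice itself stays out of the validation business.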
> Context-sensitive indexing
> ------------------------------------
> XML Schemas introduces the idea of contextually-dependent typing. What
> this means is that for any particular schema, that schema may use the
> same element name in more than one scope, and assign to that element
> name a completely different primitive type for each scope. So in one
> scope, it may be an int, while in another it may be a string, or even
> a complex structure.
>
> Xindice's indexing system was originally designed when DTDs were the
> only standard way of representing an XML schema, and in DTDs, an
> element name is globally unique. So we need to rearchitect the indexing
> system to support the ability to attach a particular index to a schema
> context. I have some vague ideas of how to do this, but I'd like to get
> a user's perspective on how you'd like to see this made available.

I don't see how this could be done reasonably without hooking deeply
into the XML Schema support code that's in Xerces, in order to be
certain that the same context was arrived at by both an application and
Xindice. That is, I believe this would be necessary unless one believes
that Xerces' XML Schema support will provide the same context under all
circumstances as other XML Schema tools. I'm skeptical about that. At
least we'd be bug-for-bug compatible with other Xerces-based
applications.

> Large Documents and Document Versioning
> ------------------------------------------------------------
> Xindice needs to be capable of supporting massive documents in a
> scalable fashion and with acceptable performance. Currently, the
> document representation architecture is based on a tokenized, lazy DOM
> where the bytestream images that feed the DOM are stored and retrieved
> in a paged filing system. Every document is treated as an atomic unit.
> This has some serious limitations when it comes to massive documents.
>
> In order to support very large documents, the tokenization system
> needs to be replaced and geared more toward the simplified
> representation of document structure rather than an equal balance of
> structure and content. Also, the Filer interfaces need to support the
> notion of streaming, and even more importantly, the ability to support
> random access streaming.
>
> Also, the tokenization system needs to support versioning in one way
> or another. For small documents, complete document revision links are
> permissible, but for massive documents, there's no way that versioning
> of that nature is acceptable. So, the tokenization system needs to
> understand the notion of versioned linking.
>
> The DTSM stuff that I started working on will help with the massive
> document problem, but we'd need to introduce the versioning concept
> into the specification as well.

I'm likely to be tackling something akin to this in the next few
months, trying to hook up javacvs (the netbeans.org version, not the
sourceforge one, which is under the GPL) to Xindice. I don't have much
of a need for large document support, but the approach I'd take would
perhaps be useful in that regard. Basically, content would be checked
into javacvs prior to being stored in Xindice, hence most revision
control issues are handled outside of the database. I would not be
attempting node-based revision control support (i.e., as Tom said
above, support within the tokenization system), which would be very
valuable but is outside the scope of effort I'm willing to take on. If
someone is willing to do the node-based RCS within Xindice, I'm quite
happy to step aside.

Murray

...........................................................................
Murray Altheim, Staff Engineer          <mailto:murray.altheim@sun.com>
Java and XML Software
Sun Microsystems, 1601 Willow Rd., MS UMPK17-102, Menlo Park, CA 94025

   Ernst Martin comments in 1949, "A certain degree of noise in writing
   is required for confidence. Without such noise, the writer would not
   know whether the type was actually printing or not, so he would lose
   control."