Last Friday, I formally resigned from my position as Chief Architect of the dbXML Group, and so I am now a free agent. I am about to take a job with a company in the Bay Area, and will be relocating there shortly after that. This new position may or may not afford me the ability to continue working on Xindice with the amount of attention I devote to it now, so we need to start taking steps in order to make sure that the project continues to evolve if the situation is such that I can't do a lot of the coding any more.

There are a few things that need to be addressed in future revisions of Xindice. I'll run through them very quickly, and then I'd like to hear people's feedback.

Wire Protocol changes
-------------------------------
These have been widely mentioned, but we need to start moving away from CORBA and supporting a more flexible wire protocol system with Xindice. I'd propose to use my Labrador framework to provide this functionality, as I've already experimented with it, and it works rather well.


Schema support
-----------------------
We need to support schemas in an abstracted fashion. If we can architect a content model API that would allow the system to validate and operate against a content model without needing to know that the content model is based on XML Schemas or Relax NG, that would be ideal.
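
To make the idea concrete, here is a minimal sketch of what such an abstraction might look like. Every name in it (ContentModel, SetBackedContentModel, ContentModelDemo) is hypothetical and not part of any existing Xindice API, and the toy implementation simply stands in for a model that would really be backed by XML Schemas or Relax NG.

```java
import java.util.List;
import java.util.Set;

// Hypothetical names throughout; none of this is existing Xindice API.
interface ContentModel {
    /** Returns true if childName is allowed directly under parentName. */
    boolean allowsChild(String parentName, String childName);
}

/** Toy implementation standing in for an XML Schema- or Relax NG-backed model. */
final class SetBackedContentModel implements ContentModel {
    // Each entry has the form "parent/child".
    private final Set<String> allowedPairs;

    SetBackedContentModel(Set<String> allowedPairs) {
        this.allowedPairs = allowedPairs;
    }

    @Override
    public boolean allowsChild(String parentName, String childName) {
        return allowedPairs.contains(parentName + "/" + childName);
    }
}

final class ContentModelDemo {
    /** Validation code sees only the interface, never the schema language. */
    static boolean validChildren(ContentModel model, String parent, List<String> children) {
        for (String child : children) {
            if (!model.allowsChild(parent, child)) {
                return false;
            }
        }
        return true;
    }
}
```

The point of the sketch is only that validChildren never learns which schema language produced the model.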


Context-sensitive indexing
------------------------------------
XML Schemas introduces the idea of contextually dependent typing. What this means is that any particular schema may use the same element name in more than one scope, and assign to that element name a completely different type in each scope. So in one scope, it may be an int, while in another it may be a string, or even a complex structure.


Xindice's indexing system was originally designed when DTDs were the only standard way of representing an XML schema, and in DTDs, an element name is globally unique. So we need to rearchitect the indexing system to support attaching a particular index to a schema context. I have some vague ideas of how to do this, but I'd like to get a user's perspective on how you'd like to see this made available.
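
One possible shape for this, purely as a sketch: key index definitions by the element's full path (its context) rather than by its bare name. ContextIndexRegistry, ValueType, and the path syntax are all assumptions made for illustration, not existing Xindice classes or a settled design.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the class and method names are not part of Xindice.
final class ContextIndexRegistry {
    enum ValueType { INT, STRING }

    // Key is the full element path (the schema context), not the bare element name.
    private final Map<String, ValueType> typesByPath = new HashMap<>();

    void register(String elementPath, ValueType type) {
        typesByPath.put(elementPath, type);
    }

    /** Returns the index type for this exact context, or null if none is registered. */
    ValueType lookup(String elementPath) {
        return typesByPath.get(elementPath);
    }
}
```

With something like this, "/order/id" could be indexed as an int while "/customer/id" is indexed as a string, even though both elements are named "id".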


Large Documents and Document Versioning
------------------------------------------------------------
Xindice needs to be capable of supporting massive documents in a scalable fashion and with acceptable performance. Currently, the document representation architecture is based on a tokenized, lazy DOM where the bytestream images that feed the DOM are stored and retrieved in a paged filing system. Every document is treated as an atomic unit. This has some serious limitations when it comes to massive documents.


In order to support very large documents, the tokenization system needs to be replaced and geared more toward the simplified representation of document structure rather than an equal balance of structure and content. Also, the Filer interfaces need to support the notion of streaming, and even more importantly, the ability to support random access streaming.
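
As a rough illustration of random access streaming, the sketch below reads an arbitrary byte range of a stored record without loading the whole record. RandomAccessRecord and InMemoryRecord are hypothetical names, not the current Filer interfaces, and the in-memory array only stands in for paged storage.

```java
// Hypothetical interface; not part of the existing Filer API.
interface RandomAccessRecord {
    /** Total length of the stored record in bytes. */
    long length();

    /**
     * Copies up to count bytes starting at position into dest at offset.
     * Returns the number of bytes copied, or -1 if position is out of range.
     */
    int read(long position, byte[] dest, int offset, int count);
}

/** In-memory stand-in for a record that would really live in a paged file. */
final class InMemoryRecord implements RandomAccessRecord {
    private final byte[] data;

    InMemoryRecord(byte[] data) {
        this.data = data;
    }

    @Override
    public long length() {
        return data.length;
    }

    @Override
    public int read(long position, byte[] dest, int offset, int count) {
        if (position < 0 || position >= data.length) {
            return -1;
        }
        // Clamp the request to the bytes that actually remain.
        int n = (int) Math.min(count, data.length - position);
        System.arraycopy(data, (int) position, dest, offset, n);
        return n;
    }
}
```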

Also, the tokenization system needs to support versioning in one way or another. For small documents, complete document revision links are permissible, but for massive documents, there's no way that versioning of that nature is acceptable. So, the tokenization system needs to understand the notion of versioned linking.

The DTSM stuff that I started working on will help with the massive document problem, but we'd need to introduce the versioning concept into the specification as well.


Paged Files and BTrees
---------------------------------
Nodes that are stored by Paged files are currently materialized in their entirety, even if not all of their content is needed. Originally, it was written like this because I wanted to nail down functionality. In a language like C++ or C, this is not an issue because you point a struct pointer to an offset into your buffer, and voila, you're done, but in Java, it requires a lot of conversion. For Java, it may improve performance quite a bit if node portions (such as BTree node pointer and value lists) were materialized only on demand rather than as a whole. Obviously, this would require some research to determine whether my guess is true or not.
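
A minimal sketch of on-demand materialization, assuming a made-up page layout of one int count followed by that many long values. This is not Xindice's actual on-disk format, and LazyNodeView is a hypothetical class. It uses the absolute get methods of java.nio.ByteBuffer to decode a single entry in place.

```java
import java.nio.ByteBuffer;

// Hypothetical page layout, for illustration only: an int count followed by
// `count` long values. This is not Xindice's actual on-disk format.
final class LazyNodeView {
    private final ByteBuffer page;

    LazyNodeView(ByteBuffer page) {
        this.page = page;
    }

    /** Reads the entry count from the start of the page. */
    int valueCount() {
        return page.getInt(0);
    }

    /** Decodes only the requested entry, straight from the page buffer. */
    long valueAt(int index) {
        if (index < 0 || index >= valueCount()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return page.getLong(Integer.BYTES + index * Long.BYTES);
    }
}
```

Whether this actually beats materializing the whole node would still need to be measured.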


--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database) - http://xml.apache.org
Creator - Project Labrador (Web Services Framework) - http://notdotnet.org

