Lixin Meng wrote:

>- metadata: we need a neutral way to query metadata for
>collections and
>resources. I like David's solution of having a MetaData object with a


I also hope we can have metadata at the database level. http://marc.theaimsgroup.com/?l=xindice-dev&m=103790372009713&w=2


Can you be more specific on that? I saw the message in the archive, but I fail to see how database metadata would help here. I tend to think that database metadata is mostly capabilities (like transaction support) and maybe the collection tree, nothing more really.

As for XPath queries sent to the whole database, I understand that they might be useful, but I see a problem. Given an XPath like /db/content/whatever/A/B, how can you tell which of the tokens is a collection, which one is a document and which one is part of a real XML XPath? This becomes even more difficult with XPaths like //*/A/B. But I'd be happy to be proven wrong, since I see lots of use cases for that.
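To make the ambiguity concrete, here is a toy Python sketch (my own illustration, not Xindice code): every split point of the path is an equally plausible (collection path, document XPath) pair, and nothing in the syntax tells the server which one was meant.

```python
def candidate_splits(path):
    """Enumerate every way to read a slash path as a
    (collection path, document XPath) pair. A real server would
    have to disambiguate these somehow -- the syntax alone can't."""
    tokens = path.strip("/").split("/")
    splits = []
    for i in range(len(tokens) + 1):
        collection = "/" + "/".join(tokens[:i])
        xpath = "/" + "/".join(tokens[i:]) if i < len(tokens) else None
        splits.append((collection, xpath))
    return splits

for coll, xp in candidate_splits("/db/content/whatever/A/B"):
    print(coll, "->", xp)
```

For a five-token path there are already six candidate readings, from "everything is collection tree" to "everything is document XPath".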

>2. PERFORMANCE
>Face it: we are slow. We are fair enough for small jobs but we cannot
>stand high loads or huge documents, no matter how accurate
>your indexes


>Xindice has its own B-Tree files for data storage and search. Could we
>consider leveraging existing RDBMS systems? RDBMSs have been developed
>and fine-tuned for so many years, and they have solved many issues that
>we are going to tackle (performance, transactions, and security).

Here I disagree. My point is that an XML database should solve the problem of semistructured data. Pushing semistructured data into a relational DB looks at least suboptimal to me. I can see a reason when dealing with data-oriented XML (just tags and attributes), but things become really messy with text-oriented documents: how could you efficiently break into a tabular format something like


<p>
This is a <i>text</i>. There are text <b>nodes</b> all over the place: I dare you to insert this stuff <emphasis>efficiently</emphasis> in a
<a href="http://www.mysql.com">relational database</a>.
</p>
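To see how messy this gets, here is a minimal Python sketch (my own illustration, not a proposed design) that shreds the fragment above into edge-table rows the way a naive XML-to-relational mapping would. Even this tiny paragraph explodes into 15 rows, and we haven't even added the ordinal column you'd need to recover document order of the interleaved text nodes.

```python
import xml.etree.ElementTree as ET
from itertools import count

FRAGMENT = ('<p>This is a <i>text</i>. There are text <b>nodes</b> '
            'all over the place: I dare you to insert this stuff '
            '<emphasis>efficiently</emphasis> in a '
            '<a href="http://www.mysql.com">relational database</a>.</p>')

def shred(xml_text):
    """Flatten an XML fragment into edge-table rows of the form
    (node_id, parent_id, kind, value), one row per element,
    attribute, or text node."""
    rows, ids = [], count(1)

    def visit(elem, parent):
        nid = next(ids)
        rows.append((nid, parent, "element", elem.tag))
        for k, v in elem.attrib.items():
            rows.append((next(ids), nid, "attribute", f"{k}={v}"))
        if elem.text and elem.text.strip():
            rows.append((next(ids), nid, "text", elem.text.strip()))
        for child in elem:
            visit(child, nid)
            # tail text ("mixed content") belongs to the *parent*,
            # which is exactly what makes the mapping so awkward
            if child.tail and child.tail.strip():
                rows.append((next(ids), nid, "text", child.tail.strip()))

    visit(ET.fromstring(xml_text), None)
    return rows

for row in shred(FRAGMENT):
    print(row)
```

Reassembling the original paragraph then means stitching those text rows back around the elements in the right order, per query.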


Besides, I see no real reason to follow that path, since there is another Open Source XML database (eXist) which is doing exactly that (not to mention that every database vendor has its own XML->DB engine). As a side note, I'd actually love to see the two projects merge, but it doesn't look like the right timing: still, Wolfgang and his team have my total appreciation for the job they are doing.

If we were to choose that path (tabular XML), I would actually investigate further the forthcoming XML database from the Sleepycat guys: just as MySQL uses Berkeley DB for storage, we might leverage Berkeley XML DB.

But then again, I'm starting to ask myself if we need a storage layer at all. I know it sounds provocative, but try to follow me and my crappy English through these two use cases:

1. Use case 1: we are asked for a particular resource (say an XML document), and all we need to do is find it and deliver it *as fast as possible* to the user. This means all we need to do is reduce bottlenecks, and apart from network bottlenecks, the only real limitation that I see is *parsing*: if we parse a file, then deliver it to a client over the network in a form that in turn needs some kind of parsing, we are just wasting our time. As of now we are dealing with DOM, which is the most expensive and slow XML data structure around. I am currently looking at DTM from Xalan (which however is showing some serious limitations), and I'm willing to try the SAX event compilation way "à la Cocoon", where all you have is just a byte stream containing "recorded" SAX events. All we need to do is:

a. when writing a document, write it on disk as a byte stream of compiled events;

b. when we are requested a document, just send that byte stream over the network to the client;

c. let the client perform the reverse operation, by interpreting (playing back) the recorded SAX events (possibly into a DOM builder if the client application is requesting a DOM tree).
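The three steps above can be sketched in a few lines of Python (the names and the pickle-based event encoding are mine, purely for illustration; a real implementation would use a compact binary format):

```python
import io
import pickle
import xml.sax
from xml.sax.handler import ContentHandler

class Recorder(ContentHandler):
    """Write time (step a): serialize SAX events to a byte stream."""
    def __init__(self, out):
        super().__init__()
        self.out = out
    def startElement(self, name, attrs):
        pickle.dump(("start", name, dict(attrs)), self.out)
    def endElement(self, name):
        pickle.dump(("end", name), self.out)
    def characters(self, content):
        pickle.dump(("chars", content), self.out)

def playback(stream, handler):
    """Read time (step c): replay recorded events into any
    ContentHandler -- no XML parsing involved."""
    while True:
        try:
            event = pickle.load(stream)
        except EOFError:
            return
        if event[0] == "start":
            handler.startElement(event[1], event[2])
        elif event[0] == "end":
            handler.endElement(event[1])
        else:
            handler.characters(event[1])

class EventList(ContentHandler):
    """Toy consumer: collects replayed events for inspection
    (a DOM builder would slot in here instead)."""
    def __init__(self):
        super().__init__()
        self.events = []
    def startElement(self, name, attrs):
        self.events.append(("start", name))
    def endElement(self, name):
        self.events.append(("end", name))
    def characters(self, content):
        self.events.append(("chars", content))
```

Step b is then just shipping the recorded bytes over the wire; the server never touches an XML parser on the read path.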

2. Use case 2: we are asked for an XPath query (or, in the future, XQuery). Here we need really fast indexes and a really fast XPath engine, and here Xalan DTM might play a key role.
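As a rough illustration of what fast indexes buy us (a toy Python sketch, not a proposal for the actual engine): even a trivial element-name index lets us discard whole documents before the XPath engine ever runs, so the expensive evaluation only touches plausible candidates.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

class NameIndex:
    """Toy index: element name -> set of doc ids. A query like
    //A/B only needs to be evaluated against documents that
    contain both an A and a B element."""
    def __init__(self):
        self.docs = {}                  # doc id -> parsed tree
        self.index = defaultdict(set)   # element name -> doc ids

    def add(self, doc_id, xml_text):
        tree = ET.fromstring(xml_text)
        self.docs[doc_id] = tree
        for elem in tree.iter():
            self.index[elem.tag].add(doc_id)

    def query(self, path_names, etpath):
        # index phase: intersect the candidate sets
        candidates = set(self.docs)
        for name in path_names:
            candidates &= self.index[name]
        # engine phase: run the full path only on survivors
        return sorted(d for d in candidates
                      if self.docs[d].findall(etpath))
```

Real indexes would of course be persistent and value-aware, but the shape of the optimization is the same.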

Now, I know that there are more (write-oriented) use cases such as XUpdate, Sixdml and the like, but I still think that those kinds of operations can be accomplished in a longer timeframe. Again, the parallel with LDAP stands: LDAP writes are *slow* but reads are *blazingly fast*. Not to mention that there might be a way to optimize that part too.

How does it sound? Crazy? :-)

Ciao,

--
Gianugo Rabellino
