> Do you mean that there might be a use case for a metadata that returns
> the *whole* database content? What would happen on a database with
> millions of documents? Is this feature available in RDBMS and JDBC? I
> assume that you want to "clone" something like "SELECT * FROM CAT" or
> "SHOW TABLES", am I right? If so, those commands will return you the
> tables (in our case, roughly speaking, the Collections) but never ever
> the whole data. Sorry if I'm not getting the point, but I feel a bit
> lost...
Sorry I confused (maybe even scared?) you. It is definitely not the whole database content, only the *meta* information about the structure. It is more like an RDBMS, where you can get the db schema from its system tables (e.g. 'select table_name from user_tables' for Oracle). For an RDBMS, one may only need to know the tables and fields; for an XML database, building a global meta tree would be much deeper and more expensive than that. Therefore, some rules need to be introduced, like only following down to n levels or ignoring '*/HTML/*', for example. A database with millions of documents doesn't mean the meta tree will have millions of nodes. Otherwise, whoever uses the database in that way is just treating it as a dumping ground, and no system can help with that.

> The user possibly doesn't, but we definitely do. :-) Imagine we have a
> tree like this ("/" are Collections, "*" are Resources, "<>" are nodes
> in Resources):

I like the notation.

> /
> |
> +--/USA
>     +--*Statistics
>     |     |
>     |     +--<California>
>     |           |
>     |           +--<BayArea>
>     |                 |
>     |                 +--<Temperature>
>     |
>     +--/California
>           |
>           +--*BayArea
>                 |
>                 +--<Temperature>
>
> How can we decide if Joe user wanted to know the value of the element
> <Temperature> on resource "Bayarea" contained inside the sub-collection
> "California" or if he wanted to query the USA collection for documents
> having an XPath of /California/BayArea/Temperature? Same XPath, but
> definitely different results...

That's the beauty of virtualization. By default, we return both. If you think of the XPath as actually representing the semantic meaning of the result, there is no difference at the semantic level. Also, why did people want to create or categorize those collections in the first place? Because they wanted to give some meaning to the content. Isn't that the same idea behind those XML tags? Crazy?

> > On the other hand, if user really want to be specific, they can say
> > /USA/California[system_type='collection']/...
> > where 'system_type' is the meta information.
>
> A bit clumsy but it might work, yet you would need to specify that even
> USA is a collection, so just in case I'd rather go for something like:
> /collection[name='USA']/collection[name='California']/...

You can do it, but as you pointed out, it is just not very user friendly. On the other hand, it comes in handy when you allow users to use any character in a collection name. If we consider the *meaning* rather than the physical appearance, you can just specify it as '/USA'. If you worry about introducing things like 'system_type', the current metadata proposal will introduce system-defined attribute names, e.g. 'last-modified', anyway. Of course, it is still being debated whether to wrap that meta information into the XMLObject (?) or keep it separate.

> relational database: in the end you would end up by using at most a
> handful of tables (while performing horrible and expensive JOINs)

I agree one should avoid JOINs at all costs. If one wants to build a DOM tree in an RDBMS, JOINs will be inevitable (that's why I have some reservations about eXist). The preliminary idea in my previous email is not to build the DOM tree, in order to minimize the JOINs, with the price paid by preparing those XPaths when inserting the document (kind of like an index). Of course, this may make one table particularly huge, but an RDBMS is designed to handle millions of records in any table. Also, as I said, it has a problem returning a sub-tree instead of the whole file.
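To make the idea a bit more concrete, here is a minimal sketch of what I mean, assuming a plain JDBC setup. The table names (xml_resource, xml_path) and the helper methods are made up just for illustration; this is not an actual schema proposal:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    public class XPathIndexSketch {

        // One row per document (the original XML kept whole, e.g. as a CLOB/BLOB),
        // plus one row per element path that occurs in the document.
        static final String DDL_RESOURCE =
            "CREATE TABLE xml_resource (doc_id INTEGER PRIMARY KEY, content CLOB)";
        static final String DDL_PATH =
            "CREATE TABLE xml_path (doc_id INTEGER, path VARCHAR(512))";

        // Insert time: store the document and the paths prepared from it
        // (the expensive work is paid once here, like building an index).
        static void insert(Connection con, int docId, String xml, List<String> paths)
                throws SQLException {
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO xml_resource VALUES (?, ?)")) {
                ps.setInt(1, docId);
                ps.setString(2, xml);
                ps.executeUpdate();
            }
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO xml_path VALUES (?, ?)")) {
                for (String p : paths) {   // e.g. "/California/BayArea/Temperature"
                    ps.setInt(1, docId);
                    ps.setString(2, p);
                    ps.executeUpdate();
                }
            }
        }

        // Query time: two simple lookups (path -> doc ids, doc id -> content),
        // so no JOIN is needed; the trade-off is that each hit comes back as
        // the whole original document, not the matching sub-tree.
        static List<String> query(Connection con, String path) throws SQLException {
            List<Integer> ids = new ArrayList<>();
            try (PreparedStatement ps = con.prepareStatement(
                     "SELECT doc_id FROM xml_path WHERE path = ?")) {
                ps.setString(1, path);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getInt(1));
                    }
                }
            }
            List<String> docs = new ArrayList<>();
            try (PreparedStatement ps = con.prepareStatement(
                     "SELECT content FROM xml_resource WHERE doc_id = ?")) {
                for (int id : ids) {
                    ps.setInt(1, id);
                    try (ResultSet rs = ps.executeQuery()) {
                        if (rs.next()) {
                            docs.add(rs.getString(1));
                        }
                    }
                }
            }
            return docs;
        }
    }

The xml_path table is the one that can get huge, and since the query hands back the stored document as a whole, the sub-tree problem mentioned above is still there.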
Therefore, I think it will be more suitable for situations that have fewer writes or updates but need faster queries.

> to mention the overhead for serializing XML to SQL and SQL to XML. Add
> to this the network latency and you're set with a possibly suboptimal
> setup. On the other hand, once you manage to have a tabular output you
> can use hashes, arrays and the like, so any DBM would suffice. Don't you
> think so?

First, the output is not tabular. Each record still returns the original XML file, which can be stored in a BLOB format in the database. If the 'network latency' refers to the cost associated with JDBC connections, I guess it can be ignored at this stage, if we are talking about the minute-level queries some users reported. So many optimizations have already been done by the RDBMS folks, while we would need to start by asking whether our B-Tree is even balanced. Do we really need to reinvent the wheel?

Lixin