Brendan Laing wrote:
Hi,

I've been using xindice for a few weeks now and have started to puzzle
over the following question. If I have many xml documents and store each
in its own collection I'll have a disk space problem due to the 4MB tbl
files (like a similar user who posted a mail on 2006-11-08 15:24:51
titled 'pagesize/pagecount change in beta4').

Storing one document per collection is the wrong approach. In an XML:DB database, a collection is intended to hold many documents, much as a single RDBMS table holds many records.


However if I aggregate xml documents into a collection I'm concerned
about issues such as record locking and performance. To discuss the
issue let's suppose the following:

1) I have 10k xml documents, 5k on domain xyz.com and 5k on abc.com.
Obviously each domain will have a xindice server running and therefore
a collection per domain would be required at least.

If you have two Xindice servers, they should be two separate server installations, each with its own config.xml file and its own directory for the database files.

Multiple Xindice servers must never share the same database files.

You should have either one Xindice server with multiple collections, or multiple servers (each with one or many collections, whatever suits your needs).


2) An application sits on xyz and accesses documents via the embedded
interface. Each read or update opens the collection and closes it.

You don't have to open and close the collection for each operation. A collection can be opened once, used by multiple threads, and closed on application shutdown. Opening or closing a collection in the client API does not open or close the collection in the database itself.
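
The pattern can be sketched roughly as follows. This is only an illustration: CollectionHandle is a hypothetical in-memory stand-in, not the Xindice or XML:DB client API, and the names readWithSharedHandle and "doc1" are made up for the example. The point is the lifecycle: one open, many threads, one close at shutdown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a client-side collection handle (NOT the Xindice API).
final class CollectionHandle implements AutoCloseable {
    private final Map<String, String> docs = new ConcurrentHashMap<>();
    private volatile boolean open = true;

    void put(String id, String xml) { ensureOpen(); docs.put(id, xml); }

    String get(String id) { ensureOpen(); return docs.get(id); }

    private void ensureOpen() {
        if (!open) throw new IllegalStateException("collection closed");
    }

    @Override
    public void close() { open = false; }
}

final class SharedCollectionDemo {
    // Opens the handle once, shares it with all worker threads,
    // and closes it only after every worker has finished.
    static int readWithSharedHandle(int threads, int readsPerThread) {
        AtomicInteger successfulReads = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (CollectionHandle col = new CollectionHandle()) {
            col.put("doc1", "<order id=\"1\"/>");
            for (int t = 0; t < threads; t++) {
                pool.submit(() -> {
                    for (int i = 0; i < readsPerThread; i++) {
                        if (col.get("doc1") != null) {
                            successfulReads.incrementAndGet();
                        }
                    }
                });
            }
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return successfulReads.get();
    }

    public static void main(String[] args) {
        System.out.println(readWithSharedHandle(4, 25) + " reads through one shared handle");
    }
}
```

In a real application the handle would be obtained from the database driver at startup and closed in a shutdown hook, instead of being constructed directly as in this sketch.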


If
thread a opens the collection and thread b tries to access it and
close it before a has finished will we experience locking
synchronisation issues?

No.


Or is locking at node level in the BTree?

There is no locking implemented in Xindice (one client cannot prevent another from modifying a document), but there is synchronization, which prevents data corruption when multiple threads write to the database. It is done at levels deeper than the CollectionImpl classes.
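
If the application itself needs to stop two threads from overwriting each other's changes to the same document, it has to coordinate them on its own side. A minimal sketch of one way to do that within a single JVM is below. It is an assumption about how an application might be structured, not something Xindice provides: the store map is a hypothetical stand-in for the database, and PerDocumentLockDemo, runWriters, and "counter-doc" are names invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class PerDocumentLockDemo {
    // Runs several writers that each do read-modify-write on the same document.
    // Returns the final stored value.
    static int runWriters(int threads, int updatesPerThread) {
        // Hypothetical stand-in for the database: safe for concurrent use,
        // but with no way to lock a document (last writer wins).
        Map<String, Integer> store = new ConcurrentHashMap<>();
        // Application-level locks, one per document id.
        Map<String, Object> locks = new ConcurrentHashMap<>();

        String docId = "counter-doc";
        store.put(docId, 0);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < updatesPerThread; i++) {
                    Object lock = locks.computeIfAbsent(docId, k -> new Object());
                    // Hold the lock across the whole read-modify-write so that
                    // no other thread can slip in between the read and the write.
                    synchronized (lock) {
                        int current = store.get(docId);
                        store.put(docId, current + 1);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("writers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return store.get(docId);
    }

    public static void main(String[] args) {
        System.out.println("final value: " + runWriters(4, 250));
    }
}
```

Without the synchronized block the stored value would still be a valid integer (nothing is corrupted), but some updates could be lost. Note that a lock like this only coordinates threads inside one application; it does nothing for a second client that reaches the same database over xml_rpc.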


3) The application on xyz accesses documents on abc.com over http (via
the xml_rpc interface). We naturally try to reduce network traffic and
bundle updates to improve response times (the cost of the xml_rpc
exceeds the java applications at each end). However by using a single
collection (that is continually opened and closed?) versus many
smaller collections do we incur a penalty for reading and writing a
larger document, parsing it in and out of xindice?

Lots of smaller collections will require more operating system resources (such as file descriptors). Smaller collections are also harder to query: Xindice does not implement cross-collection querying. Parsing a document takes exactly the same amount of time whether it comes from a small or a large collection.

Vadim
