Vadim Gritsenko wrote:

Honglin Ye wrote:

Vadim Gritsenko wrote:

Honglin Ye wrote:

Partitioning the database into smaller collections may improve
performance.



I'm curious, why do you think so? Is there any reason or observation for this?



I am curious too. Suppose we have 4 groups, and each group adds 1000 files every month.
Sometimes we only need to search files from one group, and sometimes we only need to
search a particular month. Would it be easier if we separated the files by group
and month so that we can work with smaller collections?



Hm, I guess so. Certainly, a search over 1000 files will be faster than a search over 1000*4*months files. During the search, Xindice will not go through each file if you are using indexes, but still, I would expect a search over a smaller collection / smaller indexes to be faster.



Or might it make little difference in searching?



Suppose you have everything in one collection. Then you'll need at least three indexes: one for the group, one for the month, and another on the element/attribute you are searching for. First, the month index will return 4000 documents, then the group index will return 1000*months documents, and then the third index will be used. The intersection of the results returned by the three indexes gives you the resulting set of documents. So Xindice will take some time reading / searching the indexes and building the intersection.


OTOH, if you have a separate collection for each group and month, you will not need the first two indexes, so this will make the search faster.
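
The comparison above can be sketched roughly as follows. This is not Xindice code: plain Python sets stand in for index lookups, and the group names, month labels, and the "type" field are made up for illustration. It only shows that both layouts return the same documents, while the single-collection layout has to read and intersect two large extra result sets first.

```python
# Illustrative only: Python sets stand in for Xindice index lookups.
# Group names, month labels and the "type" field are hypothetical.
from itertools import product

GROUPS = ["g1", "g2", "g3", "g4"]
MONTHS = ["2004-01", "2004-02", "2004-03"]
DOCS_PER_GROUP_PER_MONTH = 1000


def build_docs():
    """Create 1000 documents per (group, month) pair, keyed by a numeric ID."""
    docs = {}
    doc_id = 0
    for group, month in product(GROUPS, MONTHS):
        for i in range(DOCS_PER_GROUP_PER_MONTH):
            doc_type = "A" if i % 10 == 0 else "B"
            docs[doc_id] = {"group": group, "month": month, "type": doc_type}
            doc_id += 1
    return docs


def build_index(docs, field):
    """Map each value of `field` to the set of document IDs that carry it."""
    index = {}
    for doc_id, doc in docs.items():
        index.setdefault(doc[field], set()).add(doc_id)
    return index


def search_single_collection(indexes, group, month, doc_type):
    """One big collection: intersect the results of three index lookups."""
    by_month = indexes["month"][month]   # 4 groups * 1000 documents
    by_group = indexes["group"][group]   # 1000 * months documents
    by_type = indexes["type"][doc_type]
    return by_month & by_group & by_type


def partition(docs):
    """Split the documents into one small collection per (group, month)."""
    parts = {}
    for doc_id, doc in docs.items():
        parts.setdefault((doc["group"], doc["month"]), {})[doc_id] = doc
    return parts


def search_partitioned(parts, group, month, doc_type):
    """Partitioned layout: pick the collection, then use only one index."""
    collection = parts[(group, month)]
    return build_index(collection, "type")[doc_type]
```

In the partitioned case the group and month are resolved by choosing the collection, so only the index on the searched field is consulted.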


How about updates? When we update a resource, will
Xindice directly modify that resource in the .tbl file and leave the other parts untouched?



Getting a document by ID, setting a document by ID, and performing an XUpdate on a document are all fast operations. Xindice uses a BTree to store the key -> document association, where each node in the BTree stores (4096 / KeySize) keys. So, for keys 128 bytes in size (32 keys per 4096-byte page), a lookup takes about log32(N) node reads, where N is the number of documents in the collection.
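
As a back-of-the-envelope check of the log32(N) estimate, here is a small sketch. It assumes every node is completely full of keys and ignores pointers and page headers, so it is a lower bound on depth and not a description of Xindice's actual on-disk layout; the function names are made up for this example.

```python
# Rough estimate only: assumes full nodes and ignores per-page overhead.
PAGE_SIZE = 4096  # bytes per page, as in the message above


def keys_per_node(key_size, page_size=PAGE_SIZE):
    """Number of keys that fit into one page (4096 / KeySize)."""
    return page_size // key_size


def btree_levels(n_documents, key_size):
    """Smallest number of levels whose capacity covers n_documents keys."""
    fanout = keys_per_node(key_size)
    levels = 1
    capacity = fanout
    while capacity < n_documents:
        capacity *= fanout
        levels += 1
    return levels


# 128-byte keys give a fan-out of 32, so depth grows very slowly with N.
for n in (1_000, 48_000, 1_000_000):
    print(n, "documents ->", btree_levels(n, 128), "levels")
```

Under these assumptions, going from a 1000-document collection to a million documents adds only two levels, which is why lookups by ID stay fast either way.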


Xindice stores documents in a paged file, so when you update a document, only the pages containing that document will be updated. The rest of the .tbl file will not be touched.


Partitioning is a useful strategy in an RDBMS; is there anything similar in Xindice?



If you are planning to have a large database (> 2 GB), then you'll have to partition due to limits on file size (and these limits vary across operating systems / file systems).


Either way, let us know how many documents you were able to store and how fast it worked ;-)

Vadim




Vadim,
     Whole-document retrieve - modify - store is mostly what I need.
I also have a more demanding query requirement. I am building a proposal
submission and handling system for radio telescope systems and am required
to keep the docs as XML.
It should be searchable over proposer names, proposal titles, observation
types, telescopes, configurations, target sources, frequencies,
abstract text, etc. From my understanding, there is no performance penalty
for having more indexes. It also looks like the indexes are updated
automatically as new proposals are added; is that true?
      Do you have pointers to systems like the one I am building that use
XML as storage? Do you have suggestions on how I should proceed?

Honglin



