Hi
 
I have done a bit of such testing. I populated a collection with very small documents; an example document is attached. The performance deteriorates heavily once we have more than 3000 such documents, and Xindice throws an out-of-memory exception on my machine.
 
The database works fairly well up to 1000 documents. There are no problems retrieving a document if I know its id; retrieval takes less than 50 milliseconds. A query that returns the entire contents of the collection (e.g. //story) takes approximately 6 seconds, and the time for such a query doubles to approximately 12 seconds for a collection with 200 stories, which is a pain.
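For concreteness, this is roughly how I drive both operations through the XML:DB API. It is only a minimal sketch: the collection name ("stories"), the document id ("story-0042"), and the xmldb URI are placeholders for whatever your setup uses, and you may need a host and port in the URI depending on how the server is run.

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.base.ResourceSet;
import org.xmldb.api.modules.XMLResource;
import org.xmldb.api.modules.XPathQueryService;

public class XindiceTiming {
    public static void main(String[] args) throws Exception {
        // Register the Xindice driver with the XML:DB API.
        Database driver = (Database)
            Class.forName("org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
        DatabaseManager.registerDatabase(driver);

        // "stories" and "story-0042" below are placeholder names.
        Collection col = DatabaseManager.getCollection("xmldb:xindice:///db/stories");

        // Retrieval by known id: this stays fast (under 50 ms in my tests).
        long t0 = System.currentTimeMillis();
        XMLResource doc = (XMLResource) col.getResource("story-0042");
        System.out.println("by id: " + (System.currentTimeMillis() - t0) + " ms"
                           + (doc == null ? " (not found)" : ""));

        // XPath over the whole collection: this is what degrades with size.
        XPathQueryService svc =
            (XPathQueryService) col.getService("XPathQueryService", "1.0");
        t0 = System.currentTimeMillis();
        ResourceSet hits = svc.query("//story");
        System.out.println("//story: " + hits.getSize() + " hits in "
                           + (System.currentTimeMillis() - t0) + " ms");

        col.close();
    }
}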
 
 
We are now taking the approach of splitting our collections into small sub-collections; I have yet to test whether that will be effective.
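The idea is to hash each document's id into one of a fixed number of sub-collections, so that no single collection grows past the point where queries slow down. A rough sketch of what I have in mind, using the XML:DB CollectionManagementService (the bucket count, collection names, and URI are made up for illustration):

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.modules.CollectionManagementService;
import org.xmldb.api.modules.XMLResource;

public class ShardedStore {
    private static final int BUCKETS = 16; // illustrative bucket count

    public static void main(String[] args) throws Exception {
        Database driver = (Database)
            Class.forName("org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
        DatabaseManager.registerDatabase(driver);

        Collection root = DatabaseManager.getCollection("xmldb:xindice:///db/stories");

        // One-time setup: create the sub-collections under the root.
        CollectionManagementService mgmt = (CollectionManagementService)
            root.getService("CollectionManagementService", "1.0");
        for (int i = 0; i < BUCKETS; i++) {
            mgmt.createCollection("bucket" + i);
        }
        root.close();
    }

    // Route each document to a sub-collection by hashing its id, so that
    // no single collection holds more than a fraction of the documents.
    static void store(Collection root, String id, String xml) throws Exception {
        int bucket = (id.hashCode() & 0x7fffffff) % BUCKETS;
        Collection sub = root.getChildCollection("bucket" + bucket);
        XMLResource res = (XMLResource) sub.createResource(id, "XMLResource");
        res.setContent(xml);
        sub.storeResource(res);
    }
}

The obvious trade-off is that a collection-wide query like //story then has to be issued once per sub-collection and the results merged by hand.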
 
cheers
Atul Singh
 
 
<?xml version="1.0" encoding="UTF-8"?>
<story>
  <submitter>
    <name></name>
    <email></email>
    <date></date>
  </submitter>
  <pm>
    <name>not assigned</name>
    <date>not assigned</date>
    <comments>none</comments>
  </pm>
  <engineering>
    <team></team>
    <cost></cost>
    <one></one>
    <two></two>
  </engineering>
  <reviewer></reviewer>
  <status></status>
  <storyname></storyname>
  <details></details>
  <usecase></usecase>
  <notes></notes>
  <customer></customer>
  <theme></theme>
  <docImpact>false</docImpact>
  <businessValue></businessValue>
  <project>None</project>
  <queue></queue>
  <iteration></iteration>
  <priority></priority>
  <estimate>not assigned</estimate>
  <inform>false</inform>
  <dateactive></dateactive>
  <dateclosed></dateclosed>
</story>
-----Original Message-----
From: Gudmundur Arni Thorisson [mailto:[EMAIL PROTECTED]]
Sent: Friday, December 20, 2002 3:12 PM
To: [email protected]
Subject: Xindice scalability: using in a large bio

Hey, folks. This is my first message to this mailing list. I work in a non-profit biological research lab (Cold Spring Harbor Laboratory, http://www.cshl.org). Our lab recently became part of a rather large international collaboration, the Haplotype Map project, to produce enormous amounts of biological data (genotypes). Our role will be to synchronize data handling and build the database to hold the stuff (see http://www.genome.gov/page.cfm?pageID=10001688 if you're interested in this thing), plus related tasks.

We'd originally planned to use a fairly XML-centric approach from the ground up, for a multitude of reasons. One very strong reason was that we could use the allegedly powerful XML capabilities of Oracle 9i XMLDB to produce an XML-relational schema from our own data handling/exchange XML Schema definitions.

To cut this story short, we got our funding cut down quite a bit and will now not be able to afford the big-buck Oracle licenses we'd need (3 x $15,000). Open source is now pretty much the only option for us database-wise, which is in fact a blessing in disguise because it will make the endeavour entirely open source (Oracle would have been the only proprietary component otherwise).


As of now, we are investigating open source alternatives in either the XML-on-top-of-RDBMS arena or the native XML database arena. There are some commercial offerings (Tamino, Ipedo), but as I said above, the preference is always open source. Our lab is very fond of the Apache/mod_perl world (I work for Dr. Lincoln Stein, a longtime guru in the Perl world) and we'll likely use some of the Apache XML project components (Xalan, Xerces) for XML processing in this project.

As a part of this investigation, Xindice appeared on the horizon. After looking at some simple command-line examples of how one goes about handling XML documents and collections in Xindice, it looks like we just might be able to use the thing. There is only one thing we have concerns about: scalability. The Xindice website says the db is designed for many small documents. The XML dataset that we will be handling will contain fairly small documents but VERY many of them; up to 400 million instances of the most populous record class.

My question is therefore this: has anyone used or tested Xindice with datasets of this size (hundreds of millions of documents) and still seen decent performance? This will be mainly import and query work, with hardly any heavy update load, if that makes a difference as far as performance goes.



Thanks in advance for your reply. Regards,


        Mummi, Cold Spring Harbor Laboratory

P.S. I have attached key XML Schema draft component files for the portion of our total schema that has mostly been nailed down so far, plus one (hapmap.xsd) that ties them all together. There will be maybe 2 or 3 times this many types of objects in the total schema. The file genotype.xsd defines the record class with up to 400M instances.

