Not wrong. If that is what your tests show, then you are not doing
anything wrong, at least not if your test is "how long does a
command-line query take?" The question is whether that is your entire
application and use of Xindice. Try adding a thousand such documents
and indexing them. My guess is that the query time will increase, but
the increase will be hard to measure.

Xindice uses a B-tree lookup. Even if it didn't, on your machine you
could open the file, read the entire document into memory, and run a
Java substring search in less time.
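
For scale, such a brute-force scan looks something like this on a
JDK 1.3-era VM (a minimal sketch; the file name and search string are
made up):

    import java.io.*;

    public class BruteScan {
        public static void main(String[] args) throws IOException {
            // Read the entire document into memory.
            BufferedReader in = new BufferedReader(new FileReader("disc.xml"));
            StringBuffer buf = new StringBuffer();
            String line;
            while ((line = in.readLine()) != null) {
                buf.append(line).append('\n');
            }
            in.close();

            // Time only the substring search itself.
            long start = System.currentTimeMillis();
            boolean found = buf.toString().indexOf("11041c03") >= 0;
            System.out.println("found=" + found + " in "
                + (System.currentTimeMillis() - start) + "ms");
        }
    }

On a corpus of a few hundred KBytes this finishes in milliseconds,
which is why command-line timings say very little about the database
itself.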

The only accurate way to measure the time is to build a small program
and time the actual call to the database. Try using Example1.java;
that one has always worked well for me.
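
Along the lines of Example1.java, here is a minimal sketch that times
the query call separately from driver registration and VM startup (the
collection path and query are placeholders, borrowed from the numbers
quoted below):

    import org.xmldb.api.DatabaseManager;
    import org.xmldb.api.base.Collection;
    import org.xmldb.api.base.Database;
    import org.xmldb.api.base.ResourceSet;
    import org.xmldb.api.modules.XPathQueryService;

    public class TimedQuery {
        public static void main(String[] args) throws Exception {
            // One-time setup: register the Xindice driver, open the collection.
            Database db = (Database) Class.forName(
                "org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
            DatabaseManager.registerDatabase(db);
            Collection col = DatabaseManager.getCollection(
                "xmldb:xindice:///db/myCollection");
            XPathQueryService service =
                (XPathQueryService) col.getService("XPathQueryService", "1.0");

            // Time only the query, not the setup above.
            long start = System.currentTimeMillis();
            ResourceSet results = service.query("/disc[id = '11041c03']");
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(results.getSize() + " hits in " + elapsed + "ms");

            col.close();
        }
    }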

I had a problem on Mac OS X where every query took forever. It turned
out that the database initialization was being done on every call, and
that was what took all the time. I ran the same program on a
single-processor Mac and a dual-processor Mac; one was fast, the other
slow. The same query was faster on Windows and Linux. I ended up
caching my collections, roughly as sketched below.
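
Caching the collections amounted to keeping the Collection handles
around instead of re-opening them per query. A sketch (it assumes the
Xindice driver has already been registered; the cache class and key
are my own invention):

    import java.util.HashMap;
    import java.util.Map;

    import org.xmldb.api.DatabaseManager;
    import org.xmldb.api.base.Collection;

    public class CollectionCache {
        private static Map cache = new HashMap();

        // Hand back a cached Collection so the expensive database
        // initialization happens once per path, not once per query.
        public static synchronized Collection get(String path)
                throws Exception {
            Collection col = (Collection) cache.get(path);
            if (col == null) {
                // Assumes the Xindice driver was registered at startup.
                col = DatabaseManager.getCollection("xmldb:xindice://" + path);
                cache.put(path, col);
            }
            return col;
        }
    }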

This is from an e-mail dated September 5, 2001.

Kimbro Staken wrote:

> As I've been working out some issues with the CORBA system I've been
> working on getting larger document sets into the server. My largest set
> right now is 149,025 documents in a single collection. The server can
> easily handle more documents; this is just the largest dataset I have
> available right now. Here are some stats to give us a better idea of
> where we stand. These are run against the current CVS version with one
> exception: I used OpenORB for the server ORB instead of JacORB. JacORB
> was still used for the client. It's likely we'll need to switch to
> OpenORB overall, as even the latest JacORB leaks memory on the server.
>
> computer: 750MHz P3, 256MB RAM laptop running Mandrake Linux 8
> jdk: Sun 1.3.0_04
> Dataset size: 149,025 documents 601MB
> Insertion time (no indexes): 1 hour 45 minutes, which is roughly 1,424
> docs per minute, or 24 per second.
> Collection size: 657MB
> Document retrieval: 2 seconds (including VM startup, which is most of
> the time)
> Full collection scan query /disc[id = '11041c03']: 12 minutes
> Index creation: 13.5 minutes
> Index based query /disc[id = '11041c03']: 2.12 seconds (including VM
> startup, which is most of that time)
> Index size: 164MB
>
> The data set consists of documents similar to the following.
>
> <?xml version="1.0"?>
> <disc>
> <id>11041c03</id>
> <length>1054</length>
> <title>Orchestral Manoeuvres In The Dark / The OMD Remixes (Single)</title>
> <genre>cddb/misc</genre>
> <track index="1" offset="150">Enola Gay (OMD vs Sash! Radio Edit)</track>
> <track index="2" offset="18790"> (2)Souvenir (Moby Remix)</track>
> <track index="3" offset="39790"> (3)Electricity (The Micronauts
> Remix)</track>
> </disc>
>
> Kimbro Staken
> The dbXML Group L.L.C. - http://www.dbxmlgroup.com/
> Embedded XML Database Software and Services
>

HTH,

Mark

Beni Ruef wrote:

> I just installed Xindice 1.0 on my iBook under Mac OS X (with enough
> RAM, i.e. a reasonably fast machine).
>
> The (TEI encoded) texts I'm interested in look like this:
>
>      .
>      .
>      <s n="id-1.1"><w pos="DD1">This</w> <w pos="VBZ">is</w>
>      <w pos="AT1">a</w> <w pos="NN1">sentence</w><c pos="YSTP">.</c></s>
>      <s n="id-1.2"><w pos="DD1">This</w> <w pos="VBZ">is</w>
>      <w pos="DD1">another</w> <w pos="PN1">one</w><c pos="YSTP">.</c></s>
>    </body></text>
> </TEI.2>
>
> and the simplest queries like this:
>
>      xindice xpath -c /db/myCollection -q
> '/TEI.2/text/body/s[w="another"]'
>
> With a test corpus containing three documents and a total of ca. 700
> KBytes, the above query takes ca. 9 seconds...
> More surprisingly, even a simple retrieval (xindice rd) of a 200K
> document takes 5 seconds!
>
> After having run
>
>      xindiceadmin ai -c /db/myCollection -p w -n wordform
>
> things improve slightly: the same query now takes 4.5 seconds, and a
> query for the non-existing word form "only" takes 2.5 seconds.  BTW,
> there seems to be an overhead of ca. 2 seconds, as any operation takes
> at least 2 seconds...
>
> Obviously, this is still way too slow to be usable, as I'm planning to
> work with corpora containing some 100 million words...
>
> So what am I doing (terribly ;-) wrong and what can be improved?!  What
> about this overhead and what about the indexer switches like pagesize?
>
> Thanks in advance, Cheers
> -Beni

--
Mark J Stang
System Architect
Cybershop Systems
