Not wrong. If that is what your tests show, then you are not doing anything wrong, at least not if your test is "how long does a command-line query take?" The question is whether that is your entire application and use of Xindice. Try adding a thousand such documents and indexing them; my guess is that the query time will increase, but the increase will be hard to measure.

Xindice uses a binary-tree lookup. Even if it didn't, on your machine you could open the file, read the entire document into memory, and run a Java substring search in less time. The only accurate way to measure the time is to build a small program and test the actual call to the database. Try starting from Example1.java; that one has always worked well for me. A sketch of that kind of timing program follows below.

I had a problem on Mac OS X where every query took forever. It turned out that the database initialization was being run on every query, and that was what took all the time. I ran the same program on a single-processor Mac and a dual-processor Mac; one was fast, the other slow. The same query was faster on Windows and Linux. I ended up caching my collections (also sketched below).
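Something along these lines, for example. This is not Example1.java itself, just a minimal sketch against the XML:DB API; the class name, collection URI, and query are made up for the example, and I'm quoting the Xindice 1.0 driver class from memory:

    import org.xmldb.api.DatabaseManager;
    import org.xmldb.api.base.Collection;
    import org.xmldb.api.base.Database;
    import org.xmldb.api.base.ResourceSet;
    import org.xmldb.api.modules.XPathQueryService;

    // Time one XPath query from inside the VM, so that VM startup and
    // driver registration are excluded from the measurement.
    public class QueryTimer {
        public static void main(String[] args) throws Exception {
            Database db = (Database)
                Class.forName("org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
            DatabaseManager.registerDatabase(db);

            Collection col =
                DatabaseManager.getCollection("xmldb:xindice:///db/myCollection");
            XPathQueryService service =
                (XPathQueryService) col.getService("XPathQueryService", "1.0");

            long start = System.currentTimeMillis();
            ResourceSet results = service.query("/TEI.2/text/body/s[w='another']");
            long elapsed = System.currentTimeMillis() - start;

            System.out.println(results.getSize() + " hits in " + elapsed + " ms");
            col.close();
        }
    }

Run the query a few times in the same VM: the first call pays the collection-open cost, the later calls show the real query time.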
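And by "caching my collections" I mean something like the following sketch (the helper class is mine, not part of Xindice): open each collection once and hand out the same reference afterwards, so the expensive initialization only happens on first access.

    import java.util.HashMap;
    import java.util.Map;
    import org.xmldb.api.DatabaseManager;
    import org.xmldb.api.base.Collection;

    // Open each collection once and reuse it; the expensive database
    // initialization then only happens on the first access.
    public class CollectionCache {
        private static final Map cache = new HashMap();

        public static synchronized Collection get(String uri) throws Exception {
            Collection col = (Collection) cache.get(uri);
            if (col == null) {
                col = DatabaseManager.getCollection(uri);
                cache.put(uri, col);
            }
            return col;
        }
    }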
This is from an e-mail dated September 5, 2001. Kimbro Staken wrote:

> As I've been working out some issues with the CORBA system I've been
> working on getting larger document sets into the server. My largest set
> right now is 149,025 documents in a single collection. The server can
> easily handle more documents; this is just the largest dataset I have
> available right now. Here are some stats to give us a better idea where
> we stand. These are run against the current CVS version with one
> exception: I used OpenORB for the server ORB instead of JacORB. JacORB
> was still used for the client. It's likely we'll need to switch to
> OpenORB overall, as even the latest JacORB leaks memory on the server.
>
> computer: 750MHz P3, 256MB RAM laptop running Mandrake Linux 8
> jdk: Sun 1.3.0_04
> Dataset size: 149,025 documents, 601MB
> Insertion time (no indexes): 1 hour 45 minutes, which is roughly 1,424
> docs per minute or 24 per second
> Collection size: 657MB
> Document retrieval: 2 seconds (including VM startup, which is most of
> the time)
> Full collection scan query /disc[id = '11041c03']: 12 minutes
> Index creation: 13.5 minutes
> Index-based query /disc[id = '11041c03']: 2.12 seconds (including VM
> startup, which is most of that time)
> Index size: 164MB
>
> The data set consists of documents similar to the following.
>
> <?xml version="1.0"?>
> <disc>
> <id>11041c03</id>
> <length>1054</length>
> <title>Orchestral Manoeuvres In The Dark / The OMD Remixes (Single)</title>
> <genre>cddb/misc</genre>
> <track index="1" offset="150">Enola Gay (OMD vs Sash! Radio Edit)</track>
> <track index="2" offset="18790"> (2)Souvenir (Moby Remix)</track>
> <track index="3" offset="39790"> (3)Electricity (The Micronauts Remix)</track>
> </disc>
>
> Kimbro Staken
> The dbXML Group L.L.C. - http://www.dbxmlgroup.com/
> Embedded XML Database Software and Services

HTH,
Mark

Beni Ruef wrote:
> I just installed Xindice 1.0 on my iBook under Mac OS X (with enough
> RAM, i.e. a reasonably fast machine).
>
> The (TEI encoded) texts I'm interested in look like this:
>
> .
> .
> <s n="id-1.1"><w pos="DD1">This</w> <w pos="VBZ">is</w> <w
> pos="AT1">a</w><w pos="NN1">sentence</w><c pos="YSTP">.</c></s>
> <s n="id-1.2"><w pos="DD1">This</w> <w pos="VBZ">is</w> <w
> pos="DD1">another</w><w pos="PN1">one</w><c pos="YSTP">.</c></s>
> </body></text>
> </TEI.2>
>
> and the simplest queries like this:
>
> xindice xpath -c /db/myCollection -q '/TEI.2/text/body/s[w="another"]'
>
> With a test corpus containing three documents and a total of ca. 700
> KBytes, the above query takes ca. 9 seconds...
> More surprisingly, even a simple retrieval (xindice rd) of a 200K
> document needs 5 seconds!
>
> After having run
>
> xindiceadmin ai -c /db/myCollection -p w -n wordform
>
> things improve slightly: the same query now takes 4.5 seconds, and a
> query for a non-existing word form only 2.5 seconds. BTW, there seems
> to be an overhead of ca. 2 seconds, as any operation takes at least
> 2 seconds...
>
> Obviously, this is still way too slow to be usable, as I'm planning to
> work with corpora containing some 100 million words...
>
> So what am I doing (terribly ;-) wrong and what can be improved?! What
> about this overhead and what about the indexer switches like pagesize?
>
> Thanks in advance, cheers
> -Beni

--
Mark J Stang
System Architect
Cybershop Systems