I think I see the problem. I believe (if I've read the source code correctly) that in Xindice your index on <w> is essentially a map from values of <w> to document keys. Once the documents matching the index constraint have been found, an XPath selection of the desired nodes is then carried out within each of them. Since each of the 3 documents in your original query had thousands of sentences, and every sentence in each index-selected document then had to be checked to see whether it contained the word in question, this can get very expensive. If I'm not confused, Xindice is optimized for small documents without much repeating structure, and its indexing mechanism is not well suited to the type of query you're doing.
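To make the cost concrete, here's a toy model of that two-phase lookup (the names and structure are mine, not Xindice's -- this is how I picture it, not the actual implementation):

import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Phase 1 is a cheap index probe; phase 2 re-checks every sentence of
// every candidate document, which is where the time goes as soon as a
// word is at all common.
class TwoPhaseLookup {
    Map wordIndex;       // word form -> List of candidate document keys
    Map sentencesByDoc;  // document key -> List of that document's sentences

    int countHits(String word) {
        int hits = 0;
        List docKeys = (List) wordIndex.get(word);            // phase 1
        for (Iterator d = docKeys.iterator(); d.hasNext();) {
            List sentences = (List) sentencesByDoc.get(d.next());
            for (Iterator s = sentences.iterator(); s.hasNext();) {
                // stands in for evaluating the XPath predicate on each <s>
                if (((String) s.next()).indexOf(word) >= 0) {
                    hits++;
                }
            }
        }
        return hits;  // cost ~ candidate documents x sentences per document
    }
}

For a rare word, phase 2 touches only a handful of sentences; for a common word it rescans thousands of sentences per candidate document, which would explain the 3-second vs. 6-minute spread you report below.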
A possible organization of your data in Xindice that would make the query go *much* faster is to make each sentence a document, adding to the <s> elements the document key and the sentence's position within the document (perhaps the sentence id already carries this information?). Instead of a collection of documents you would have a collection of sentences. With this organization you can rapidly find the sentences containing a given word, and you can easily reassemble a document.
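For instance, a query against such a sentence collection through the XML:DB API might look like this (a sketch only: the collection name /db/sentences, the docid/n attribute names, and the word 'someword' are all made up, and I'm assuming a value index on <w>):

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.base.ResourceIterator;
import org.xmldb.api.base.ResourceSet;
import org.xmldb.api.modules.XMLResource;
import org.xmldb.api.modules.XPathQueryService;

public class SentenceQuery {
    public static void main(String[] args) throws Exception {
        // register the Xindice driver once
        Database db = (Database)
            Class.forName("org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
        DatabaseManager.registerDatabase(db);

        Collection col =
            DatabaseManager.getCollection("xmldb:xindice:///db/sentences");
        XPathQueryService service =
            (XPathQueryService) col.getService("XPathQueryService", "1.0");

        // the index answers this directly; no per-document sentence scan is left
        ResourceSet results = service.query("/s[w='someword']");
        ResourceIterator it = results.getIterator();
        while (it.hasMoreResources()) {
            XMLResource res = (XMLResource) it.nextResource();
            System.out.println(res.getContent());
        }
        col.close();
    }
}

Because each document is now one sentence, the expensive per-document XPath scan disappears, and the docid/n attributes let you put the original documents back together when you need them.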
Presumably, fancier indexing structures could be devised for large documents of the kind you have here, where the index maps values to a document key plus a unique XPath. Here's a possibility I'm inventing on the fly. Suppose all elements were internally assigned identifiers that are unique within a document. You could then use the index to map a value to pairs like (doc-id, /doc[@uid='doc1']/body[@uid='body1']/s[@uid='s63']/w[@uid='w7']) and have a byte-valued (or node-valued) index on uid. The following procedure would then rapidly find the right nodes: the XPath in the tuples retrieved from the index would be matched against the node-selection part of the query XPath, e.g. /doc/body/s. This would determine that we're supposed to find the node with uid='s63', and we could go straight to that node in the DOM, or to the right byte in a serialized representation. I don't know if anyone has plans for such indexing structures in Xindice -- they look pretty expensive to maintain once you allow updates, and they would take up a non-trivial amount of storage.

Jeff

----- Original Message -----
From: "Beni Ruef" <[EMAIL PROTECTED]>
To: <xindice-users@xml.apache.org>
Cc: "Hans Martin Lehmann" <[EMAIL PROTECTED]>; "Sebastian Hoffmann" <[EMAIL PROTECTED]>
Sent: Monday, November 11, 2002 1:44 PM
Subject: Re: Performance question (Am I doing something wrong?)

> On Sunday, November 10, 2002, at 08:45, Mark J. Stang wrote:
>
> > Not wrong. If that is what your tests show, then you are not doing
> > anything wrong. Not if your test is "how long will a command-line
> > query take?"
>
> Got your point, excuse my stupidity ;-)
>
> > Try adding in a thousand such documents and index them. My guess is
> > that the query time will increase but it will be hard to measure the
> > increase.
>
> That's correct, but there's a big but: searching for rare words (as in
> my example) is indeed rather independent of the corpus size, but
> searching for more common words is another story (the following numbers
> are with a 30 MB corpus): searching for a word occurring only twice
> takes ca. 3 seconds, searching for a word occurring 13 times takes 70
> seconds, and a word occurring 1300 times takes more than 6 minutes!
>
> This is a typical characteristic of linguistic corpora: although the
> number of words is virtually unlimited, the number of unique word forms
> is quite restricted: one of my corpora has 106'000 words but only
> 19'000 different word forms. It's even worse with word categories or
> POS tags (the 'pos' attributes in my example), as there are just 50 -
> 100 of them. Technically speaking, the values used as keys in the index
> are far from unique -- I don't know how Xindice can handle this
> problem.
>
> > The only accurate way to measure the time is to build a small
> > program and test the actual call to the database. Try using
> > Example1.java, that one has always worked well for me.
>
> I did this and it takes two seconds (measured between getCollection()
> and getIterator()) for my "rare word" query.
>
> > I had a problem with Mac OS X with every query taking forever.
> > Turned out that the database initialization was being done every
> > time and it took forever.
>
> But why, isn't the Java code identical? (Disclaimer: I know nothing
> about Java ;-)
>
> > The same query was faster on Windows and Linux.
>
> I did my benchmarks on a slow Linux machine (AMD-K6 350 MHz) and they
> take only twice as long as on my (fast :-) iBook.
>
> > I ended up caching my collections.
>
> How do you do this?
>
> > This is from an e-mail dated September 5, 2001.
> >
> > Kimbro Staken wrote:
> >
> >> computer: 750MHz P3 256MB RAM laptop running Mandrake Linux 8
> >> jdk: Sun 1.3.0_04
> >> Dataset size: 149,025 documents, 601MB
> >> Insertion time (no indexes): 1 hour 45 minutes, which is roughly
> >> 1,424 docs per minute or 24 per second.
> >> Collection size: 657MB
> >> Document retrieval: 2 seconds (including VM startup, which is most
> >> of the time)
> >> Full collection scan query /disc[id = '11041c03']: 12 minutes
> >> Index creation: 13.5 minutes
> >> Index based query /disc[id = '11041c03']: 2.12 seconds (including VM
> >> startup, which is most of that time)
>
> I'd really like such a fast response :-) but these are unique keys...
>
> Cheers
> -Beni
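P.S. Regarding the "How do you do this?" about caching collections in the quoted mail: one simple scheme (just a sketch of my own -- Mark may well do it differently, and the class name is made up) is to register the driver once and keep Collection handles in a static map, so the expensive database initialization doesn't happen on every query:

import java.util.HashMap;
import java.util.Map;
import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;

// Registers the XML:DB driver once per JVM and hands out cached
// Collection references after that, so callers share the handles
// instead of paying for initialization on every query.
public class CollectionCache {
    private static final Map cache = new HashMap();
    private static boolean registered = false;

    public static synchronized Collection get(String uri) throws Exception {
        if (!registered) {
            Database db = (Database)
                Class.forName("org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
            DatabaseManager.registerDatabase(db);
            registered = true;
        }
        Collection col = (Collection) cache.get(uri);
        if (col == null) {
            col = DatabaseManager.getCollection(uri);
            cache.put(uri, col);
        }
        return col;
    }
}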