Greets,

A week ago, a revamped indexer based on my Perl search engine library, KinoSearch, successfully built a Lucene-compatible index. The corpus was 1,000 documents from Wikipedia.

Better yet, it did so in a reasonable amount of time:

Time to index 1000 docs on my G4 laptop
=======================================
Plucene 1.25                   270 secs
KinoSearch 0.05_02              20 secs
Java Lucene                      9 secs

There are a number of fundamental architectural differences between KinoSearch and Lucene, and by extension between KinoSearch and Plucene, which is largely a faithful port of Java Lucene. The most important of these is the merge model, which I plan to address in a separate post. Briefly: Lucene builds a miniature inverted index for each document, then merges those into ever larger indexes on a schedule determined by mergeFactor. KinoSearch builds indexes one segment at a time, and no coherent inverted index smaller than a segment ever exists.
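To make the contrast concrete, here's a rough sketch of the two merge models. This is illustrative only -- the function names are invented, the code is Python rather than Perl for brevity, and real Lucene/KinoSearch postings are far more elaborate than a term-to-doc-ids dict:

```python
# Illustrative sketch (invented names) of the two merge models.

def invert(doc_id, doc):
    """Build a tiny one-document inverted index."""
    return {term: [doc_id] for term in doc.split()}

def merge(indexes):
    """Merge several inverted indexes into one."""
    merged = {}
    for idx in indexes:
        for term, ids in idx.items():
            merged.setdefault(term, []).extend(ids)
    return merged

def lucene_style_index(docs, merge_factor=10):
    """One mini inverted index per document, merged on a schedule
    determined by merge_factor."""
    segments = []
    for doc_id, doc in enumerate(docs):
        segments.append(invert(doc_id, doc))     # per-doc mini index
        while len(segments) >= merge_factor:     # merge when enough pile up
            merged = merge(segments[:merge_factor])
            segments = [merged] + segments[merge_factor:]
    return merge(segments)

def kinosearch_style_index(docs):
    """Accumulate postings for a whole segment directly; no coherent
    inverted index smaller than a segment ever exists."""
    postings = {}
    for doc_id, doc in enumerate(docs):
        for term in doc.split():
            postings.setdefault(term, []).append(doc_id)
    return postings
```

Both routes end at the same postings; the difference is how much per-document object churn happens along the way.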

Two other important differences:

1) KinoSearch requires that all fields be defined in advance when creating a segment. The Documents you add may not contain undeclared fields, and you cannot update the definition of a field once it is set. Segments with differing field defs can be reconciled -- you just can't change a def in the middle of creating a segment. Additionally, KinoSearch will not merge fields which share a fieldname -- it will overwrite. Insisting on rigid field definitions means that the KinoSearch equivalents of FieldInfos, DocumentWriter, FieldsWriter, FieldInfosWriter, TermInfosWriter and such can all be instantiated once per segment, rather than once per document; in Perl, with its comparatively sluggish OO implementation, that adds up.
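The contract looks roughly like this. Again, a hypothetical sketch in Python, not the actual KinoSearch API -- the class and method names here are invented:

```python
# Hypothetical sketch of the "declare fields up front" contract.

class SegmentWriter:
    def __init__(self, field_defs):
        # Because the field roster is fixed, all per-field writer setup
        # can happen once per segment -- not once per document.
        self.field_defs = dict(field_defs)   # e.g. {"title": "text"}
        self.docs = []

    def add_doc(self, doc):
        # Reject any document carrying a field that wasn't declared.
        for field in doc:
            if field not in self.field_defs:
                raise ValueError("undeclared field: %s" % field)
        self.docs.append(doc)

writer = SegmentWriter({"title": "text", "body": "text"})
writer.add_doc({"title": "Hello", "body": "World"})   # fine
# writer.add_doc({"author": "X"})                     # would raise ValueError
```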

2) Analyzers in KinoSearch deal with batches of tokens rather than streams. The concept of a TokenStream simply does not translate into efficient Perl.
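The difference, sketched in Python with invented names (KinoSearch itself is Perl, and its analyzers work on Perl data structures): a stream-style analyzer pays a method call per token, while a batch-style analyzer transforms a whole array in one call, which maps onto cheap list operations in Perl.

```python
# Illustrative contrast (invented names): stream vs. batch analysis.

class StreamLowerCaser:
    """TokenStream style: one method dispatch per token."""
    def __init__(self, tokens):
        self.it = iter(tokens)

    def next(self):
        try:
            return next(self.it).lower()
        except StopIteration:
            return None            # stream exhausted

def batch_lower_case(tokens):
    """Batch style: one call processes the whole token array."""
    return [t.lower() for t in tokens]
```

With thousands of tokens per document, trading a method call per token for a single pass over a list is exactly the kind of constant factor that matters in Perl.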

It may be possible to squeeze more speed out of this indexer, but that's no longer the top priority. The top priority now is to adapt KinoSearch's search modules to work with the Lucene file format. After that, the goal will be to implement a limited, maintainably small subset of Lucene's functionality. For instance, I only plan to support composite indexes written by Lucene 1.9 (or whatever version starts writing valid UTF-8) or later.

The code is still a little messy, but if you'd like to snoop it, you have the option of either a tarball or viewcvs from here:

http://www.rectangular.com/kinosearch/

I'll be directing attention towards one particular section of code in my post on merge models.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
