Greets,

A week ago, a revamped indexer based on my Perl search engine library, KinoSearch, successfully built a Lucene-compatible index. The corpus was 1,000 documents from Wikipedia.

Better yet, it did so in a reasonable amount of time:

Time to index 1000 docs on my G4 laptop
=======================================
Plucene 1.25                   270 secs
KinoSearch 0.05_02              20 secs
Java Lucene                      9 secs

There are a number of fundamental architectural differences between KinoSearch and Lucene, and by extension between KinoSearch and Plucene, which is largely a faithful port of Java Lucene. The most important of these is the merge model, which I plan to address in a separate post. Briefly: Lucene builds a miniature inverted index for each document, then merges those into ever larger indexes on a schedule determined by mergeFactor. KinoSearch builds indexes one segment at a time, and no coherent inverted index smaller than a segment ever exists.
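To make the contrast concrete, here's a rough sketch of the two merge models. This is illustrative only -- the function names are invented, the code is Python rather than Perl for brevity, and real Lucene/KinoSearch postings are far more elaborate than a term-to-doc-ids dict:

```python
# Illustrative sketch (invented names) of the two merge models.

def invert(doc_id, doc):
    """Build a tiny one-document inverted index."""
    return {term: [doc_id] for term in doc.split()}

def merge(indexes):
    """Merge several inverted indexes into one."""
    merged = {}
    for idx in indexes:
        for term, ids in idx.items():
            merged.setdefault(term, []).extend(ids)
    return merged

def lucene_style_index(docs, merge_factor=10):
    """One mini inverted index per document, merged on a schedule
    determined by merge_factor."""
    segments = []
    for doc_id, doc in enumerate(docs):
        segments.append(invert(doc_id, doc))     # per-doc mini index
        while len(segments) >= merge_factor:     # merge when enough pile up
            merged = merge(segments[:merge_factor])
            segments = [merged] + segments[merge_factor:]
    return merge(segments)

def kinosearch_style_index(docs):
    """Accumulate postings for a whole segment directly; no coherent
    inverted index smaller than a segment ever exists."""
    postings = {}
    for doc_id, doc in enumerate(docs):
        for term in doc.split():
            postings.setdefault(term, []).append(doc_id)
    return postings
```

Both routes end at the same postings; the difference is how much per-document object churn happens along the way.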

Two other important differences:

1) KinoSearch requires that all fields be defined in advance when creating a segment. The Documents you add may not contain undeclared fields, and you cannot update the definition of a field once it is set. Segments with differing field defs can be reconciled -- you just can't change a def in the middle of creating a segment. Additionally, KinoSearch will not merge fields which share a fieldname -- it will overwrite. Insisting on rigid field definitions means that the KinoSearch equivalents of FieldInfos, DocumentWriter, FieldsWriter, FieldInfosWriter, TermInfosWriter and such can all be instantiated once per segment, rather than once per document; in Perl, with its comparatively sluggish OO implementation, that adds up.
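The contract looks roughly like this. Again, a hypothetical sketch in Python, not the actual KinoSearch API -- the class and method names here are invented:

```python
# Hypothetical sketch of the "declare fields up front" contract.

class SegmentWriter:
    def __init__(self, field_defs):
        # Because the field roster is fixed, all per-field writer setup
        # can happen once per segment -- not once per document.
        self.field_defs = dict(field_defs)   # e.g. {"title": "text"}
        self.docs = []

    def add_doc(self, doc):
        # Reject any document carrying a field that wasn't declared.
        for field in doc:
            if field not in self.field_defs:
                raise ValueError("undeclared field: %s" % field)
        self.docs.append(doc)

writer = SegmentWriter({"title": "text", "body": "text"})
writer.add_doc({"title": "Hello", "body": "World"})   # fine
# writer.add_doc({"author": "X"})                     # would raise ValueError
```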

2) Analyzers in KinoSearch deal with batches of tokens rather than streams. The concept of a TokenStream simply does not translate into efficient Perl.
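The difference, sketched in Python with invented names (KinoSearch itself is Perl, and its analyzers work on Perl data structures): a stream-style analyzer pays a method call per token, while a batch-style analyzer transforms a whole array in one call, which maps onto cheap list operations in Perl.

```python
# Illustrative contrast (invented names): stream vs. batch analysis.

class StreamLowerCaser:
    """TokenStream style: one method dispatch per token."""
    def __init__(self, tokens):
        self.it = iter(tokens)

    def next(self):
        try:
            return next(self.it).lower()
        except StopIteration:
            return None            # stream exhausted

def batch_lower_case(tokens):
    """Batch style: one call processes the whole token array."""
    return [t.lower() for t in tokens]
```

With thousands of tokens per document, trading a method call per token for a single pass over a list is exactly the kind of constant factor that matters in Perl.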

It may be possible to squeeze more speed out of this indexer, but that's no longer the top priority. The top priority now is to adapt KinoSearch's search modules to work with the Lucene file format. After that, the goal will be to implement a limited, maintainably small subset of Lucene's functionality. For instance, I only plan to support composite indexes written by Lucene 1.9 (or whatever version starts writing valid UTF-8) or later.

The code is still a little messy, but if you'd like to snoop it, you have the option of either a tarball or viewcvs from here:

http://www.rectangular.com/kinosearch/

I'll be directing attention towards one particular section of code in my post on merge models.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
