[sword-devel] Searching and Lucene thoughts

Will Thimbleby Tue, 01 Mar 2005 14:48:10 -0800

I apologise for my ramblings, but here are some searching thoughts that I've collected as I implemented Lucene searching in MacSword:

Searching: It is more complicated than I thought, and Lucene doesn't quite do everything. Restricting a search to a document range, certainly, is something that needs to be bolted onto Lucene. In professional Bible software it gets very complicated; in Accordance you can do some insane searches. Martin might be onto something in trying to write his own: it would be fine for it to take 10x as long as Lucene if it supported everything we want. On the other hand, this is the guy writing Lucene (http://lucene.sourceforge.net/publications.html), so it might make sense to alter Lucene to our requirements.

GCJ Lucene vs CLucene vs Lucene: I tried compiling Lucene with gcj (the svn distribution of Lucene comes with a makefile that worked straight off). It weighs in at 1.3MB, but you will probably still need some part of the 5MB libgcj library. I didn't get any further with this approach; it might be a possibility, but I haven't yet built anything with it.

Troy: you asked for my code to access index order. I can give you Java code, but CLucene doesn't support it yet. There seem to be many areas where CLucene is lagging far behind Lucene. One example is sorting, which needs to be done inside Lucene for searching to be fast.

Indexes: A file storing the module version and the index method version is essential. I have changed my index structure several times, and probably will again in the future (e.g. for morphology searching). I don't store the indexes with the modules, in case the modules are loaded from a CD or are locked.
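
A minimal sketch of such a version file, in Python rather than MacSword's own code. The file name `index.stamp`, the JSON layout, and the `INDEX_FORMAT_VERSION` constant are all hypothetical choices made for illustration; the only idea taken from the paragraph above is that the index records both the module version and the index method version, and is rebuilt when either one differs.

```python
import json
from pathlib import Path

# Hypothetical constant: bump it whenever the index structure changes
# (for example when morphology fields are added).
INDEX_FORMAT_VERSION = 3

# Hypothetical file name for the version stamp kept next to the index.
STAMP_NAME = "index.stamp"


def write_stamp(index_dir, module_version):
    """Record which module version and index format the index was built from."""
    directory = Path(index_dir)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = {"module_version": module_version,
             "index_format": INDEX_FORMAT_VERSION}
    (directory / STAMP_NAME).write_text(json.dumps(stamp))


def index_is_current(index_dir, module_version):
    """Return True only if the stamp matches both versions; otherwise rebuild."""
    path = Path(index_dir) / STAMP_NAME
    if not path.exists():
        return False
    try:
        stamp = json.loads(path.read_text())
    except ValueError:
        # Unreadable stamp: treat the index as stale.
        return False
    if not isinstance(stamp, dict):
        return False
    return (stamp.get("module_version") == module_version
            and stamp.get("index_format") == INDEX_FORMAT_VERSION)
```

Because the stamp lives in the index directory rather than with the module, this still works when the module itself sits on a CD or is locked.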

Top twenty words in KJV: unto, shall, lord, he, his, all, thou, them, which, i, him, said, thy, from, god, thee, ye, shalt, children, israel

Lucene index types and indexing speed:
KJV index with the Java version of Lucene = 8'38 (3MB)
using the simple analyser = 8'02 (3.1MB)
using setMergeFactor(1000); setMaxBufferedDocs(1000) (previously called minMergeDocs) = 5'47 -- uses about 90MB of memory
Change these two parameters to control excessive file handles.

Size of index: 2.6MB, or 6.8MB when storing the verses. Note that the KJV = 5.4MB. Thus the KJV plus a plain index (5.4MB + 2.6MB = 8.0MB) is larger than the index with stored verses (6.8MB). The index with stored verses is also faster to access, but probably takes up a load of memory. ;P

Analysers: The standard analyser looks for things like email addresses and other stuff -- and last I checked Jesus didn't have an email address. The stop analyser might be better if we want to cull words like "and" and "the", but why stop the user? There are 23867 verses containing "and" in the KJV. :) The standard analyser also culls apostrophes (I don't think we want that).

Speed: Lookup is fast, but I render all the verses, which takes far longer. Note that this isn't so important now, because I only load the verses when they are displayed and then cache them, which reduces the display time to nothing.

Search for "jesus" 943 results
Search: 67ms (negligible)
Display: 21s
Display (stored in lucene): 3s

Search for "god" 3892 results
Search: 13ms
Display: 1'10s
Display (stored): 11s

Search for "god*" 4094 results
Search: 40ms
Display: 1'11s
Display (stored): 11s

Ordering of searches: The results really need to be ordered by Bible verse. Lucene's ranking means that the shortest verses always come first, e.g. "Jesus wept." is always the top verse for "jesus". IMO this doesn't make much sense to the user. My current solution is to sort by index order. Another solution is to store the keys as indexes: you can store these as strings, and Lucene can then do the sorting for you. (NB you seem to need to store them as fixed-width strings.)
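
The fixed-width requirement follows from how strings compare: character by character, not by numeric value. The Python sketch below is only an illustration of that point, not Lucene code; the helper name `verse_key` and the width of 5 digits are assumptions chosen for the example.

```python
def verse_key(index, width=5):
    """Zero-pad a verse index so that string order matches numeric order.

    The width of 5 is an assumption: it must be wide enough for the
    largest verse index in the module.
    """
    return str(index).zfill(width)


indexes = [2, 10, 31102, 945]

# Unpadded strings sort character by character, so "10" lands before "2".
unpadded = sorted(str(i) for i in indexes)

# Padded strings sort in the same order as the numbers themselves.
padded = sorted(verse_key(i) for i in indexes)

print(unpadded)  # ['10', '2', '31102', '945']
print(padded)    # ['00002', '00010', '00945', '31102']
```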

Restricting of searches: Again, another area where it is essential for speed to do the work inside Lucene. I haven't figured this one out yet, but I'm thinking I will write a custom Lucene filter. It would be much faster if I stored the verse as an index and then produced a set of numerical ranges. For searching within the previous results, you should (I've been told) simply AND the searches together. I don't support these yet, and it is probably quite some work. Retrofitting it on top of Lucene would probably only add 10s of searching time, but that is 10s on top of nothing.
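
To make the numerical-range idea concrete, here is a rough Python sketch. It is not a Lucene filter and does not use any Lucene API; it only shows the logic such a filter would apply once each verse has a numeric index. The function names `restrict` and `search_within` are hypothetical, hits are assumed to be sorted verse indexes, and the ranges are assumed to be inclusive and non-overlapping.

```python
from bisect import bisect_left, bisect_right


def restrict(hits, ranges):
    """Keep only the hits that fall inside one of the inclusive ranges.

    hits   -- sorted list of verse indexes returned by a search
    ranges -- list of (start, end) pairs, assumed not to overlap
    """
    kept = []
    for start, end in sorted(ranges):
        # Binary search finds the slice of hits inside this range.
        lo = bisect_left(hits, start)
        hi = bisect_right(hits, end)
        kept.extend(hits[lo:hi])
    return kept


def search_within(previous, current):
    """ANDing two searches: keep the current hits also found previously."""
    seen = set(previous)
    return [hit for hit in current if hit in seen]
```

For example, `restrict([3, 50, 120, 400, 900], [(100, 500)])` keeps only the hits 120 and 400, and `search_within` keeps the hits common to both searches in the order of the second one.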

Other stuff: Fuzzy searches are neat: "abraham~" finds abram and abraham; "hezikia~" finds hezekiah. Really useful for bad spellers and all those ridiculously impossible-to-spell Bible names. To highlight searches, you can get Lucene to give you a list of words for a search. You can then highlight all of these words in the verse.
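
The highlighting step can be sketched as follows, in Python and independent of Lucene. The list of matched words is assumed to have been obtained from the search already; the function name `highlight` and the bracket markers are hypothetical stand-ins for whatever markup the display actually uses.

```python
import re


def highlight(verse, terms, open_mark="[", close_mark="]"):
    """Wrap every whole-word, case-insensitive occurrence of a term in markers."""
    if not terms:
        return verse
    # Longest terms first, so a longer word is preferred over a shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: open_mark + m.group(0) + close_mark, verse)
```

With the words from the fuzzy search above, `highlight("Abram was the father; Abraham believed.", ["abram", "abraham"])` marks both names and leaves the rest of the text untouched.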

IMO people rarely want to do OR searches, so I changed the default to AND in the Lucene version used by MacSword. This means >>jesus wept<< is ANDed, >>jesus OR wept<< is ORed, and >>"jesus wept"<< is the phrase. Other than that the Lucene syntax makes sense.


cheers --Will
