Will Thimbleby wrote:
I apologise for my ramblings, but here are some thoughts on searching that I've collected as I implemented lucene searching in MacSword:
Can we enumerate what Lucene does not support that we want for Biblical searching?
Searching:
It is more complicated than I thought, and lucene doesn't quite do everything. Certainly, searching a document range is something that needs to be bolted onto lucene. In pro bible software it gets very complicated; in Accordance you can do some insane searches. Martin might be onto something in trying to write his own: it would be fine for it to take 10x as long as lucene if it supported everything we want. But on the other hand, this is the guy writing lucene (http://lucene.sourceforge.net/publications.html), so it might make sense to alter lucene to our requirements.
The only thing I saw was that it did not find adjacent documents. For example, find all verses containing Moses within 5 verses of Aaron.
As long as we build the index from first verse to last verse, the index that lucene returns is the number lucene assigned to the verse when the verse was added. We cannot reliably use this to figure out which verse is returned (e.g. 3 may or may not mean Genesis 1:3; in an NT-only module it would mean Matthew 1:3). For this reason we have stored the OSIS reference in the index along with the verse. However, we can be certain (because lucene guarantees it) that index 25 and index 26 are two verses that were added one after the other.
To do proximity searching, we probably have to parse the search request for a special w/in conjunction, run each part as a separate query and, via post-processing, put the results together.
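To make the post-processing idea concrete, here is a rough sketch, assuming each sub-query has already been run and its hits collected into a BitSet keyed by the sequential index number. The class and method names are made up for illustration, and the sketch ignores book and chapter boundaries.

```java
import java.util.BitSet;

class ProximitySketch {
    /**
     * Returns the hits in `left` that lie within `distance` index positions
     * of at least one hit in `right`. Both sets are keyed by the sequential
     * number each verse received when it was added to the index.
     */
    static BitSet within(BitSet left, BitSet right, int distance) {
        BitSet result = new BitSet();
        for (int i = left.nextSetBit(0); i >= 0; i = left.nextSetBit(i + 1)) {
            // First hit in `right` at or after the start of the window.
            int from = Math.max(0, i - distance);
            int next = right.nextSetBit(from);
            if (next >= 0 && next <= i + distance) {
                result.set(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet moses = new BitSet();
        moses.set(10); moses.set(40); moses.set(100);
        BitSet aaron = new BitSet();
        aaron.set(14); aaron.set(46);
        // Only index 10 has an "aaron" hit within 5 positions.
        System.out.println(within(moses, aaron, 5)); // prints {10}
    }
}
```

This only works because consecutive index numbers are guaranteed to be consecutively added verses. The surviving index numbers would still have to be mapped back to the stored OSIS references.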
Has anyone thought of another way?
Troy: you asked for my code to access index order. I can give you Java code, but clucene doesn't support it yet. There seem to be many areas where clucene is lagging far behind lucene, for example sorting, which is essential to do in lucene for fast searching.
I would be interested in the Java code, if you don't mind.
Indexes:
A file storing the module version and the index method version is essential. I have changed my index structure several times, and probably will do in the future (eg. for morphology searching). I don't store the indexes with the modules in case the modules are loaded from a CD or locked.
Can you send me your code that builds the index as well?
I agree that it probably would be best to store the index separate from the module.
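As an illustration of the version file idea, here is a minimal sketch using java.util.Properties. The property names, the version counter and the class name are all hypothetical; neither MacSword nor JSword is claimed to store this information this way.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Properties;

class IndexVersionSketch {
    // Hypothetical counter, bumped whenever the index structure changes.
    static final int INDEX_METHOD_VERSION = 3;

    /** Builds the metadata that would be written next to a fresh index. */
    static Properties describe(String moduleVersion) {
        Properties p = new Properties();
        p.setProperty("module.version", moduleVersion);
        p.setProperty("index.method.version", Integer.toString(INDEX_METHOD_VERSION));
        return p;
    }

    /** True if the stored metadata does not match the module or the current index method. */
    static boolean needsRebuild(Properties stored, String moduleVersion) {
        return !moduleVersion.equals(stored.getProperty("module.version"))
            || !Integer.toString(INDEX_METHOD_VERSION)
                   .equals(stored.getProperty("index.method.version"));
    }

    public static void main(String[] args) throws IOException {
        // Round-trip through text, as a real file on disk would.
        StringWriter out = new StringWriter();
        describe("1.2").store(out, "index metadata");
        Properties stored = new Properties();
        stored.load(new StringReader(out.toString()));
        System.out.println(needsRebuild(stored, "1.2")); // false
        System.out.println(needsRebuild(stored, "1.3")); // true
    }
}
```

Keeping this file with the index rather than with the module fits the case where the module sits on a CD or is locked.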
<snip/>
Analysers:
For JSword, we have been using the Standard Analyzer, but after your comments I took a look at the lucene code and I think that you are right. It increases the size of the index by a meg, but that is not that big a deal. I think that it will reduce the CPU usage as well. Time to do some more experimenting....
The standard analyser looks for things like emails and other stuff -- and last I checked Jesus didn't have an email address. The stop analyser might be better if we want to cull words like "and" and "the", but why stop the user? There are 23867 verses containing "and" in the KJV. :) The standard analyser also culls apostrophes (I don't think we want that).
<snip/>
Ordering of searches:
The results really need to be ordered by bible verse. Lucene's ranking means that the shortest verses always come first, e.g. "Jesus wept." is always the top verse for "jesus". IMO this doesn't make much sense to the user.
My current solution is to sort by index order. Another solution is to store keys as indexes: you can store these as strings, and lucene can then do the sorting for you. (NB you seem to need to store them as fixed-width strings.)
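The fixed-width requirement makes sense if the stored keys are compared as strings, since unpadded numbers do not sort numerically that way. A small sketch of the effect, with a hypothetical key() helper and a width of five digits chosen here only because it is enough for every verse ordinal in a whole Bible:

```java
import java.util.Arrays;
import java.util.Locale;

class FixedWidthKeySketch {
    /** Zero-pads a verse ordinal to five digits so string order equals numeric order. */
    static String key(int ordinal) {
        return String.format(Locale.ROOT, "%05d", ordinal);
    }

    public static void main(String[] args) {
        String[] plain = {"9", "10", "100"};
        Arrays.sort(plain);
        System.out.println(Arrays.toString(plain));  // [10, 100, 9]

        String[] padded = {key(9), key(10), key(100)};
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded)); // [00009, 00010, 00100]
    }
}
```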
Just a suggestion (which we use in JSword), use a BitSet to store the hits. It takes 31102 bits to represent the entire bible. This comes to 3.8K. The bitset is implicitly ordered. Java allows pretty efficient iterating over the set.
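A minimal sketch of the BitSet approach, assuming each hit has already been translated into a verse ordinal between 0 and 31101. The class and method names are illustrative and this is not the actual JSword code.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class VerseHitsSketch {
    static final int VERSE_COUNT = 31102; // figure quoted above

    /** Records hits in whatever order the search engine returns them. */
    static BitSet collect(int[] hitOrdinals) {
        BitSet hits = new BitSet(VERSE_COUNT);
        for (int ordinal : hitOrdinals) {
            hits.set(ordinal);
        }
        return hits;
    }

    /** Walks the set bits; they always come back in ascending (canonical) order. */
    static List<Integer> inCanonicalOrder(BitSet hits) {
        List<Integer> ordered = new ArrayList<Integer>();
        for (int i = hits.nextSetBit(0); i >= 0; i = hits.nextSetBit(i + 1)) {
            ordered.add(i);
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Hits arrive in ranking order, not verse order.
        BitSet hits = collect(new int[] {26500, 3, 1200});
        System.out.println(inCanonicalOrder(hits)); // [3, 1200, 26500]
    }
}
```

No explicit sort is needed, because the ordering falls out of the representation.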
I think there is room for two different kinds of searches:
1) Find verses that match the criteria that I provide. (Standard boolean searches)
2) Fuzzy searches, natural language searches, "more like this" searches: help me find a verse which is something like this search.
In the first case the answer set probably is best ordered by bible verse. The second is probably better ordered by ranking.
Restricting of searches:
Again, another area where doing it in lucene is essential for speed. I haven't figured this one out yet, but I'm thinking I will write a custom lucene filter, which would be much faster if I stored the verse as an index and then produced a set of numerical ranges. For searching in the previous results, you should (I've been told) simply AND the searches together. I don't support these yet, and it is probably quite some work; it would probably only take 10s of searching time to retrofit it on top of lucene, but that is 10s on top of nothing.
The search speed of lucene is fast enough that restricting the search is not necessary. Using the BitSet does not add appreciable time. It is easy enough to create a mask and AND that with the search results to get the restricted answer set.
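Roughly, the mask-and-AND step could look like the sketch below. The ordinal ranges are made up for the example and do not correspond to real book boundaries.

```java
import java.util.BitSet;

class RestrictSketch {
    /** Keeps only the hits whose ordinal lies in [fromInclusive, toExclusive). */
    static BitSet restrictToRange(BitSet hits, int fromInclusive, int toExclusive) {
        BitSet mask = new BitSet();
        mask.set(fromInclusive, toExclusive);
        return restrictTo(hits, mask);
    }

    /** ANDs two sets without modifying either; also covers "search within previous results". */
    static BitSet restrictTo(BitSet hits, BitSet mask) {
        BitSet result = (BitSet) hits.clone();
        result.and(mask);
        return result;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(3); hits.set(1200); hits.set(26500);
        System.out.println(restrictToRange(hits, 1000, 2000)); // prints {1200}
    }
}
```

A restriction made of several ranges is just a mask with several set(from, to) calls before the AND.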
Other stuff:
Fuzzy searches are neat: "abraham~" finds abram and abraham; "hezikia~" finds hezekiah. Really useful for bad spellers and all those ridiculously impossible-to-spell bible names.
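For anyone curious why those examples match: lucene's fuzzy matching is based on edit distance between terms. The sketch below is a plain textbook Levenshtein distance, not lucene's own code, and it says nothing about the threshold lucene applies.

```java
class EditDistanceSketch {
    /** Classic Levenshtein distance: insertions, deletions and substitutions each cost 1. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("abraham", "abram"));    // 2
        System.out.println(distance("hezikia", "hezekiah")); // 2
    }
}
```

Both misspellings are only two edits away from the indexed word, which is close enough for the fuzzy query to pick them up.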
To highlight searches, you can get lucene to give you a list of words for a search. You can then highlight all of these words in the verse.
I saw your other post on fuzzy match and would like to know how you got the words that were hit out of lucene.
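Once the list of matched words is in hand (how to get it out of lucene is the open question above and is not covered here), the highlighting itself can be a simple regex pass. A sketch, with `<b>` tags chosen arbitrarily as the markers:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class HighlightSketch {
    /** Wraps every whole-word, case-insensitive occurrence of the given words in <b> tags. */
    static String highlight(String verse, List<String> words) {
        if (words.isEmpty()) {
            return verse;
        }
        StringBuilder alternatives = new StringBuilder();
        for (String word : words) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            alternatives.append(Pattern.quote(word)); // treat the word literally
        }
        Pattern pattern = Pattern.compile("\\b(" + alternatives + ")\\b",
                                          Pattern.CASE_INSENSITIVE);
        return pattern.matcher(verse).replaceAll("<b>$1</b>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("Jesus wept.", Arrays.asList("jesus")));
        // prints <b>Jesus</b> wept.
    }
}
```

For a fuzzy search the word list would hold the words that actually matched (e.g. abram and abraham), not the term the user typed.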
IMO rarely do people want to do OR searches, so I changed the default to AND in the lucene version used by MacSword. This means >>jesus wept<< is ANDed, >>jesus OR wept<< is ORed, and >>"jesus wept"<< is the phrase. Other than that the lucene syntax makes sense.
On a project that I did, we found that people wanted to do phrase searching even more than AND, and AND more than OR, unless they were doing "natural language" querying.
It might be nice to set it as a preference.
_______________________________________________ sword-devel mailing list sword-devel@crosswire.org http://www.crosswire.org/mailman/listinfo/sword-devel