Re: [sword-devel] Searching and Lucene thoughts

Will Thimbleby Wed, 02 Mar 2005 15:55:59 -0800

On 2 Mar 2005, at 12:45 am, DM Smith wrote:

Can we enumerate what Lucene does not support that we want for Biblical searching?

The only thing I saw was that it did not find adjacent documents. For example, find all verses containing Moses within 5 verses of Aaron.

As long as we build the index from first verse to last verse, the index that lucene returns is the number lucene assigned to the verse when the verse was added. We cannot reliably use this number to figure out which verse is returned (e.g. 3 may or may not mean Genesis 1:3; in an NT-only module it would mean Matthew 1:3). For this reason we have stored the OSIS reference in the index along with the verse. However, we can be certain (because lucene guarantees it) that index 25 and index 26 are two verses that were added one after the other.

To do proximity searching, we probably have to parse the search request for a special w/in conjunction, run each part as a separate query, and put the results together via post-processing.
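
A minimal sketch of that post-processing step, assuming each sub-query's hits have already been collected into a java.util.BitSet keyed by index position, and that the index was built in verse order as described above. The class and method names (ProximitySketch, within) are hypothetical and are not part of lucene or SWORD.

```java
import java.util.BitSet;

// Hypothetical helper for illustration; not a lucene or SWORD API.
class ProximitySketch {

    /**
     * Returns the hits from a that have at least one hit from b
     * within distance index positions, before or after.
     */
    static BitSet within(BitSet a, BitSet b, int distance) {
        BitSet result = new BitSet();
        for (int i = a.nextSetBit(0); i >= 0; i = a.nextSetBit(i + 1)) {
            int windowStart = Math.max(0, i - distance);
            // First b-hit at or after the start of the window, or -1 if none.
            int next = b.nextSetBit(windowStart);
            if (next >= 0 && next <= i + distance) {
                result.set(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet moses = new BitSet();
        moses.set(10); moses.set(40); moses.set(100);
        BitSet aaron = new BitSet();
        aaron.set(14); aaron.set(46);
        // Only position 10 has an "aaron" hit within 5 positions.
        System.out.println(within(moses, aaron, 5)); // prints {10}
    }
}
```

Note that index positions are only known to be adjacent in insertion order, so a window like this can run across a chapter or book boundary; whether that is wanted would need a separate check against the stored OSIS reference.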

Has anyone thought of another way?

Here are some things Accordance does -- it just seems overcomplicated to me (I can't see how some of the features would ever be used other than for tedious academic research):

It can search within: verse, chapter, clause, sentence, paragraph, book
You can specify tags for: stem, aspect, person, gender, number, state
Examples:
creat* <FOLLOWED BY> <WITHIN 10 WORDS> earth <NOT> made
[VERB perfect] @~~~ (hebrew chars)

The only thing afaik that lucene won't do for us, even with a bit of work, is multi-document searching. Searching across verses is confusing -- the only constructs that make sense are proximity constructs. Looking at the source for lucene I *might* actually be able to do this. I'll get back to you on this.

Troy: you asked for my code to access index order. I can give you java code, but clucene doesn't support it yet. There seem to be many areas where clucene is lagging far behind lucene. Sorting is one example, and doing it inside lucene is essential for fast searching.

I would be interested in the Java code, if you don't mind.

I don't access it as such; I just pass the index sorter to the searcher, e.g. s.search(query, Sort.INDEXORDER). I'm not sure how to access the id itself.

<snip/>
Restricting of searches: Again, another area that is essential for speed to do in lucene. I haven't figured this one out yet, but I'm thinking I will write a custom lucene filter, which would be much faster if I stored the verse as an index and then produced a set of numerical ranges. For searching in the previous results, you should (I've been told) simply AND the searches together. I don't support these yet, and it is probably quite some work -- it would probably only take 10s of searching time to retrofit it on top of lucene, but that is 10s on top of nothing.
The search speed of lucene is fast enough that restricting the search is not necessary. Using the BitSet does not add appreciable time. It is easy enough to create a mask and AND that with the search results to get the restricted answer set.
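
The mask-and-AND approach can be sketched in plain Java, assuming hits are held in a java.util.BitSet keyed by index position and the restriction is given as inclusive ranges of index positions. The names RestrictSketch, maskFromRanges and restrict are hypothetical, and the ranges in main are made-up numbers, not real verse counts.

```java
import java.util.BitSet;

// Hypothetical helper for illustration; not a lucene or SWORD API.
class RestrictSketch {

    /** Builds a mask with a bit set for every position in each inclusive [start, end] range. */
    static BitSet maskFromRanges(int[][] ranges) {
        BitSet mask = new BitSet();
        for (int[] r : ranges) {
            mask.set(r[0], r[1] + 1); // BitSet.set(from, to) excludes "to"
        }
        return mask;
    }

    /** Returns a copy of hits restricted to the mask; neither input is modified. */
    static BitSet restrict(BitSet hits, BitSet mask) {
        BitSet out = (BitSet) hits.clone();
        out.and(mask);
        return out;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(3); hits.set(250); hits.set(1200); hits.set(5000);
        // Made-up ranges standing in for "the books the user selected".
        BitSet mask = maskFromRanges(new int[][] {{0, 999}, {5000, 5999}});
        System.out.println(restrict(hits, mask)); // prints {3, 250, 5000}
    }
}
```

Searching within previous results is the same operation with the earlier result set used as the mask.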

How do you use your BitSet? At the moment I like that I don't access the document information at all until it is displayed. This means I can do live-searching (as the user types) even for large searches like "and".

Other stuff: Fuzzy searches are neat: "abraham~" finds abram and abraham; "hezikia~" finds hezekiah. Really useful for bad spellers and all those ridiculously impossible to spell Bible names. To highlight searches, you can get lucene to give you a list of words for a search. You can then highlight all of these words in the verse.
I saw your other post on fuzzy match and would like to know how you got the words that were hit out of lucene.
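
To illustrate why "abraham~" can reach abram, here is a sketch of edit-distance matching in plain Java. It assumes a similarity of 1 - distance / (length of the shorter word) and a cutoff of 0.5; that is my understanding of how fuzzy queries in lucene of this era score terms, but treat both the formula and the cutoff as assumptions. FuzzySketch and its methods are hypothetical names.

```java
// Hypothetical helper for illustration; not lucene's implementation.
class FuzzySketch {

    /** Classic Levenshtein distance using two rolling rows. */
    static int editDistance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.length()];
    }

    /** Assumed similarity: 1 - distance / length of the shorter word. */
    static float similarity(String term, String candidate) {
        int shorter = Math.min(term.length(), candidate.length());
        if (shorter == 0) return term.equals(candidate) ? 1.0f : 0.0f;
        return 1.0f - (float) editDistance(term, candidate) / shorter;
    }

    static boolean fuzzyMatch(String term, String candidate, float minSimilarity) {
        return similarity(term, candidate) > minSimilarity;
    }

    public static void main(String[] args) {
        System.out.println(fuzzyMatch("abraham", "abram", 0.5f));    // prints true
        System.out.println(fuzzyMatch("hezikia", "hezekiah", 0.5f)); // prints true
        System.out.println(fuzzyMatch("abraham", "adam", 0.5f));     // prints false
    }
}
```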

Have a look in lucene/contrib/highlighter/ ... /QueryTermExtractor.java I just cut the useful bits from it.
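
Once the list of matched words is in hand, the highlighting itself needs nothing from lucene. A minimal sketch in plain Java, assuming the terms arrive as a list of strings and that whole-word, case-insensitive matching is what is wanted; HighlightSketch and the bracket markers are hypothetical, and a real front end would substitute its own markup.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not the contrib highlighter.
class HighlightSketch {

    /** Wraps every whole-word, case-insensitive occurrence of any term in open/close markers. */
    static String highlight(String verse, List<String> terms, String open, String close) {
        if (terms.isEmpty()) return verse;
        // Quote each term so regex metacharacters in a term are matched literally.
        String alternation = terms.stream().map(Pattern::quote).collect(Collectors.joining("|"));
        Pattern p = Pattern.compile("\\b(" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
        String replacement = Matcher.quoteReplacement(open) + "$1" + Matcher.quoteReplacement(close);
        return p.matcher(verse).replaceAll(replacement);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("abram", "abraham");
        // prints: Then [Abram] fell on his face
        System.out.println(highlight("Then Abram fell on his face", terms, "[", "]"));
    }
}
```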

IMO rarely do people want to do OR searches, so I changed the default to AND in the lucene version used by MacSword. This means >>jesus wept<< is ANDed, >>jesus OR wept<< is ORed, and >>"jesus wept"<< is the phrase. Other than that the lucene syntax makes sense.
On a project that I did, we found that people wanted to do phrase searching even more than AND, and AND more than OR, unless they were doing "natural language" querying. It might be nice to set it as a preference.

I thought that would be the case; however, from a syntax point of view I found having phrases as the default confusing. I think >"jesus is" +god< is clearer than >jesus is +god< -- but there is possibly a better way. I don't think having a preference is a good idea though. I like having one syntax: it is one less thing for the user and it simplifies my search windows.
