Re: [sword-devel] Searching and Lucene thoughts

[EMAIL PROTECTED] Wed, 02 Mar 2005 13:55:56 -0800

I have implemented the Boyer-Moore before. I think that it is a bit biased toward common prefix languages as you stated. But it would work on any. The problem that I encountered is that it is easy to write for 7 bit ascii and later for 8 bit latin-1, but the compiled fsa becomes difficult for a large character set. (I.e. my simple algorithm won't work)

The reason it works is that when it fails on matching a word, it does not need to start at the next character, but it can skip ahead. The longer the word, the better the speed.

Here is a very basic understanding: if in searching for the word "blessing" one were to match "bl" and then fail on an "a" you know that you do not need to restart on "l" or "a" but you can start on a later letter. Where you start is based upon the letters in the word being searched and the letter that you failed upon.

However, the Boyer-Moyer does not start from the beginning of the word, but from the end. So it tries to match the "g" first and then the "n". If it were looking at the word blasphemy, the eigth letter of "blessing", the "g", would have been checked against the eigth leter of "blasphemy", the "m". Since "g" does not occur in "blessing" more than once and since "m" does not occur in it at all, we can then skip forward 8 letters. This is a tremendous savings.

If the movement from latin-1 to utf-8 and old search to lucene is at all slow, then I think it may be worth it to implement it.

Chris Little wrote:

No. Standard Sword searches just start at the beginning and search to the end, byte by byte.

Just on the basis of the abstract you link to, I don't see how this would be of any benefit. The Boyer-Moore algorithm is very language-specific. It benefits from the fact that English is a predominantly suffixing language, as are most European languages, I would say. Personally, I have difficulty imagining how this actually speeds search times, but I assume they've done testing and that their claims are accurate.

The standard linear search is the most general purpose search algorithm, and I think general purpose is what we need to maintain. For people who want faster searches, there is indexed searching available.
--Chris
Lynn Allan wrote:
<alert comment="iwnacsmndipootv ... i was not a computer science major
... ">
Just curious ... does non-indexed sword-api searching use c.s. algorithms like Boyer-Moore searching? http://portal.acm.org/citation.cfm?id=359859&coll=ACM&dl=ACM&CFID=13545783&CFTOKEN=93236524

Something I tried to read once (and it was waaaaaay over my head) concerned very smart "state machine" searching when there is more than one word being searched for. Seems like it involved Bell Lab researchers? From one of the A or W or K dudes? http://portal.acm.org/citation.cfm?id=360855&coll=ACM&dl=ACM&CFID=13626066&CFTOKEN=93658335
Does D. Knuth discuss string matching optimizations?
Would that be applicable to the sword-api?
</alert>
_______________________________________________
sword-devel mailing list
sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
_______________________________________________
sword-devel mailing list
sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel


_______________________________________________
sword-devel mailing list
sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel

Re: [sword-devel] Searching and Lucene thoughts

Reply via email to