FYI: Issue 2, removing stopwords, will break backward compatibility with existing indexes. The existing indexes will not contain the stopwords. New indexes will. This can be very confusing to users.

If breaking backward compatibility is acceptable, I suggest switching from StandardAnalyzer to SimpleAnalyzer. It has no stopwords to begin with and will index the text without the silly transformations that StandardAnalyzer applies.
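
To make the compatibility problem concrete, here is a toy model in C++. It is not SWORD or CLucene code: analyze() and matchesAll() are hypothetical stand-ins, and the two-word stop list is made up. The analyzer lowercases and splits on non-letters, which is roughly what a letter-based analyzer without stopwords does. An old index built with stopwords removed cannot satisfy a query that is analyzed without stopword removal.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for an analyzer: lowercases the text, splits on
// non-letters, and drops any token found in stopWords.
std::vector<std::string> analyze(const std::string& text,
                                 const std::set<std::string>& stopWords) {
    std::vector<std::string> tokens;
    std::string current;
    auto flush = [&]() {
        if (!current.empty() && stopWords.count(current) == 0) {
            tokens.push_back(current);
        }
        current.clear();
    };
    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else {
            flush();  // a non-letter ends the current token
        }
    }
    flush();  // emit the last token, if any
    return tokens;
}

// Returns true only if every query term occurs among the indexed terms
// (a simple AND query; positions are ignored).
bool matchesAll(const std::vector<std::string>& indexed,
                const std::vector<std::string>& query) {
    const std::set<std::string> terms(indexed.begin(), indexed.end());
    for (const auto& q : query) {
        if (terms.count(q) == 0) return false;
    }
    return true;
}
```

With a stop list of {"in", "the"}, the text "In the beginning" is indexed as the single term "beginning", so a query for all three words fails against the old index and succeeds against a rebuilt one.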

The segfault is surprising to me. I suggest checking with the clucene folks to see why it is happening. I really doubt it is a bug in clucene; more likely it is in SWORD's use of it.

Adding fields should probably be accompanied by adding versioning to the index. What the Java Lucene folks are doing for version 3.0 is storing with the index a manifest of sorts that describes what was used to build it.
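
A minimal sketch of what such a manifest check could look like, in C++. Everything here is an assumption for illustration: the key=value text format, the key names, and the functions serializeManifest(), parseManifest() and indexNeedsRebuild() are hypothetical and are neither SWORD API nor the format Lucene uses.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical manifest: settings that were in effect when the index was built.
using Manifest = std::map<std::string, std::string>;

// Writes one "key=value" line per entry (assumed format).
std::string serializeManifest(const Manifest& m) {
    std::ostringstream out;
    for (const auto& kv : m) {
        out << kv.first << '=' << kv.second << '\n';
    }
    return out.str();
}

// Parses "key=value" lines; lines without '=' are ignored.
Manifest parseManifest(const std::string& text) {
    Manifest m;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const auto eq = line.find('=');
        if (eq == std::string::npos) continue;
        m[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return m;
}

// An index is stale if its stored manifest differs from the current settings.
// A missing (empty) manifest also counts as stale, which covers old indexes.
bool indexNeedsRebuild(const std::string& storedText, const Manifest& current) {
    return parseManifest(storedText) != current;
}
```

The indexer would store the serialized text next to the index, and a frontend would call indexNeedsRebuild() with the stored text and its current settings before searching. Any change of analyzer or field list then shows up as a mismatch.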

In Him,
        DM


On Aug 29, 2009, at 4:25 PM, Matthew Talbert wrote:

I'm attaching a patch to fix several issues with indexed search.

Issue 1: large text fields weren't getting indexed due to a low MAX_CONV_SIZE
    Resolution: change MAX_CONV_SIZE to 1024 * 1024, and add call to
writer to boost its maximum field size

Issue 2: search causes segfault when searching for stop words
    Resolution: set analyzer stop words to NULL for both index
creation and search. Possibly this would only have to be set for
search, and left on to lower the index size.

Issue 3: index causes segfault *after indexing* when module location
isn't writable.
    Resolution: check the return value of
FileMgr::createParent(target + "/dummy"); if return value is -1, abort
indexing

In addition, this patch adds fields for footnotes, morphology, and
headers. I *really* would like to see this added to the default
indexing. The reason is that with indexed search it is possible to
combine fields in one search, something that SWORD attribute search
doesn't allow (AFAIK). And indexed search is much faster, of course.
My patch only covers one of the three spots this would apparently need
to be added. I didn't understand why there was so much duplicated
code, nor was I entirely comfortable with the code I had written, so I
didn't expand it to cover all cases. It appears that the code for
adding fields like strongs is the same in 3 different spots. Surely
this could be condensed somehow?

I really would like to see the first 3 issues fixed immediately (i.e.,
before next release). Issue 1 makes most genbook indexed search
pointless, while Issues 2 and 3 have both been reported as issues
against Xiphos. Of course, we can't control the segfault in either
case. As far as the extra fields, that will need some extra work, but
I feel it's really important as well. At some point, I am going to
redo the search functionality in Xiphos, and my plan is to implement
indexing myself if these fields aren't in SWORD by then.

I have been meaning to address these issues for some time, but hadn't
gotten around to it yet. The bug report we had forced the issue. While
we're at it, I'd like to bring up two more issues.

1. If the module location isn't writable, there isn't a way for the
user to create an index. I would like to see indexes created somewhere
else in this case, e.g. ~/.sword/indexes. I believe BT does something
like this already.

2. We currently have no way of notifying the user if the indexes are
no longer valid, or if they should be updated. I would like to see a
versioning scheme for indexes. For example, with the changes here, and
the changes for Hebrew search, all Hebrew indexes previously created
are now useless. How do we tell the user that he needs to re-create
the index? Along the same lines, all genbook indexes, and many
commentary indexes are incorrect. With the next release of SWORD,
hopefully with this issue resolved, it would be nice to be able to
notify the user that the indexes are now out-of-date or incorrect and
need to be rebuilt.
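
Point 1 above (falling back to a user-writable location) could be sketched roughly as follows. This is only an illustration under assumptions: chooseIndexDir() and isWritableDir() are hypothetical names, the "lucene" subdirectory and the per-module layout under ~/.sword/indexes are assumed, and the writability check uses POSIX access(), so it is not portable as written.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unistd.h>  // access(), W_OK (POSIX only)

// Default writability check, based on POSIX access().
bool isWritableDir(const std::string& path) {
    return access(path.c_str(), W_OK) == 0;
}

// Picks where the index for a module should live. If the module directory is
// writable, the index goes next to the module (assumed "lucene" subdirectory);
// otherwise it goes under the user's home (assumed ~/.sword/indexes/<name>).
// The check is injectable so the decision logic can be tested on its own.
std::string chooseIndexDir(
        const std::string& moduleDir,
        const std::string& homeDir,
        const std::string& moduleName,
        const std::function<bool(const std::string&)>& writable = isWritableDir) {
    if (writable(moduleDir)) {
        return moduleDir + "/lucene";
    }
    return homeDir + "/.sword/indexes/" + moduleName;
}
```

The search code would need the same lookup order so that it finds an index in either place.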

Finally, I would like to point out a great tool for examining
lucene/clucene indexes. You can get it here:
http://www.getopt.org/luke/

Matthew


