I'm using an analyzer with a WordDelimiterFilter before a LengthFilter.
When I run org.apache.lucene.index.CheckIndex on my Solr index, I get
some errors.
CheckIndex complains that some docs have terms incorrectly positioned at
-1. Note that Solr itself doesn't complain in any way, but Luke's
"Reconstruct & Edit" and CheckIndex both report a "damaged" index.
Investigating the problem, it seems related to the LengthFilter, or maybe
to the combination of LengthFilter + WordDelimiterFilter. I tried to
reproduce the same issue on the Solr Hudson build
(apache-solr-2008-03-24_09-57-01.zip) example dir, (un)fortunately with
success.
Steps to reproduce this issue:
1) Take the trunk example dir.
2) Change the "text" fieldType in schema.xml, simply adding a LengthFilter
somewhere in the chain AFTER the WordDelimiterFilter:
===============
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LengthFilterFactory" min="3" max="50" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
===========
3) Run Solr, i.e. java -Xmx200m -jar start.jar
Now index this XML with post.sh:
========
<add><doc>
<field name="id">0579B002</field>
<field name="name">testname</field>
<field name="manu">testmanu</field>
<field name="cat">testcat</field>
<field name="features">U.S.A. and U.K.</field>
<field name="inStock">true</field>
</doc></add>
========
and run this:
java -cp lucene-core-2.3.1.jar org.apache.lucene.index.CheckIndex
solr/data/index
You will get something like:
3 of 3: name=_2 docCount=1
compound=false
numFiles=11
size (MB)=0,001
no deletions
test: open reader.........OK
test: fields, norms.......OK [13 fields]
test: terms, freq, prox...FAILED
WARNING: would remove reference to this segment (-fix was not
specified); full exception:
java.lang.RuntimeException: term features:usa: doc 0: pos -1 is out of
bounds
at org.apache.lucene.index.CheckIndex.check(CheckIndex.java:205)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:362)
An easier test is to open Luke on that index, go to the last document,
and click on "Reconstruct & Edit"; you will get an exception.
If you analyze the query "U.S.A. and U.K." with analysis.jsp, you will
notice that USA is positioned at 0, while initially it was at 1. I don't
know if this just means "the token wasn't originally present in the doc",
but since CheckIndex complains about it, it makes me think this is a bug.
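To show the arithmetic behind that -1, here is a minimal sketch in plain
Java (not Lucene code). It assumes the WordDelimiterFilter emits "U", "S",
"A" each with position increment 1 and the catenated "USA" with increment
0 (stacked on the last sub-token), and that the LengthFilter then drops
the single-character tokens without carrying over their increments:
========
public class PositionIncrementSketch {
    public static void main(String[] args) {
        // Tokens that survive LengthFilter (min=3) for "U.S.A. and U.K.",
        // with the position increments they carry into the indexer.
        // Assumption: the catenated "usa" keeps its increment of 0, while
        // the +1 increments of the dropped "u", "s", "a" are simply lost.
        String[] terms      = { "usa" };
        int[]    increments = { 0 };

        int position = -1; // Lucene's running position before the first token
        for (int i = 0; i < terms.length; i++) {
            position += increments[i];
            System.out.println(terms[i] + " -> position " + position);
        }
        // Prints "usa -> position -1", which is exactly what CheckIndex
        // reports as out of bounds for the "features" field.
    }
}
========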
NOTICE:
1) Putting the LengthFilter BEFORE the WordDelimiterFilter seems to work;
however, the results change, as some entries get broken down into little
pieces: "U.S.A." passes a LengthFilter placed before the
WordDelimiterFilter, so 'U', 'S', 'A' and 'USA' are all kept, whereas the
single-character pieces wouldn't pass a LengthFilter placed after the
WordDelimiterFilter ('U', 'S', 'A' are dropped, 'USA' is kept). I would
like to achieve the latter (see the sketch after these notes).
2) This messes up the index (I presume) only when the first token of the
field being analyzed (indexed) contains dots; but what happens in other
situations? Maybe other documents have wrong term positions (wrong, but
not -1), e.g. for "Let's go to U.S.A."?
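To make notice 1 concrete, here is a small plain-Java sketch (an
illustration, not Lucene code) of which tokens of "U.S.A." survive in each
ordering, assuming the WordDelimiterFilter (catenateWords=1) produces
'U', 'S', 'A' and 'USA', and the LengthFilter uses min="3":
========
import java.util.ArrayList;
import java.util.List;

public class FilterOrderSketch {
    public static void main(String[] args) {
        String original = "U.S.A.";
        // Assumed WordDelimiterFilter output (catenateWords=1) for "U.S.A."
        String[] afterWDF = { "U", "S", "A", "USA" };
        int minLength = 3; // LengthFilter min="3"

        // LengthFilter BEFORE WordDelimiterFilter: the whole "U.S.A."
        // (6 chars) passes, so every sub-token reaches the index.
        List<String> lengthFirst = new ArrayList<String>();
        if (original.length() >= minLength) {
            for (String t : afterWDF) lengthFirst.add(t);
        }

        // LengthFilter AFTER WordDelimiterFilter: only "USA" survives
        // (the result I actually want).
        List<String> lengthLast = new ArrayList<String>();
        for (String t : afterWDF) {
            if (t.length() >= minLength) lengthLast.add(t);
        }

        System.out.println("LengthFilter first: " + lengthFirst); // [U, S, A, USA]
        System.out.println("LengthFilter last:  " + lengthLast);  // [USA]
    }
}
========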
If this doesn't reproduce the bug, tell me and I will send you both files
(schema.xml and up.xml); I don't know if this mailing list accepts
attachments.
Any clue on this issue? AFAIK it shouldn't be a problem to put a filter
anywhere in a chain, so am I messing things up, or is this a bug?
Walter