I'm using an analyzer with a WordDelimiterFilter before a LengthFilter. When I run org.apache.lucene.index.CheckIndex on my Solr index, I get errors: CheckIndex complains that some documents have terms incorrectly positioned at -1. Solr itself doesn't complain in any way, but both Luke's "Reconstruct&Edit" and CheckIndex report a "damaged" index.

Investigating the problem, it seems related to LengthFilter, or perhaps to the combination of LengthFilter and WordDelimiterFilter. I tried to reproduce the issue in the example dir of a Solr Hudson build (apache-solr-2008-03-24_09-57-01.zip), (un)fortunately with success.

Steps to reproduce this issue:
1) Take the trunk example dir.

2) Edit schema.xml and add a LengthFilter to the "text" fieldType, somewhere in the chain AFTER the WordDelimiterFilter:
===============
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
         <analyzer type="index">
           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
           <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
           <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
           <filter class="solr.LengthFilterFactory" min="3" max="50" />
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
         </analyzer>
         <analyzer type="query">
           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
           <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
           <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
           <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
         </analyzer>
   </fieldType>
===========
3) Run Solr, i.e. java -Xmx200m -jar start.jar

Now index this XML with post.sh:
========
<add><doc>
 <field name="id">0579B002</field>
 <field name="name">testname</field>
 <field name="manu">testmanu</field>
 <field name="cat">testcat</field>
 <field name="features">U.S.A. and U.K.</field>
 <field name="inStock">true</field>
</doc></add>
========

Then run:
java -cp lucene-core-2.3.1.jar org.apache.lucene.index.CheckIndex solr/data/index

You will get something like:
 3 of 3: name=_2 docCount=1
   compound=false
   numFiles=11
   size (MB)=0,001
   no deletions
   test: open reader.........OK
   test: fields, norms.......OK [13 fields]
   test: terms, freq, prox...FAILED
WARNING: would remove reference to this segment (-fix was not specified); full exception: java.lang.RuntimeException: term features:usa: doc 0: pos -1 is out of bounds
       at org.apache.lucene.index.CheckIndex.check(CheckIndex.java:205)
       at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:362)

An easier test is to open Luke on that index, go to the last document and click on "Reconstruct&Edit": you will get an exception. If you analyze the text "U.S.A. and U.K." with analysis.jsp, you will see USA end up at position 0, while it was originally at position 1. I don't know if this just means "the token wasn't originally present in the doc", but since CheckIndex complains about it, it makes me think this is a bug.
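To make the suspected mechanism concrete, here is a minimal, self-contained sketch in plain Java. It does not use Lucene classes: `Tok`, `assumedDelimiterOutput`, `lengthFilter` and `positions` are hypothetical names for illustration only. It assumes that the word-delimiter step emits USA with a position increment of 0, stacked on the first part (inferred from the analysis page showing USA at the first position), that the length filter drops short tokens without adjusting the increments of the survivors, and that the indexer counts zero-based positions by starting at -1 and adding each increment.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionModel {
    /** Hypothetical stand-in for a token: just its text and position increment. */
    public static final class Tok {
        public final String text;
        public final int posInc;
        public Tok(String text, int posInc) { this.text = text; this.posInc = posInc; }
    }

    /** ASSUMPTION: tokens produced for "U.S.A." before the length filter runs. */
    public static List<Tok> assumedDelimiterOutput() {
        List<Tok> toks = new ArrayList<Tok>();
        toks.add(new Tok("U", 1));
        toks.add(new Tok("USA", 0)); // catenated form, stacked on "U"
        toks.add(new Tok("S", 1));
        toks.add(new Tok("A", 1));
        return toks;
    }

    /** Drops tokens outside [min, max]; survivors keep their original increments. */
    public static List<Tok> lengthFilter(List<Tok> in, int min, int max) {
        List<Tok> out = new ArrayList<Tok>();
        for (Tok t : in) {
            int len = t.text.length();
            if (len >= min && len <= max) {
                out.add(t);
            }
        }
        return out;
    }

    /** ASSUMPTION: zero-based positions start at -1 and grow by each increment. */
    public static List<Integer> positions(List<Tok> toks) {
        List<Integer> out = new ArrayList<Integer>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> all = assumedDelimiterOutput();
        System.out.println(positions(all));                       // [0, 0, 1, 2]
        System.out.println(positions(lengthFilter(all, 3, 50)));  // [-1]
    }
}
```

Under these assumptions, the only surviving token of the first word is USA with increment 0, which lands on -1, matching the "pos -1 is out of bounds" message from CheckIndex.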

NOTICE:
1) Putting the LengthFilter BEFORE the WordDelimiterFilter seems to work, but the results change, because some entries get split into small pieces. With the LengthFilter first, U.S.A. passes it as a whole and is then split, so 'U', 'S', 'A' and 'USA' are all kept. With the LengthFilter after the WordDelimiterFilter, the short pieces are dropped ('U', 'S', 'A' are ignored, 'USA' is kept). I would like to achieve the latter.

2) I presume this corrupts the index only when the first token of the analyzed field contains dots. But what happens in other situations? Might other documents have wrong term positions (wrong, but not -1), as with 'Let's go to U.S.A.'?
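The difference described in note 1 can be sketched with a simplified model of the two filter orders. This is plain Java, not the Solr filters: `splitAndCatenate` is a hypothetical stand-in that only splits on dots and appends the catenated form, and it ignores positions and the real order in which tokens are emitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FilterOrderModel {
    /** Simplified word-delimiter step: split on '.', then add the catenation. */
    public static List<String> splitAndCatenate(List<String> in) {
        List<String> out = new ArrayList<String>();
        for (String tok : in) {
            StringBuilder all = new StringBuilder();
            int parts = 0;
            for (String p : tok.split("\\.")) {
                if (p.length() == 0) {
                    continue; // skip empty pieces between consecutive dots
                }
                out.add(p);
                all.append(p);
                parts++;
            }
            if (parts > 1) {
                out.add(all.toString());
            }
        }
        return out;
    }

    /** Keeps only tokens whose length is within [min, max]. */
    public static List<String> lengthFilter(List<String> in, int min, int max) {
        List<String> out = new ArrayList<String>();
        for (String tok : in) {
            if (tok.length() >= min && tok.length() <= max) {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("U.S.A.");
        // Length filter first: "U.S.A." (6 chars) passes, then gets split.
        System.out.println(splitAndCatenate(lengthFilter(input, 3, 50))); // [U, S, A, USA]
        // Word-delimiter step first: the short parts are dropped afterwards.
        System.out.println(lengthFilter(splitAndCatenate(input), 3, 50)); // [USA]
    }
}
```

Only the second order gives the token set I am after, which is why moving the LengthFilter earlier in the chain is not a real workaround.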

If this doesn't reproduce the bug, tell me and I will send you both files (schema.xml and up.xml); I don't know whether this mailing list accepts attachments.

Any clue on this issue? AFAIK it should not be a problem to add a filter to a chain, so am I messing things up, or is this a bug?

Walter
