I'm using an analyzer with a WordDelimiterFilter before a LengthFilter.
When I run org.apache.lucene.index.CheckIndex on my Solr index, I get
some errors.
CheckIndex complains that some docs have terms incorrectly positioned at
-1. Note that Solr itself doesn't complain in any way, but Luke's
"Reconstruct & Edit" and CheckIndex both report a "damaged" index.
Investigating the problem, it seems related to the LengthFilter, or maybe
to the combination of LengthFilter + WordDelimiterFilter. I tried to
reproduce the same issue on the Solr Hudson build
(apache-solr-2008-03-24_09-57-01.zip) example dir, (un)fortunately with
success.
Steps to reproduce this issue:
1) Take the trunk example dir.
2) Change the "text" fieldType in schema.xml, simply adding a LengthFilter
somewhere in the chain AFTER the WordDelimiterFilter:
===============
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LengthFilterFactory" min="3" max="50" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
===========
3) Run Solr, i.e. java -Xmx200m -jar start.jar
Now index this XML with post.sh:
========
<add><doc>
<field name="id">0579B002</field>
<field name="name">testname</field>
<field name="manu">testmanu</field>
<field name="cat">testcat</field>
<field name="features">U.S.A. and U.K.</field>
<field name="inStock">true</field>
</doc></add>
========
and run this:
java -cp lucene-core-2.3.1.jar org.apache.lucene.index.CheckIndex
solr/data/index
You will get something like:
3 of 3: name=_2 docCount=1
compound=false
numFiles=11
size (MB)=0,001
no deletions
test: open reader.........OK
test: fields, norms.......OK [13 fields]
test: terms, freq, prox...FAILED
WARNING: would remove reference to this segment (-fix was not
specified); full exception:
java.lang.RuntimeException: term features:usa: doc 0: pos -1 is out of
bounds
at org.apache.lucene.index.CheckIndex.check(CheckIndex.java:205)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:362)
An easier test is to open Luke on that index, go to the last document,
and click on "Reconstruct & Edit"; you will get an exception.
If you analyze the query "U.S.A. and U.K." with analysis.jsp, you will
notice that USA is positioned at 0, while initially it was at 1. I don't
know if this just means "the token wasn't originally present in the doc",
but since CheckIndex complains about it, it makes me think this is a bug.
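To show the arithmetic behind that -1, here is a minimal sketch in plain
Java (not Lucene code). It assumes the WordDelimiterFilter emits "U", "S",
"A" each with position increment 1 and the catenated "USA" with increment
0 (stacked on the last sub-token), and that the LengthFilter then drops
the single-character tokens without carrying over their increments:
========
public class PositionIncrementSketch {
    public static void main(String[] args) {
        // Tokens that survive LengthFilter (min=3) for "U.S.A. and U.K.",
        // with the position increments they carry into the indexer.
        // Assumption: the catenated "usa" keeps its increment of 0, while
        // the +1 increments of the dropped "u", "s", "a" are simply lost.
        String[] terms      = { "usa" };
        int[]    increments = { 0 };

        int position = -1; // Lucene's running position before the first token
        for (int i = 0; i < terms.length; i++) {
            position += increments[i];
            System.out.println(terms[i] + " -> position " + position);
        }
        // Prints "usa -> position -1", which is exactly what CheckIndex
        // reports as out of bounds for the "features" field.
    }
}
========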
NOTICE:
1) Putting the LengthFilter BEFORE the WordDelimiterFilter seems to work;
however, the results change, as some entries get broken down into little
pieces: "U.S.A." passes a LengthFilter placed before the
WordDelimiterFilter, so 'U', 'S', 'A' and 'USA' are all kept, whereas the
single-character pieces wouldn't pass a LengthFilter placed after the
WordDelimiterFilter ('U', 'S', 'A' are dropped, 'USA' is kept). I would
like to achieve the latter (see the sketch after these notes).
2) This messes up the index (I presume) only when the first token of the
field being analyzed (indexed) contains dots; but what happens in other
situations? Maybe other documents have wrong term positions (wrong, but
not -1), e.g. for "Let's go to U.S.A."?
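To make notice 1 concrete, here is a small plain-Java sketch (an
illustration, not Lucene code) of which tokens of "U.S.A." survive in each
ordering, assuming the WordDelimiterFilter (catenateWords=1) produces
'U', 'S', 'A' and 'USA', and the LengthFilter uses min="3":
========
import java.util.ArrayList;
import java.util.List;

public class FilterOrderSketch {
    public static void main(String[] args) {
        String original = "U.S.A.";
        // Assumed WordDelimiterFilter output (catenateWords=1) for "U.S.A."
        String[] afterWDF = { "U", "S", "A", "USA" };
        int minLength = 3; // LengthFilter min="3"

        // LengthFilter BEFORE WordDelimiterFilter: the whole "U.S.A."
        // (6 chars) passes, so every sub-token reaches the index.
        List<String> lengthFirst = new ArrayList<String>();
        if (original.length() >= minLength) {
            for (String t : afterWDF) lengthFirst.add(t);
        }

        // LengthFilter AFTER WordDelimiterFilter: only "USA" survives
        // (the result I actually want).
        List<String> lengthLast = new ArrayList<String>();
        for (String t : afterWDF) {
            if (t.length() >= minLength) lengthLast.add(t);
        }

        System.out.println("LengthFilter first: " + lengthFirst); // [U, S, A, USA]
        System.out.println("LengthFilter last:  " + lengthLast);  // [USA]
    }
}
========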
If this doesn't reproduce the bug, tell me and I will send you both files
(schema.xml and up.xml); I don't know if this mailing list accepts
attachments.
Any clue on this issue? AFAIK it shouldn't be a problem to put a filter
anywhere in a chain, so am I messing things up, or is this a bug?
Walter