Hi Steven,

Hm, not being able to find the exact original phrase does sound buggy to me. 
It seems worthy of a JIRA issue, plus a unit test that shows this happening, if 
you can write one.
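
In case it helps frame such a test, here is a minimal sketch in plain Python 
(not Lucene code) of how an exact phrase match behaves over the token positions 
reported below. Two things are assumptions rather than facts from the report: 
that 'equations' lands at position 3, right after 'planck', and that the query 
side produces only the tokens 'fokker-planck' and 'equations' at adjacent 
positions. The helper names build_index and phrase_matches are made up for 
illustration.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each term to the set of positions at which it was indexed."""
    index = defaultdict(set)
    for term, position in tokens:
        index[term].add(position)
    return index


def phrase_matches(index, terms):
    """Exact (slop 0) phrase match: term i must occur at position start + i."""
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )


# Positions as reported in the analysis output; 'equations' at 3 is assumed.
reported = build_index(
    [("fokker-planck", 1), ("fokker", 1), ("planck", 2), ("equations", 3)]
)

# Layout proposed in the report: all three tokens share the first position.
proposed = build_index(
    [("fokker-planck", 1), ("fokker", 1), ("planck", 1), ("equations", 2)]
)

print(phrase_matches(reported, ["fokker-planck", "equations"]))  # False
print(phrase_matches(reported, ["fokker", "planck"]))            # True
print(phrase_matches(proposed, ["fokker-planck", "equations"]))  # True
print(phrase_matches(proposed, ["fokker", "planck"]))            # False
```

Under those assumptions the reported layout misses "fokker-planck equations" 
because 'planck' occupies the position between the two query terms, while the 
proposed layout matches it and no longer matches "fokker planck".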

Thanks,
Otis 
----
Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



>________________________________
> From: Steven Fuchs <st...@aps.org>
>To: solr-user@lucene.apache.org 
>Sent: Monday, December 19, 2011 10:59 PM
>Subject: issues with WordDelimiterFilter
> 
>Hello All,
>I'm having an issue with the way the WordDelimiterFilter parses compound 
>words. My field declaration is simple and looks like this:
>
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>
>When indexing 'fokker-planck' I do get tokens for fokker, planck, and 
>fokker-planck. But in that case the fokker-planck token is followed by a 
>'planck' token. The analysis looks like this:
>
>
>position         1                    2
>term text        fokker-planck        planck
>                 fokker
>startOffset      0                    7
>                 0
>
>
>So in the case where fokker-planck is the first token there should be no 
>second token; it's already been used if the first was matched. The problem 
>manifests itself when doing phrase searches...
>
>"Fokker-Planck equations" won't find the exact phrase, Fokker-Planck 
>equations, because it sees the term planck as sitting between Fokker-Planck 
>and equations. Hope that makes sense! Should I submit this as a bug?
>
>As it stands it would return a true hit (erroneously I believe) on the phrase 
>search "fokker planck", so really all 3 tokens should be returned at offset 0 
>and there should be no second token so phrase searches are preserved.
>
>Thanks in advance
>Steven Fuchs
>
>
