Hi Steven,

Hm, not being able to find the exact original phrase does sound buggy to me. It seems worthy of a JIRA issue, ideally with a unit test that shows this happening, if you can.

Thanks,
Otis
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

>________________________________
> From: Steven Fuchs <st...@aps.org>
>To: solr-user@lucene.apache.org
>Sent: Monday, December 19, 2011 10:59 PM
>Subject: issues with WordDelimiterFilter
>
>Hello All,
>I'm having an issue with the way the WordDelimiterFilter parses compound
>words. My field declaration is simple and looks like this:
>
>    <analyzer type="index">
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
>      <filter class="solr.LowerCaseFilterFactory"/>
>    </analyzer>
>
>When indexing 'fokker-planck' I do get tokens for fokker, planck, and
>fokker-planck. But the fokker-planck token is followed by a separate
>'planck' token in the next position. The analysis looks like this:
>
>
>position      1               2
>term text     fokker-planck   planck
>              fokker          (table layout implies planck)
>startOffset   0               7
>              0
>
>
>So in the case where fokker-planck is the first token there should be no second
>token; it has already been used if the first was matched. The problem manifests
>itself when doing phrase searches...
>
>"Fokker-Planck equations" won't find the exact phrase, Fokker-Planck equations,
>because it sees the term planck as sitting between Fokker-Planck and equations.
>Hope that makes sense! Should I submit this as a bug?
>
>As it stands it would return a true hit (erroneously, I believe) on the phrase
>search "fokker planck", so really all 3 tokens should be returned at offset 0
>and there should be no second token, so that phrase searches are preserved.
>
>Thanks in advance
>Steven Fuchs
>
>
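
To make the position problem in the quoted message concrete, here is a toy sketch in plain Python. It is not Lucene's or Solr's code: `build_index` and `phrase_match` are hypothetical helpers that only model exact phrase matching (slop 0) over token positions. It also assumes that "equations" lands at position 3 after the reported tokens, and that the query side leaves "fokker-planck" as a single lowercased term, since only the index analyzer is shown above.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each term to the set of positions it occupies in one field."""
    index = defaultdict(set)
    for term, position in tokens:
        index[term].add(position)
    return index


def phrase_match(index, phrase_terms):
    """Return True if the terms occur at consecutive positions (slop 0)."""
    first, *rest = phrase_terms
    for start in index.get(first, ()):
        if all(start + offset in index.get(term, ())
               for offset, term in enumerate(rest, 1)):
            return True
    return False


# Positions as reported in the analysis output above.
# ASSUMPTION: "equations" follows at position 3.
reported = [("fokker-planck", 1), ("fokker", 1),
            ("planck", 2), ("equations", 3)]

# Layout proposed in the quoted message: all three tokens share one position.
proposed = [("fokker-planck", 1), ("fokker", 1),
            ("planck", 1), ("equations", 2)]

for name, tokens in (("reported", reported), ("proposed", proposed)):
    index = build_index(tokens)
    print(name,
          phrase_match(index, ["fokker-planck", "equations"]),
          phrase_match(index, ["fokker", "planck"]))
```

Under these assumptions the reported layout misses "fokker-planck equations" but matches "fokker planck", while the proposed layout does the reverse, which is the behaviour described in the quoted message. A real unit test for the JIRA issue would need to run the actual analyzer chain rather than this model.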