I have been reading threads all day regarding this topic and nothing
seems to work the way it says it should. :) I appreciate any and all
help in this matter.
Solr 4 is working perfectly for me in all regards, with this one exception.
My requirement from Solr 4 is very simple. I am storing a document
like a job description in a text_general field.
I have added a SynonymFilterFactory filter so that I can map C++
=> cplusplus and C# => csharp during both indexing and querying.
Here is the field definition:
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="punctuation-whitelist.txt" ignoreCase="true"
            expand="false"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="punctuation-whitelist.txt" ignoreCase="true"
            expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Here are the contents of punctuation-whitelist.txt:
c++ => cplusplus
C# => csharp
I have only one document indexed for the purpose of this test. When I
search for resume_text:C++, I get the following result, which is the
same result I get when I search for just resume_text:c.
You can see from the highlighting that Solr is matching on the "C" only:
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">20</int>
  </lst>
  <result name="response" numFound="1" start="0" maxScore="0.16273327">
    <doc>
      <arr name="resume_text">
        <str>C++ Developer with c# experience, including .net</str>
      </arr>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="208645">
      <arr name="resume_text">
        <str><em>C</em>++ Developer with <em>c</em># experience, including .net</str>
      </arr>
    </lst>
  </lst>
</response>
If I use the Analysis tool in the Solr Web UI, putting "C#" or "C++"
into the Index or Query boxes translates to just "C" in all filters
and tokenizers in the analysis output.
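My guess from the analysis output is that StandardTokenizerFactory
throws away the "++" and "#" before SynonymFilterFactory ever runs, so
my mappings never get to see "c++" or "c#". If that is right, one
direction might be a char filter, since those run on the raw text ahead
of the tokenizer. Here is a rough, untested sketch of what I mean. The
names text_code and mapping-punctuation.txt are placeholders I made up,
and I have not checked how this behaves with highlighting:

```xml
<!-- Untested sketch. "text_code" and "mapping-punctuation.txt" are
     made-up names for illustration, not part of my current schema. -->
<fieldType name="text_code" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run on the raw text before the tokenizer, so
         "C++" could be rewritten before the punctuation is dropped. -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-punctuation.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- mapping-punctuation.txt would hold quoted pairs. The char filter
     matches case-sensitively, so both cases are listed:
       "C++" => "cplusplus"
       "c++" => "cplusplus"
       "C#"  => "csharp"
       "c#"  => "csharp"
-->
```

With a single analyzer element the same chain would apply at index and
query time, which is what I want for these terms.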
Can someone please explain the _best_ way to accomplish what I am
trying to do, which is to accurately index, search, and highlight text
containing terms like C++ and C#? I am looking for the "right way", and
it's okay if I have started down the wrong path.
:)
Thank you.
Dave