It's the WordDelimiterFilterFactory in your filter chain that's removing the
punctuation entirely from your index, I think. By default it treats
non-alphanumeric characters such as "#" and "@" as delimiters: it splits
on them and then discards them.
Read up on what the WordDelimiter filter does and what its settings
are; decide how you want things to be tokenized in your index to get the
behavior you want; then either get WordDelimiter to do it that way by
passing it different arguments, or stop using WordDelimiter. Come back
with any questions after trying that!
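As one possible starting point, here is a rough sketch of the "different
arguments" route. It assumes a Solr version whose WordDelimiterFilterFactory
supports the types attribute (I believe 3.1 and later), and the file name
wdftypes.txt is just a placeholder I made up. The idea is to tell the filter
to treat "#" and "@" as letters so they stay attached to the word. I have not
tested this against your schema.

```xml
<!-- Sketch only: the same filter as in the quoted schema, plus a types file.
     "wdftypes.txt" is a hypothetical file name in the Solr conf directory. -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1" types="wdftypes.txt"/>

<!-- Assumed contents of wdftypes.txt, one mapping per line:
       \# => ALPHA
       @ => ALPHA
     The backslash is there because a line starting with a bare # is
     read as a comment in that file. -->
```

If you try this, make the same change in both the index and the query
analyzer, and reindex, since tweets already in the index were tokenized
with the old settings.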
On 8/4/2011 11:22 AM, Mohammad Shariq wrote:
I have indexed around 1 million tweets (using the "text" dataType).
When I search the tweets with "#" OR "@", I don't get the exact result.
E.g. when I search for "#ipad" OR "@ipad", I get results where ipad is
mentioned, skipping the "#" and "@".
Please suggest how to tune this, or which filterFactories to use to get the
desired result.
I am indexing the tweet as "text"; below is the "text" type from my
schema.xml.
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
</fieldType>