I think it's the WordDelimiterFilterFactory in your filter chain that's removing the punctuation entirely from your index.

Read up on what the WordDelimiter filter does and what its settings are, then decide how you want things to be tokenized in your index to get the behavior you want. Either get WordDelimiter to do it that way by passing it different arguments, or stop using WordDelimiter. Come back with any questions after trying that!
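
As one possible starting point (a sketch, not a tested configuration): in Solr 3.1 and later, WordDelimiterFilterFactory accepts a types attribute that points at a file remapping character classes. The file name wdfftypes.txt below is my assumption, not something from your setup. Mapping "#" and "@" to ALPHA should make the filter treat them as part of the word rather than as delimiters to drop. You would need to make the same change in both the index and query analyzers and then reindex, and I'd check the result on the admin analysis page before relying on it.

```xml
<!-- Hypothetical variant of the WordDelimiter filter from the quoted schema;
     only the types attribute is new. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"
        types="wdfftypes.txt"/>

<!-- wdfftypes.txt (assumed file name, in the same conf directory as schema.xml)
     would contain these two lines:
       \# => ALPHA
       @ => ALPHA
     The backslash is needed because a bare # starts a comment in that file. -->
```

Note that with this mapping a plain search for "ipad" may no longer match "#ipad", so decide which behavior you actually want first.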


On 8/4/2011 11:22 AM, Mohammad Shariq wrote:
I have indexed around 1 million tweets (using the "text" dataType).
When I search for tweets with "#" OR "@", I don't get the exact result.
E.g., when I search for "#ipad" OR "@ipad", I get results where ipad is
mentioned, skipping the "#" and "@".
Please suggest how to tune this, or which filterFactories to use to get the
desired result.
I am indexing the tweets as "text"; below is the "text" fieldType from my
schema.xml.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
</fieldType>
