Re: Indexing tweet and searching "@keyword" OR "#keyword"

Mohammad Shariq Wed, 10 Aug 2011 00:10:16 -0700

I tried tweaking "WordDelimiterFilterFactory", but it won't accept the # or @
symbols; they are still ignored entirely.
I need a solution, please suggest one.


On 4 August 2011 21:08, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> It's the WordDelimiterFilterFactory in your filter chain that's removing the
> punctuation entirely from your index, I think.
>
> Read up on what the WordDelimiter filter does and what its settings are;
> decide how you want things to be tokenized in your index to get the behavior
> you want; then either get WordDelimiter to do it that way by passing it
> different arguments, or stop using WordDelimiter. Come back with any
> questions after trying that!
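>
> A minimal sketch of the "different arguments" route, assuming your Solr
> version supports the types attribute on WordDelimiterFilterFactory (3.1 and
> later, I believe). The file name wdfftypes.txt is only an example, and I
> have not tested this against your data:
>
> ```xml
> <!-- schema.xml: the same filter as before, plus a types file.
>      The file name is hypothetical; any name in the conf directory works. -->
> <filter class="solr.WordDelimiterFilterFactory"
>         generateWordParts="1" generateNumberParts="1"
>         catenateWords="1" catenateNumbers="1" catenateAll="0"
>         splitOnCaseChange="1" types="wdfftypes.txt"/>
>
> <!-- wdfftypes.txt, mapping # and @ to letters so they are not delimiters:
>        \# => ALPHA
>        @ => ALPHA
>      The backslash is needed because a bare # starts a comment there. -->
> ```
>
> With that mapping, "#ipad" and "@ipad" should survive as whole tokens
> rather than being cut down to "ipad". Use the same setting in both the
> index and query analyzers, and reindex afterwards.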
>
>
>
> On 8/4/2011 11:22 AM, Mohammad Shariq wrote:
>
>> I have indexed around 1 million tweets (using the "text" dataType).
>> When I search the tweets with "#" or "@", I don't get exact results.
>> E.g., when I search for "#ipad" OR "@ipad", I get results where ipad is
>> mentioned, with the "#" and "@" ignored.
>> Please suggest how to tune this, or which filter factories to use, to get
>> the desired result.
>> I am indexing the tweets as "text"; below is the "text" field type from my
>> schema.xml.
>>
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>     <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
>>             minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="1" generateNumberParts="1"
>>             catenateWords="1" catenateNumbers="1"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SnowballPorterFilterFactory"
>>             protected="protwords.txt" language="English"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>     <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
>>             minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="1" generateNumberParts="1"
>>             catenateWords="1" catenateNumbers="1"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SnowballPorterFilterFactory"
>>             protected="protwords.txt" language="English"/>
>>   </analyzer>
>> </fieldType>
>>
>>


-- 
Thanks and Regards
Mohammad Shariq
