Re: Indexing tweet and searching "@keyword" OR "#keyword"

Erick Erickson Wed, 10 Aug 2011 07:20:10 -0700

Please look more carefully at the documentation for
WordDelimiterFilterFactory (WDDF), specifically:

split on intra-word delimiters (all non alpha-numeric characters).


WordDelimiterFilterFactory will always throw away non alpha-numeric
characters; you can't tell it to do otherwise. Try some of the other
tokenizers/analyzers to get what you want, and also look at the
admin/analysis page to see what the exact effects are of your
fieldType definitions.

Here's a great place to start:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

You probably want something like WhitespaceTokenizerFactory
followed by LowerCaseFilterFactory or some such...
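
As a rough, untested sketch of that idea (the fieldType name "text_tweet"
is made up for illustration, and the same analyzer is assumed for both
index and query time):

```xml
<!-- Sketch only: "text_tweet" is a hypothetical name, not from the original schema -->
<fieldType name="text_tweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on whitespace only, so "#ipad" and "@ipad" survive as whole tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercases tokens so "#iPad" and "#ipad" match each other -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that trailing punctuation stays attached with this chain (e.g. "#ipad,"
is indexed with the comma), so check the output on the admin/analysis page
before relying on it.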

But I really question whether this is what you want either. Do you
really want a search on "ipad" to *fail* to match input of "#ipad"? Or
vice-versa?

KeywordTokenizerFactory is probably not the place you want to start:
that tokenizer doesn't break anything up. You happen to be getting
separate tokens only because of WDDF, which, as you see, can't
process things the way you want.


Best
Erick

On Wed, Aug 10, 2011 at 3:09 AM, Mohammad Shariq <shariqn...@gmail.com> wrote:
> I tried tweaking "WordDelimiterFactory" but it won't accept # OR @ symbols
> and ignores them totally.
> I need a solution, please suggest.
>
> On 4 August 2011 21:08, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>
>> It's the WordDelimiterFactory in your filter chain that's removing the
>> punctuation entirely from your index, I think.
>>
>> Read up on what the WordDelimiter filter does, and what its settings are;
>> decide how you want things to be tokenized in your index to get the behavior
>> you want; either get WordDelimiter to do it that way by passing it
>> different arguments, or stop using WordDelimiter; come back with any
>> questions after trying that!
>>
>>
>>
>> On 8/4/2011 11:22 AM, Mohammad Shariq wrote:
>>
>>> I have indexed around 1 million tweets (using the "text" dataType).
>>> When I search the tweets with "#" or "@" I don't get the exact result.
>>> E.g., when I search for "#ipad" OR "@ipad" I get results where ipad is
>>> mentioned, ignoring the "#" and "@".
>>> Please suggest how to tune this, or which filter factories to use, to get
>>> the desired result.
>>> I am indexing the tweets as "text"; below is the "text" definition from my
>>> schema.xml.
>>>
>>>
>>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>     <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
>>>             minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>             generateWordParts="1" generateNumberParts="1"
>>>             catenateWords="1" catenateNumbers="1"
>>>             catenateAll="0" splitOnCaseChange="1"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.SnowballPorterFilterFactory"
>>>             protected="protwords.txt" language="English"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>     <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
>>>             minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>             generateWordParts="1" generateNumberParts="1"
>>>             catenateWords="1" catenateNumbers="1"
>>>             catenateAll="0" splitOnCaseChange="1"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.SnowballPorterFilterFactory"
>>>             protected="protwords.txt" language="English"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>>
>
>
> --
> Thanks and Regards
> Mohammad Shariq
>
