Re: Using Edismax

Jan Høydahl Tue, 14 Jun 2011 10:15:09 -0700

Hi,

Let's assume you're using Solr version 3.1.0 and an unmodified FieldType 
"text_rev". It looks like this:


    <fieldType name="text_rev" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
                maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
      </analyzer>
      ...

Also, let's assume that you have two docs in your index with these URLs:
A: "http://my.host/SPC265_SharePoint_2010.pptx"
B: "http://my.host/OpenTRs2010.xlsx"

Now you want to match only A and not B, and you attempt that using q=url:_2010

What happens here can easily be simulated by 
http://localhost:8983/solr/admin/analysis.jsp:


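Your WhitespaceTokenizer keeps the whole URL as one token, since the URL 
contains no whitespace.
The WordDelimiterFilter then splits that token on non-alphanumeric characters 
and on letter/number transitions, and drops the delimiters, including the "_".
Both URLs therefore produce a token "2010". Assuming the query analyzer also 
runs the WordDelimiterFilter, your query _2010 is reduced to 2010 as well, so 
you get a match on 2010 in both docs.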
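
If you cannot get at the analysis page, here is a rough Python sketch of the 
same effect. It is only an approximation of the text_rev index chain, not Solr 
code: the function name word_delimiter_parts is made up for illustration, and 
the stop word, catenate and reversed-wildcard steps are left out because they 
do not change the outcome for "2010".

```python
import re

def word_delimiter_parts(token):
    """Rough stand-in for WordDelimiterFilter with splitOnCaseChange=0,
    followed by LowerCaseFilter. Not the real Solr implementation."""
    parts = []
    # Split on runs of non-alphanumeric characters; the delimiters are dropped.
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # Inside each chunk, letter runs and digit runs become separate parts.
        parts.extend(re.findall(r"[A-Za-z]+|[0-9]+", chunk))
    return [p.lower() for p in parts]

doc_a = "http://my.host/SPC265_SharePoint_2010.pptx"
doc_b = "http://my.host/OpenTRs2010.xlsx"

for name, url in (("A", doc_a), ("B", doc_b)):
    print(name, word_delimiter_parts(url))

# The query term loses its underscore in the same way.
print("query", word_delimiter_parts("_2010"))
```

Both token lists contain "2010", and the query term comes out as just "2010", 
which is why doc B is returned too.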

What you need to do is design a new FieldType in your schema specifically for 
your need.
Choose a Tokenizer based on what you want to be your tokens.
My suggestion is like this:
    <fieldType name="urltype" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="_" replacement=" UNDERSCORE "/>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[:/\\.\\? ]"/>
      </analyzer>
    </fieldType>

Now your tokens will be "http my host SPC265 UNDERSCORE SharePoint UNDERSCORE 
2010 pptx"
A search for url:_2010 would match because the _ is replaced with a special 
token which can then be matched. Proof:
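
As a rough check outside Solr, the "urltype" chain can be mimicked in a few 
lines of Python. This is a sketch under assumptions, not Solr itself: the 
helper names urltype_tokens and contains_phrase are made up, and it assumes 
the query parser turns the two query tokens into a phrase query, which depends 
on your Solr version and configuration.

```python
import re

def urltype_tokens(text):
    """Mimic the suggested "urltype" analyzer (illustration only)."""
    # charFilter step: replace every "_" with " UNDERSCORE ".
    text = text.replace("_", " UNDERSCORE ")
    # tokenizer step: split on the same delimiter pattern as in the schema,
    # dropping empty tokens.
    return [t for t in re.split(r"[:/\\.\\? ]", text) if t]

def contains_phrase(tokens, phrase):
    """True if the phrase tokens occur consecutively in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

doc_a = urltype_tokens("http://my.host/SPC265_SharePoint_2010.pptx")
doc_b = urltype_tokens("http://my.host/OpenTRs2010.xlsx")
query = urltype_tokens("_2010")

print(doc_a)
print(doc_b)
print(query)
print("A matches:", contains_phrase(doc_a, query))
print("B matches:", contains_phrase(doc_b, query))
```

Under those assumptions doc A contains the token pair UNDERSCORE 2010 and doc 
B does not, because OpenTRs2010 stays a single token.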


You could do similar things for other special cases you wish to match. I assume 
that the normal case is that you want to match whole words like sharepoint or 
pptx, and that the _ matching is a special case.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 14. juni 2011, at 11.42, Tirthankar Chatterjee wrote:

> Eric
> Thx for the reply. But what can I do to avoid getting 2010. I wanted a phrase 
> query with underscore, so it would return results with underscore2010 only.
> 
> Sent from iPod
> 
> On Jun 13, 2011, at 3:47 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
> 
>> You haven't supplied the information that's really
>> needed to help here, please review:
>> 
>> http://wiki.apache.org/solr/UsingMailingLists
>> 
>> But at a guess your analysis chain contains
>> WordDelimiterFilterFactory, which is splitting
>> the input stream into tokens on letter/number
>> changes, and capitalization changes. So you're
>> getting "2010" indexed as a separate token and
>> you're also searching on it...
>> 
>> Best
>> Erick
>> 
>> On Mon, Jun 13, 2011 at 3:07 PM, Tirthankar Chatterjee
>> <tchatter...@commvault.com> wrote:
>>> We are using edismax for query and the query fired is (url:_2010)
>>> 
>>> http://redcarpet2.dm2.commvault.com:27000/solr/select/?q=url:%202010&version=2.2&start=0&rows=10&indent=on&defType=edismax
>>> 
>>> the url field is of type text_rev
>>> 
>>> Results that SOLR returns has 1 extra item which we don't want to get. How 
>>> do we achieve that?
>>> 
>>> Results:
>>> 
>>> SPC265_SharePoint_2010.pptx
>>> OpenTRs2010.xlsx(we don't want this to be returned)
>>> 
>>> 
>>> Thanks in advance!!!
>>> 
>>> Tirthankar
>>> 
>>> 
>>> ******************Legal Disclaimer***************************
>>> "This communication may contain confidential and privileged
>>> material for the sole use of the intended recipient. Any
>>> unauthorized review, use or distribution by others is strictly
>>> prohibited. If you have received the message in error, please
>>> advise the sender by reply email and delete the message. Thank
>>> you."
>>> *********************************************************
