Hi,

Let's assume you're using Solr version 3.1.0 and an unmodified FieldType "text_rev". It looks like this:

<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  ...

Also, let's assume that you have two docs in your index with these URLs:

A: "http://my.host/SPC265_SharePoint_2010.pptx"
B: "http://my.host/OpenTRs2010.xlsx"

Now you want to match only A and not B, and you attempt that using q=url:_2010

What happens here can easily be simulated with http://localhost:8983/solr/admin/analysis.jsp:
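
If the admin page is not at hand, the effect can be roughly approximated outside Solr. The following Python sketch is only a crude stand-in for the text_rev chain: the helper name simple_word_delimiter is made up for illustration, it ignores catenation, stop words and the reversed-wildcard tokens, and it assumes the query side is split the same way as the index side.

```python
import re

def simple_word_delimiter(token):
    """Very rough stand-in for WordDelimiterFilter + LowerCaseFilter.

    Splits on non-alphanumeric characters and on letter/digit
    transitions, then lowercases. Not the real Solr implementation.
    """
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    return [p.lower() for p in parts]

docs = {
    "A": "http://my.host/SPC265_SharePoint_2010.pptx",
    "B": "http://my.host/OpenTRs2010.xlsx",
}

# Assumption: the query "_2010" goes through the same kind of splitting.
query_terms = simple_word_delimiter("_2010")

for name, url in docs.items():
    # WhitespaceTokenizer leaves each URL as one token,
    # so the filter sees the whole URL at once.
    terms = simple_word_delimiter(url)
    matched = all(t in terms for t in query_terms)
    print(name, matched, terms)
```

Under these assumptions both A and B report a match, because each URL yields a standalone "2010" term and the underscore in the query is thrown away.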

Your Tokenizer keeps the whole URL as one token. The WordDelimiterFilter then splits it on all kinds of things (non-alphanumeric characters, letter/number transitions), also removing the "_". Both URLs therefore end up with a token "2010", your query _2010 is reduced to 2010 as well, and you get a match on both docs.

What you need to do is design a new FieldType in your schema specifically for your need. Choose a Tokenizer based on what you want your tokens to be. My suggestion is something like this:

<fieldType name="urltype" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" UNDERSCORE " />
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[:/\\.\\? ]" />
  </analyzer>
</fieldType>

Now your tokens for doc A will be "http my host SPC265 UNDERSCORE SharePoint UNDERSCORE 2010 pptx".

A search for url:_2010 will match A because the _ is replaced with a special token which can then be matched. Doc B has no UNDERSCORE token in front of 2010 (its token is OpenTRs2010), so it no longer matches.

Proof: [analysis.jsp screenshot not preserved]
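
As a rough cross-check outside Solr, the suggested "urltype" chain can be imitated in a few lines of Python. This is a sketch, not Solr's code: url_tokens and phrase_match are hypothetical helpers, and it assumes the query parser turns the two query tokens into a phrase query (tokens must be adjacent and in order).

```python
import re

def url_tokens(text):
    """Approximate the proposed "urltype" analyzer outside Solr."""
    # PatternReplaceCharFilter: turn "_" into a standalone marker word.
    text = text.replace("_", " UNDERSCORE ")
    # PatternTokenizer: split on ':', '/', '.', '?' and space; drop empties.
    return [t for t in re.split(r"[:/.? ]", text) if t]

def phrase_match(doc_tokens, query_tokens):
    """True if query_tokens occur as a consecutive run in doc_tokens."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))

docs = {
    "A": "http://my.host/SPC265_SharePoint_2010.pptx",
    "B": "http://my.host/OpenTRs2010.xlsx",
}

# Assumption: "_2010" is analyzed with the same chain at query time.
query = url_tokens("_2010")

for name, url in docs.items():
    print(name, phrase_match(url_tokens(url), query))
```

With these assumptions only A matches: its tokens contain UNDERSCORE directly followed by 2010, while B keeps OpenTRs2010 as a single token.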

You could do similar things for other special cases you wish to match. I assume that the normal case is that you want to match whole words like sharepoint or pptx, and that the _ matching is a special case.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 14 June 2011, at 11:42, Tirthankar Chatterjee wrote:

> Eric
> Thx for the reply. But what can I do to avoid getting 2010. I wanted a phrase
> query with underscore, so it would return results with underscore2010 only.
>
> Sent from iPod
>
> On Jun 13, 2011, at 3:47 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>> You haven't supplied the information that's really
>> needed to help here, please review:
>>
>> http://wiki.apache.org/solr/UsingMailingLists
>>
>> But at a guess your analysis chain contains
>> WordDelimiterFilterFactory, which is splitting
>> the input stream into tokens on letter/number
>> changes, and capitalization changes. So you're
>> getting "2010" indexed as a separate token and
>> you're also searching on it...
>>
>> Best
>> Erick
>>
>> On Mon, Jun 13, 2011 at 3:07 PM, Tirthankar Chatterjee
>> <tchatter...@commvault.com> wrote:
>>> We are using edismax for query and the query fired is (url:_2010)
>>>
>>> http://redcarpet2.dm2.commvault.com:27000/solr/select/?q=url:%202010&version=2.2&start=0&rows=10&indent=on&defType=edismax
>>>
>>> the url field is of type text_rev
>>>
>>> Results that SOLR returns has 1 extra item which we don't want to get. How
>>> do we achieve that?
>>>
>>> Results:
>>>
>>> SPC265_SharePoint_2010.pptx
>>> OpenTRs2010.xlsx (we don't want this to be returned)
>>>
>>>
>>> Thanks in advance!!!
>>>
>>> Tirthankar