Re: No documents found for some queries with special chars like m&m

Utkarsh Sengar Tue, 27 Aug 2013 17:17:19 -0700

> Use a different tokenizer, possibly one of the regex ones.
> fake it with phrase queries.
> Take a really good look at the various filter combinations. It's
   possible that WhitespaceTokenizer and WordDelimiterFilterFactory
   might be able to do good things.
Will try to play with these two options.


> Clearly define whether this is capability that you really need.
Yes, this is a needed feature. Some of our queries are at&t, h&m, m&m.
Returning an empty response is not a good experience.
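
For queries like these it is worth ruling out the URL layer first. A raw & inside q= ends the parameter, so the server may only ever see q=m. Below is a minimal Python sketch of building the request with the query percent-encoded. The helper names build_select_url and q_as_received are hypothetical, SERVER and the prodinfo core are the placeholders used elsewhere in this thread, and Python's own query-string parser is only standing in for whatever parsing happens on the server side:

```python
from urllib.parse import parse_qs, urlencode

# Placeholder host and core name, as used elsewhere in this thread.
BASE_URL = "http://SERVER/solr/prodinfo/select"


def build_select_url(user_query):
    """Hypothetical helper: build a select URL with the query percent-encoded."""
    params = {"q": user_query, "wt": "json", "indent": "true"}
    return BASE_URL + "?" + urlencode(params)


def q_as_received(query_string):
    """Return the q value(s) that Python's parser extracts from a raw query
    string. This is a stand-in for server-side parsing, not Solr code."""
    return parse_qs(query_string).get("q")


if __name__ == "__main__":
    for q in ("at&t", "h&m", "m&m"):
        print(build_select_url(q))
    # Unencoded: the & splits the parameter, so q is cut short.
    print(q_as_received("q=m&m"))
    # Encoded: the full string survives.
    print(q_as_received("q=m%26m"))
```

This only addresses transport. Whether m&m then matches anything still depends on the analysis chain.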

I also tried:

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="0"
        preserveOriginal="1"
        types="wdfftypes.txt"/>


With wdfftypes.txt:
& => ALPHA
\u0026 => ALPHA
$ => DIGIT
% => DIGIT
. => DIGIT
\u002C => DIGIT


But it didn't work.
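
One possible reason, assuming StandardTokenizerFactory is still the tokenizer in that chain: the field analysis quoted below shows ST already emitting m, m, so the & is gone before WordDelimiterFilterFactory and its types file ever run. The following Python toy model illustrates the difference between a tokenizer that drops punctuation and one that splits only on whitespace. It is not Lucene code, and both functions are hypothetical stand-ins that only approximate the real tokenizers:

```python
import re


def standard_like(text):
    """Toy stand-in for StandardTokenizer: keep letter/digit runs
    (allowing inner apostrophes) and drop all other punctuation."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)


def whitespace_like(text):
    """Toy stand-in for WhitespaceTokenizer: split on whitespace only."""
    return text.split()


if __name__ == "__main__":
    for sample in ("m&m", "at&t", "o'reilly", "M & M's Milk Chocolate"):
        print(sample, standard_like(sample), whitespace_like(sample))
```

If the real chain behaves like the first function, no types mapping can help, because the filter never sees the &. Note also that with whitespace splitting, a title such as "M & M's ..." yields a lone & token rather than m&m, so indexed text and query may still not line up.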

Thanks,
-Utkarsh




On Tue, Aug 27, 2013 at 3:07 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: Is there a way I can make "m&m" index as one string AND also keep
> StandardTokenizerFactory since I need it for other searches.
>
> In a word, no. You get one and only one tokenizer per field. But there
> are lots of options:
> > Use a different tokenizer, possibly one of the regex ones.
> > fake it with phrase queries.
> > Take a really good look at the various filter combinations. It's
>    possible that WhitespaceTokenizer and WordDelimiterFilterFactory
>    might be able to do good things.
> > Clearly define whether this is capability that you really need.
>
> This last is my recurring plea to ensure that the effort is of real benefit
> to the user and not just something someone noticed that's actually
> only useful 0.001% of the time.
>
> Best
> Erick
>
>
> On Tue, Aug 27, 2013 at 5:00 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>
> > Yup, the query "o'reilly" worked after adding WDF to the index analyser.
> >
> >
> > However, "m&m" and "m\&m" don't work.
> > Field analysis for "m&m" says:
> > ST m, m
> > WDF m, m
> >
> > ST m, m
> > WDF m, m
> >
> > So essentially & is dropped at both index and query time. My guess is that
> > the standard tokenizer is the problem. As the documentation says:
> >
> >
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
> > Example: "I.B.M. 8.5 can't!!!" ==> ALPHANUM: "I.B.M.", NUM:"8.5",
> > ALPHANUM:"can't"
> >
> > The char "&" will be ignored I guess.
> >
> > *So, my question is:*
> > Is there a way I can make "m&m" index as one string AND also keep
> > StandardTokenizerFactory since I need it for other searches.
> >
> > Thanks,
> > -Utkarsh
> >
> >
> > On Tue, Aug 27, 2013 at 11:44 AM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> >
> > > Thanks for the info.
> > >
> > > 1. http://SERVER/solr/prodinfo/select?q=o%27reilly&wt=json&indent=true&debugQuery=true
> > > returns:
> > >
> > > {
> > >   "responseHeader":{
> > >     "status":0,
> > >     "QTime":16,
> > >     "params":{
> > >       "debugQuery":"true",
> > >       "indent":"true",
> > >       "q":"o'reilly",
> > >       "wt":"json"}},
> > >   "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
> > >   },
> > >   "debug":{
> > >     "rawquerystring":"o'reilly",
> > >     "querystring":"o'reilly",
> > >     "parsedquery":"MultiPhraseQuery(allText:\"o'reilly (reilly oreilly)\")",
> > >     "parsedquery_toString":"allText:\"o'reilly (reilly oreilly)\"",
> > >     "QParser":"LuceneQParser",
> > >     "explain":{}
> > >    }
> > > }
> > >
> > >
> > >
> > > 2. Analysis gives this: http://i.imgur.com/IPEiiEQ.png I assume this
> > > means the tokens are the same for "o'reilly".
> > > 3. I tried escaping the apostrophe, but it doesn't help:
> > > http://SERVER/solr/prodinfo/select?q=o\%27reilly&wt=json&indent=true
> > >
> > > I will add WordDelimiterFilterFactory for index and see if it fixes the
> > > problem.
> > >
> > > Thanks,
> > > -Utkarsh
> > >
> > >
> > >
> > > On Mon, Aug 26, 2013 at 3:15 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > >> First thing to do is attach &debug=query to your queries and look at the
> > >> parsed output.
> > >>
> > >> Second thing to do is look at the admin/analysis page and see what happens
> > >> at index and query time to things like o'reilly. You have
> > >> WordDelimiterFilterFactory configured in your query analysis chain but not
> > >> in your index analysis chain. My bet is that you're getting different
> > >> tokens at query and index time...
> > >>
> > >> Third thing is that you need to escape the & character (%26 in a URL). It's
> > >> probably being interpreted as a parameter delimiter on the URL, and Solr
> > >> ignores params it doesn't understand.
> > >>
> > >> Best
> > >> Erick
> > >>
> > >>
> > >> On Mon, Aug 26, 2013 at 5:08 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> > >>
> > >> > Some of the queries (not all) with special chars return no documents.
> > >> >
> > >> > Example: queries returning no documents
> > >> > q=m&m (this can be explained: when I search for "m m", no documents are
> > >> > returned)
> > >> > q=o'reilly (when I search for "o reilly", I get documents back)
> > >> >
> > >> >
> > >> > Queries returning documents:
> > >> > q=hello&world (document matched is "Hello World: A Life in Ham Radio")
> > >> >
> > >> >
> > >> > My questions are:
> > >> > 1. What's wrong with "o'reilly"? What changes do I need in my field type?
> > >> > 2. How can I make the query "m&m" work?
> > >> > My index has a bunch of M&M's docs like: "M & M's Milk Chocolate Candy
> > >> > Coated Peanuts  19.2 oz" and "M and Ms Chocolate Candies - Peanut - 1 Bag
> > >> > (42 oz)"
> > >> >
> > >> >
> > >> > Field type:
> > >> >         <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
> > >> >             <analyzer type="index">
> > >> >                 <tokenizer class="solr.StandardTokenizerFactory"/>
> > >> >                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >> >                         words="stopwords.txt" enablePositionIncrements="true"/>
> > >> >                 <filter class="solr.LowerCaseFilterFactory"/>
> > >> >                 <filter class="solr.EnglishMinimalStemFilterFactory"/>
> > >> >                 <filter class="solr.ASCIIFoldingFilterFactory"/>
> > >> >                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >> >             </analyzer>
> > >> >             <analyzer type="query">
> > >> >                 <filter class="solr.WordDelimiterFilterFactory"
> > >> >                         generateWordParts="1" generateNumberParts="1"
> > >> >                         catenateWords="1" catenateNumbers="1"
> > >> >                         catenateAll="0" preserveOriginal="1"/>
> > >> >                 <tokenizer class="solr.StandardTokenizerFactory"/>
> > >> >                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >> >                         words="stopwords.txt" enablePositionIncrements="true"/>
> > >> >                 <filter class="solr.LowerCaseFilterFactory"/>
> > >> >                 <filter class="solr.EnglishMinimalStemFilterFactory"/>
> > >> >                 <filter class="solr.ASCIIFoldingFilterFactory"/>
> > >> >                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >> >             </analyzer>
> > >> >         </fieldType>
> > >> >
> > >> >
> > >> > --
> > >> > Thanks,
> > >> > -Utkarsh
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > -Utkarsh
> > >
> >
> >
> >
> > --
> > Thanks,
> > -Utkarsh
> >
>



-- 
Thanks,
-Utkarsh
