Re: Too many results in dismax queries with one word

Erick Erickson Sun, 21 Aug 2011 09:37:22 -0700

I think Sujit has hit the nail on the head. Any program you try to write
that tries to guess what the user *really* meant will require endless
tinkering and *still* won't be right. If you only knew how annoying I
find Google's attempts to "help".....


So perhaps concentrating on some interaction with the user, who is,
after all, the only one who really knows what they want, is the best approach.
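
For what it's worth, here is a rough sketch of the flow Sujit describes: run
the strict query first and, only when it finds nothing, run the fuzzy query
and turn its hits into "Did you mean?" candidates instead of showing them as
results. Everything in it is an assumption made for illustration: the split
of the quoted `qf` into STRICT_QF and FUZZY_QF, the function names, the
`run_query` callable (which would wrap the HTTP call to /select and return
the docs), and the use of Python's difflib to pick the candidate word out of
each title. Treat it as a starting point, not as the way to do it.

```python
import difflib
from typing import Callable, Dict, List

Params = Dict[str, str]
Doc = Dict[str, str]

# Hypothetical split of the qf from the quoted solrconfig.xml:
# exact/stemmed fields versus the phonetic and n-gram fields.
STRICT_QF = "title^1.3 title_text^0.9 producer_name^0.9 producer_name_text^0.89"
FUZZY_QF = "title_phonetic^0.74 title_ng^0.17 title_ngram^0.54"


def build_params(query: str, qf: str, rows: int = 10) -> Params:
    """Build request parameters for a dismax query against /select."""
    return {"q": query, "defType": "dismax", "qf": qf,
            "fl": "id,title", "rows": str(rows), "wt": "json"}


def search_or_suggest(query: str,
                      run_query: Callable[[Params], List[Doc]],
                      max_suggestions: int = 5) -> Dict[str, list]:
    """Return strict results, or "did you mean" terms if there are none."""
    strict_docs = run_query(build_params(query, STRICT_QF))
    if strict_docs:
        return {"results": strict_docs, "did_you_mean": []}

    # Nothing matched strictly: ask the fuzzy fields, but only to collect
    # candidate terms that the user can click on.
    fuzzy_docs = run_query(build_params(query, FUZZY_QF, rows=50))
    wanted = query.lower()
    suggestions: List[str] = []
    for doc in fuzzy_docs:
        by_lower = {w.lower(): w for w in doc.get("title", "").split()}
        # Assumption: the title word closest to the query is the candidate.
        close = difflib.get_close_matches(wanted, list(by_lower), n=1, cutoff=0.6)
        if close and close[0] != wanted and by_lower[close[0]] not in suggestions:
            suggestions.append(by_lower[close[0]])
        if len(suggestions) >= max_suggestions:
            break
    return {"results": [], "did_you_mean": suggestions}
```

The caller shows `results` when the list is non-empty and otherwise renders
each entry of `did_you_mean` as a link that fires a new strict query for that
term, so the application never guesses on the user's behalf.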

Best
Erick

On Sun, Aug 21, 2011 at 12:26 PM, Sujit Pal <sujit....@comcast.net> wrote:
> Would it make sense to have a "Did you mean?" type of functionality for
> which you use the EdgeNGram and Metaphone filters /if/ you don't get
> appropriate results for the user query?
>
> So when the user types "cannon" and the application notices that there are
> no cannons for sale in the index (0 results with standard analysis), it
> then makes another query with the EdgeNGram and/or Metaphone filters and
> comes back with:
>
> Did you mean "Canon", "Canine"?
>
> Clicking on "Canon" or "Canine" would fire off a query for these terms.
>
> That way your application doesn't guess what is right, it goes back and
> asks the user what he wants.
>
> -sujit
>
> On Sun, 2011-08-21 at 17:19 +0200, Rafał Piekarski (RaVbaker) wrote:
>> Thanks for the reply. I know that sometimes meeting all of a client's needs
>> is impossible, but then the client points out that a competing (commercial)
>> product already does that (though it has other problems, like performance).
>> And then I'm obligated to try more tricks. :/
>>
>> I'm currently using Solr 3.1 but thinking about migrating to the latest
>> stable version, 3.3.
>>
>> You are correct: to meet the client's needs I have also used some hacks with
>> boosting queries (`bq` and `bf` parameters), but I omitted them to keep the
>> XML clearer.
>>
>> You mentioned faceting. This is also one of my (my client's?) problems. In
>> the user interface they want to show 5 product categories, and those 5
>> should be the most relevant ones. When I take the categories with the
>> highest counts for one-word queries, most of the time they are not the ones
>> that should be there. For example, the query "ipad" has only 12 truly
>> relevant products, in the category "tablets", but the phonetic code APT also
>> matches part of the model name of hundreds of UPS power supplies and bath
>> tubes. And these are on the list, not tablets. :/
>>
>> But you mentioned autocomplete, which is something I haven't looked at
>> yet. I'll try that and show it to my client.
>>
>> --
>> Rafał "RaVbaker" Piekarski.
>>
>> web: http://ja.ravbaker.net
>> mail: ravba...@gmail.com
>> jid/xmpp/aim: ravba...@gmail.com
>> mobile: +48-663-808-481
>>
>>
>> On Sun, Aug 21, 2011 at 4:20 PM, Erick Erickson
>> <erickerick...@gmail.com> wrote:
>>
>> > The root problem here is "This is unacceptable for my client". The first
>> > thing I'd suggest is that you work with your client and get them to define
>> > what is acceptable. You'll be forever changing things (to no good purpose)
>> > if all they can say is "that's not right".
>> >
>> > For instance, you apparently have two competing requirements:
>> > 1> try to correct users' input, which inevitably increases the results returned
>> > 2> narrow the search to the "right" results.
>> >
>> > You can't have both every time!
>> >
>> > So you could try something like going with a more-restrictive search
>> > (no metaphone comparison) first and, if the results returned weren't
>> > sufficient, firing the "broader" query back, without showing the
>> > too-small results first.
>> >
>> > You could work with your client and see if what they really want is
>> > just the most relevant results at the top of the list, in which case
>> > you can play with the dismax field boosts
>> > (by the way, what version of Solr are you using?)
>> >
>> > You could work with the client to understand the user experience if
>> > you use autocomplete and/or faceting etc. to guide their explorations.
>> >
>> > You could...
>> >
>> > But none of that will help unless and until you and your client can
>> > agree what is the correct behavior ahead of time.
>> >
>> > Best
>> > Erick
>> >
>> > On Sat, Aug 20, 2011 at 11:04 AM, Rafał Piekarski (RaVbaker)
>> > <ravba...@gmail.com> wrote:
>> > > Hi all,
>> > >
>> > > I have a database of e-commerce products (5M) and am trying to build a
>> > > search solution for it.
>> > >
>> > > I have used stemmer, edgengram and doublemetaphone phonetic fields to
>> > > tolerate common typos in queries. It works quite well with the dismax
>> > > QParser for queries longer than one word: "tv lc20", "sny psp 3001",
>> > > "cannon 5d" etc. To avoid too many results I tuned the `mm` parameter.
>> > > But when the user types a single word like "ipad" or "cannon", I always
>> > > get a lot of results (~60000). This is unacceptable for my client. He
>> > > would like to have only the `good` results then, the ones that
>> > > specifically match the query. That is hard for me to accomplish because
>> > > the doublemetaphone field converts words like "apt", "opt", "ipad" and
>> > > even "ipod" to the same phonetic code, APT. All of these words then
>> > > match about equally, which gives me a huge number of results. I have
>> > > similar problems with other words like "canon", "canine" and "cannon",
>> > > which are all KNN phonetically but have different meanings lexically:
>> > > "canon" - camera, "canine" - cat food, "cannon" - may be a misspelling
>> > > of canon or part of a book title about cannon weapons.
>> > >
>> > > My first idea was to make a second requestHandler that does not search
>> > > the *_phonetic fields, and use it for one-word queries. But it didn't
>> > > work, because sometimes I want to correct the user even if there is
>> > > only one word and suggest something better. The query "cannon" is a good
>> > > example. I'm fairly sure that most of the time when someone types
>> > > "cannon" it is a typo for "canon", and I want to show the user CANON
>> > > cameras as well. That's why I can't use a second requestHandler for
>> > > one-word queries.
>> > >
>> > > I'm looking for any ideas on how I could change my requestHandler.
>> > >
>> > > My regular queries are: http://localhost:8983/solr/select?q=cannon
>> > >
>> > > Below is my configuration for the requestHandler and schema.xml.
>> > >
>> > >
>> > >
>> > > solrconfig.xml:
>> > >
>> > > <requestHandler name="search" class="solr.SearchHandler" default="true">
>> > >   <lst name="defaults">
>> > >     <str name="q.alt">*:*</str>
>> > >     <str name="defType">dismax</str>
>> > >     <str name="qf">
>> > >         title^1.3 title_text^0.9 title_phonetic^0.74 title_ng^0.17
>> > >         title_ngram^0.54
>> > >         producer_name^0.9 producer_name_text^0.89
>> > >         category_path_text^0.8 category_path_phonetic^0.65
>> > >         description^0.60 description_text^0.56
>> > >     </str>
>> > >     <str name="pf">title_text^1.1 title^1.2 description^0.3</str>
>> > >     <int name="ps">3</int>
>> > >     <str name="tie">0.1</str>
>> > >     <str name="mm">2&lt;100% 3&lt;-1 5&lt;85%</str>
>> > >
>> > >     <str name="fl">*,score</str>
>> > >   </lst>
>> > > </requestHandler>
>> > >
>> > >
>> > > schema.xml:
>> > >
>> > > <?xml version="1.0" encoding="UTF-8" ?>
>> > > <schema name="XX" version="1.2">
>> > >    <types>
>> > >        <fieldType name="int" class="solr.TrieIntField" precisionStep="0"
>> > >                   omitNorms="true" positionIncrementGap="0" />
>> > >        <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
>> > >                   omitNorms="true" positionIncrementGap="0"/>
>> > >        <fieldType name="string" class="solr.StrField"
>> > >                   sortMissingLast="true" omitNorms="true" />
>> > >        <fieldType name="boolean" class="solr.BoolField"
>> > >                   sortMissingLast="true" omitNorms="true" />
>> > >        <fieldType name="decimal" class="solr.TrieFloatField"
>> > >                   precisionStep="2" omitNorms="true" positionIncrementGap="0" />
>> > >
>> > >        <fieldType name="text" class="solr.TextField"
>> > >                   positionIncrementGap="100">
>> > >            <analyzer>
>> > >                <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> > >                <tokenizer class="solr.WhitespaceTokenizerFactory" />
>> > >                <!-- Case insensitive stop word removal.
>> > >                  add enablePositionIncrements=true in both the index and query
>> > >                  analyzers to leave a 'gap' for more accurate phrase queries.
>> > >                -->
>> > >                <filter class="solr.StopFilterFactory"
>> > >                        ignoreCase="true"
>> > >                        words="stopwords_pl.txt"
>> > >                        enablePositionIncrements="true"
>> > >                        />
>> > >                <filter class="solr.WordDelimiterFilterFactory"
>> > >                        generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> > >                        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>> > >
>> > >                <filter class="solr.LowerCaseFilterFactory" />
>> > >                <filter class="solr.TrimFilterFactory" />
>> > >                <filter class="solr.StempelPolishStemFilterFactory" />
>> > >            </analyzer>
>> > >        </fieldType>
>> > >
>> > >        <fieldType name="text_gen" class="solr.TextField"
>> > >                   positionIncrementGap="100">
>> > >            <analyzer>
>> > >                <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> > >                <tokenizer class="solr.WhitespaceTokenizerFactory" />
>> > >                <filter class="solr.StopFilterFactory"
>> > >                        ignoreCase="true"
>> > >                        words="stopwords_pl.txt"
>> > >                        enablePositionIncrements="true"
>> > >                        />
>> > >                <filter class="solr.WordDelimiterFilterFactory"
>> > >                        generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> > >                        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>> > >
>> > >                <filter class="solr.LowerCaseFilterFactory" />
>> > >                <filter class="solr.TrimFilterFactory" />
>> > >            </analyzer>
>> > >        </fieldType>
>> > >
>> > >
>> > >        <fieldtype name="phonetic" stored="false" indexed="true"
>> > >                   class="solr.TextField" >
>> > >            <analyzer>
>> > >                <tokenizer class="solr.StandardTokenizerFactory"/>
>> > >                <filter class="solr.StopFilterFactory"
>> > >                        ignoreCase="true"
>> > >                        words="stopwords_pl.txt"
>> > >                        enablePositionIncrements="true"
>> > >                        />
>> > >                <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"
>> > >                        maxCodeLength="8"/>
>> > >            </analyzer>
>> > >        </fieldtype>
>> > >
>> > >        <fieldtype name="ngram" class="solr.TextField">
>> > >            <analyzer type="index">
>> > >                <tokenizer class="solr.StandardTokenizerFactory"/>
>> > >                <filter class="solr.LowerCaseFilterFactory"/>
>> > >                <filter class="solr.StopFilterFactory"
>> > >                        ignoreCase="true"
>> > >                        words="stopwords_pl.txt"
>> > >                        enablePositionIncrements="true"
>> > >                        />
>> > >                <filter class="solr.WordDelimiterFilterFactory"
>> > >                        generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> > >                        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>> > >
>> > >                <filter class="solr.NGramFilterFactory"
>> > >                        minGramSize="2" maxGramSize="3" />
>> > >            </analyzer>
>> > >            <analyzer type="query">
>> > >                <tokenizer class="solr.StandardTokenizerFactory"/>
>> > >                <filter class="solr.LowerCaseFilterFactory"/>
>> > >                <filter class="solr.NGramFilterFactory"
>> > >                        minGramSize="2" maxGramSize="3" />
>> > >            </analyzer>
>> > >        </fieldtype>
>> > >
>> > >        <fieldtype name="edgengram" class="solr.TextField">
>> > >            <analyzer>
>> > >                <tokenizer class="solr.StandardTokenizerFactory"/>
>> > >                <filter class="solr.LowerCaseFilterFactory"/>
>> > >                <filter class="solr.StopFilterFactory"
>> > >                        ignoreCase="true"
>> > >                        words="stopwords_pl.txt"
>> > >                        enablePositionIncrements="true"
>> > >                        />
>> > >                <filter class="solr.WordDelimiterFilterFactory"
>> > >                        generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> > >                        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>> > >
>> > >                <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
>> > >                        maxGramSize="15" side="front"/>
>> > >
>> > >            </analyzer>
>> > >        </fieldtype>
>> > >
>> > >
>> > >    </types>
>> > >    <fields>
>> > >        <field name="id" type="string" indexed="true" stored="true"
>> > >               required="true" />
>> > >        <field name="title" type="text_gen" indexed="true" stored="true"
>> > >               required="true" />
>> > >        <field name="category_path" type="string" indexed="true"
>> > >               stored="true" />
>> > >
>> > >        <field name="producer_name" type="string" indexed="true"
>> > >               stored="false" />
>> > >        <field name="description" type="text_gen" indexed="false"
>> > >               stored="true" />
>> > >
>> > >        <dynamicField name="*_text" type="text" indexed="true" stored="false" />
>> > >
>> > >        <dynamicField name="*_ascii" type="text_ascii" indexed="true"
>> > >                      stored="false" />
>> > >        <dynamicField name="*_phonetic" type="phonetic" indexed="true"
>> > >                      stored="false" />
>> > >        <dynamicField name="*_ng" type="edgengram" indexed="true"
>> > >                      stored="false" />
>> > >
>> > >        <dynamicField name="*_ngram" type="ngram" indexed="true"
>> > >                      stored="false" />
>> > >
>> > >
>> > >
>> > >    </fields>
>> > >    <uniqueKey>id</uniqueKey>
>> > >    <defaultSearchField>title</defaultSearchField>
>> > >    <solrQueryParser defaultOperator="AND" />
>> > >
>> > >    <copyField source="title" dest="title_sort" />
>> > >    <copyField source="title" dest="title_text" />
>> > >    <copyField source="title" dest="title_ascii" />
>> > >    <copyField source="title" dest="title_phonetic" />
>> > >    <copyField source="title" dest="title_ng" />
>> > >    <copyField source="title" dest="title_ngram"/>
>> > >
>> > >    <copyField source="producer_name" dest="producer_name_text" />
>> > >    <copyField source="producer_name" dest="producer_name_phonetic" />
>> > >
>> > >    <copyField source="category_path" dest="category_path_text" />
>> > >    <copyField source="category_path" dest="category_path_phonetic" />
>> > >    <copyField source="description" dest="description_text" />
>> > >
>> > > </schema>
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Rafał "RaVbaker" Piekarski.
>> > >
>> > > web: http://ja.ravbaker.net
>> > > mail: ravba...@gmail.com
>> > > jid/xmpp/aim: ravba...@gmail.com
>> > > mobile: +48-663-808-481
>> > >
>> >
>
>
