Thanks very much for your advice. I think I now better understand how to
make better use of solr. I have tested spellchecker and it looks like it let
me to achieve better results and hopefully we will satisfy the client.

In my solution I will change user query to use or not to use phonetic fields
based on results from spellcheck.collation and frequency of words. If I
wouldn't be sure what is better then I'll ask user through "did you mean"
and log his reply to make better choices in future.

Once again thanks a lot guys.

This is my example of query to spellchecker:

http://localhost:8983/solr/select?spellcheck=true&q=cannon&rows=0&spellcheck.collate=true&spellcheck.count=10&spellcheck.onlyMorePopular=true&spellcheck.extendedResults=on

-- 
Rafał "RaVbaker" Piekarski.

web: http://ja.ravbaker.net
mail: ravba...@gmail.com
jid/xmpp/aim: ravba...@gmail.com
mobile: +48-663-808-481


On Sun, Aug 21, 2011 at 6:36 PM, Erick Erickson <erickerick...@gmail.com>wrote:

> I think Sujit has hit the nail on the head. Any program you try to write
> that tries to guess what the user *really* meant will require endless
> tinkering and *still* won't be right. If you only knew how annoying I
> find Google's attempts to "help".....
>
> So perhaps concentrating on some interaction with the user, who is,
> after all, the only one who really knows what they want is the best
> approach.
>
> Best
> Erick
>
> On Sun, Aug 21, 2011 at 12:26 PM, Sujit Pal <sujit....@comcast.net> wrote:
> > Would it make sense to have a "Did you mean?" type of functionality for
> > which you use the EdgeNGram and Metaphone filters /if/ you don't get
> > appropriate results for the user query?
> >
> > So when user types "cannon" and the application notices that there are
> > no cannons for sale in the index (0 results with standard analysis), it
> > then makes another query with the EdgeNGram and/or Metaphone filters and
> > come back with:
> >
> > Did you mean "Canon", "Canine"?
> >
> > Clicking on "Canon" or "Canine" would fire off a query for these terms.
> >
> > That way your application doesn't guess what is right, it goes back and
> > asks the user what he wants.
> >
> > -sujit
> >
> > On Sun, 2011-08-21 at 17:19 +0200, Rafał Piekarski (RaVbaker) wrote:
> >> Thanks for reply. I know that sometimes meeting all clients needs would
> be
> >> impossible but then client recalls that competitive (commercial) product
> >> already do that (but has other problems, like performance). And then I'm
> >> obligated to try more tricks. :/
> >>
> >> I'm currently using Solr 3.1 but thinking about migrating to latest
> stable
> >> version - 3.3.
> >>
> >> You correct, to meet client needs I have also used some hacks with
> boosting
> >> queries (`bq` and `bf` parameters) but I omit that to make XMLs clearer.
> >>
> >> You mentioned faceting. This is also one of my(my client?) problems. In
> the
> >> user interface they want to have 5 categories for products. Those 5
> should
> >> be most relevance ones. When I get those with highest counts for one
> word
> >> queries they are most of the time "not that which should be there". For
> >> example with phrase "ipad" which actually has only 12 most relevant
> products
> >> in category "tablets" but phonetic APT matches also part of model name
> for
> >> hundreds of UPS power supplies and bath tubes . And these are on the
> list,
> >> not tablets. :/
> >>
> >> But you mentioned autocomplete which is something what I haven't watched
> >> yet. I'll try with that and show it to my client.
> >>
> >> --
> >> Rafał "RaVbaker" Piekarski.
> >>
> >> web: http://ja.ravbaker.net
> >> mail: ravba...@gmail.com
> >> jid/xmpp/aim: ravba...@gmail.com
> >> mobile: +48-663-808-481
> >>
> >>
> >> On Sun, Aug 21, 2011 at 4:20 PM, Erick Erickson <
> erickerick...@gmail.com>wrote:
> >>
> >> > The root problem here is "This is unacceptable for my client". The
> first
> >> > thing I'd suggest is that you work with your client and get them to
> define
> >> > what is acceptable. You'll be forever changing things (to no good
> purpose)
> >> > if all they can say is "that's not right".
> >> >
> >> > For instance, you apparently have two competing requirements:
> >> > 1> try to correct users input, which inevitably increases the results
> >> > returned
> >> > 2> narrow the search to the "right" results.
> >> >
> >> > You can't have both every time!
> >> >
> >> > So you could try something like going with a more-restrictive search
> >> > (no metaphone
> >> > comparison) first and, if the results returned weren't sufficient
> >> > firing the "broader" query
> >> > back, without showing the too-small results first.
> >> >
> >> > You could work with your client and see if what they really want is
> >> > just the most relevant
> >> > results at the top of the list, in which case you can play with the
> >> > dismax field boosts
> >> > (by the way, what version of Solr are you using?)
> >> >
> >> > You could work with the client to understand the user experience if
> >> > you use autocomplete
> >> > and/or faceting etc. to guide their explorations.
> >> >
> >> > You could...
> >> >
> >> > But none of that will help unless and until you and your client can
> >> > agree what is the
> >> > correct behavior ahead of time
> >> >
> >> > Best
> >> > Erick
> >> >
> >> > On Sat, Aug 20, 2011 at 11:04 AM, Rafał Piekarski (RaVbaker)
> >> > <ravba...@gmail.com> wrote:
> >> > > Hi all,
> >> > >
> >> > > I have a database of e-commerce products (5M) and trying to build a
> >> > search
> >> > > solution for it.
> >> > >
> >> > > I have used steemer, edgengram and doublemetaphone phonetic fields
> for
> >> > > omiting common typos in queries.  It works quite good with dismax
> QParser
> >> > > for queries longer than one word: "tv lc20", "sny psp 3001", "cannon
> 5d"
> >> > > etc. For not having too many results I manipulated with `mm`
> parameter.
> >> > But
> >> > > when user type a single word like "ipad", "cannon". I always having
> a lot
> >> > of
> >> > > results (~60000). This is unacceptable for my client. He would like
> to
> >> > have
> >> > > then only the `good` results. That particulary match specific query.
> It's
> >> > > hard to acomplish for me cause of use doublemetaphone field which
> >> > converts
> >> > > words like "apt", "opt" and "ipad" and even "ipod" to the same
> phonetic
> >> > word
> >> > > - APT. And then all of these  words are matched fairly the same
> gives me
> >> > > huge amount of results. Similar problems I have with other words
> like
> >> > > "canon", "canine" and "cannon" which are KNN in phonetic way. But
> >> > lexically
> >> > > have different meanings: "canon" - camera, "canine" - cat food ,
> "cannon"
> >> > -
> >> > > may be a misspell for canon or part of book title about cannon
> weapons.
> >> > >
> >> > > My first idea was to make a second requestHandler without searching
> in
> >> > > *_phonetic fields. And use it for queries with only one word. But it
> >> > didn't
> >> > > worked cause sometimes I want to correct user even if there is only
> one
> >> > word
> >> > > and suggest him something better. Query "cannon" is a good example.
> I'm
> >> > > fairly sure that most of the time when someone type "cannon" it
> would be
> >> > a
> >> > > typo for "canon" and I want to show user also CANON cameras. That's
> why I
> >> > > can't use second requestHandler for one word queries.
> >> > >
> >> > > I'm looking for any ideas how could I change my requestHandler.
> >> > >
> >> > > My regular queries are: http://localhost:8983/solr/select?q=cannon
> >> > >
> >> > > Below I put my configuration for requestHandler and schema.xml.
> >> > >
> >> > >
> >> > >
> >> > > solrconfig.xml:
> >> > >
> >> > > <requestHandler name="search" class="solr.SearchHandler"
> default="true">
> >> > >   <lst name="defaults">
> >> > > <str name="q.alt">*:*</str>
> >> > >     <str name="defType">dismax</str>
> >> > >     <str name="qf">
> >> > >         title^1.3 title_text^0.9 title_phonetic^0.74 title_ng^0.17
> >> > >         title_ngram^0.54
> >> > >         producer_name^0.9 producer_name_text^0.89
> >> > >         category_path_text^0.8 category_path_phonetic^0.65
> >> > >         description^0.60 description_text^0.56
> >> > >     </str>
> >> > >     <str name="pf">title_text^1.1 title^1.2 description^0.3</str>
> >> > >     <int name="ps">3</int>
> >> > >     <str name="tie">0.1</str>
> >> > >     <str name="mm">2&lt;100% 3&lt;-1 5&lt;85%</str>
> >> > >
> >> > >     <str name="fl">*,score</str>
> >> > > </lst>
> >> > > </requestHandler>
> >> > >
> >> > >
> >> > > schema.xml:
> >> > >
> >> > > <?xml version="1.0" encoding="UTF-8" ?>
> >> > > <schema name="XX" version="1.2">
> >> > >    <types>
> >> > >        <fieldType name="int" class="solr.TrieIntField"
> precisionStep="0"
> >> > > omitNorms="true" positionIncrementGap="0" />
> >> > >    <fieldType name="long" class="solr.TrieLongField"
> precisionStep="0"
> >> > > omitNorms="true" positionIncrementGap="0"/>
> >> > >        <fieldType name="string" class="solr.StrField"
> >> > > sortMissingLast="true" omitNorms="true" />
> >> > >        <fieldType name="boolean" class="solr.BoolField"
> >> > > sortMissingLast="true" omitNorms="true" />
> >> > >        <fieldType name="decimal" class="solr.TrieFloatField"
> >> > > precisionStep="2" omitNorms="true" positionIncrementGap="0" />
> >> > >
> >> > >        <fieldType name="text" class="solr.TextField"
> >> > > positionIncrementGap="100">
> >> > >            <analyzer>
> >> > >                <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >> > >                <tokenizer class="solr.WhitespaceTokenizerFactory" />
> >> > >        <!-- Case insensitive stop word removal.
> >> > >          add enablePositionIncrements=true in both the index and
> query
> >> > >          analyzers to leave a 'gap' for more accurate phrase
> queries.
> >> > >        -->
> >> > >        <filter class="solr.StopFilterFactory"
> >> > >                ignoreCase="true"
> >> > >                                words="stopwords_pl.txt"
> >> > >                enablePositionIncrements="true"
> >> > >                />
> >> > >        <filter class="solr.WordDelimiterFilterFactory"
> >> > > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> > > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >> > >
> >> > >                <filter class="solr.LowerCaseFilterFactory" />
> >> > >                <filter class="solr.TrimFilterFactory" />
> >> > > <filter class="solr.StempelPolishStemFilterFactory" />
> >> > >            </analyzer>
> >> > >        </fieldType>
> >> > >
> >> > >    <fieldType name="text_gen" class="solr.TextField"
> >> > > positionIncrementGap="100">
> >> > >            <analyzer>
> >> > >                <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >> > >                <tokenizer class="solr.WhitespaceTokenizerFactory" />
> >> > >        <filter class="solr.StopFilterFactory"
> >> > >                ignoreCase="true"
> >> > >                words="stopwords_pl.txt"
> >> > >                enablePositionIncrements="true"
> >> > >                />
> >> > >        <filter class="solr.WordDelimiterFilterFactory"
> >> > > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> > > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >> > >
> >> > >                <filter class="solr.LowerCaseFilterFactory" />
> >> > >                <filter class="solr.TrimFilterFactory" />
> >> > >            </analyzer>
> >> > >        </fieldType>
> >> > >
> >> > >
> >> > >    <fieldtype name="phonetic" stored="false" indexed="true"
> >> > > class="solr.TextField" >
> >> > >      <analyzer>
> >> > >        <tokenizer class="solr.StandardTokenizerFactory"/>
> >> > >        <filter class="solr.StopFilterFactory"
> >> > >                ignoreCase="true"
> >> > >                words="stopwords_pl.txt"
> >> > >                enablePositionIncrements="true"
> >> > >                />
> >> > >        <filter class="solr.DoubleMetaphoneFilterFactory"
> inject="false"
> >> > > maxCodeLength="8"/>
> >> > >      </analyzer>
> >> > >    </fieldtype>
> >> > >
> >> > >  <fieldtype name="ngram" class="solr.TextField">
> >> > >   <analyzer type="index">
> >> > >                <tokenizer class="solr.StandardTokenizerFactory"/>
> >> > >      <filter class="solr.LowerCaseFilterFactory"/>
> >> > >        <filter class="solr.StopFilterFactory"
> >> > >                ignoreCase="true"
> >> > >                words="stopwords_pl.txt"
> >> > >                enablePositionIncrements="true"
> >> > >                />
> >> > >                <filter class="solr.WordDelimiterFilterFactory"
> >> > > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> > > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >> > >
> >> > >                        <filter class="solr.NGramFilterFactory"
> >> > > minGramSize="2" maxGramSize="3" />
> >> > >                    </analyzer>
> >> > >                    <analyzer type="query">
> >> > >                <tokenizer class="solr.StandardTokenizerFactory"/>
> >> > >      <filter class="solr.LowerCaseFilterFactory"/>
> >> > >                        <filter class="solr.NGramFilterFactory"
> >> > > minGramSize="2" maxGramSize="3" />
> >> > >                    </analyzer>
> >> > >                 </fieldtype>
> >> > >
> >> > > <fieldtype name="edgengram" class="solr.TextField">
> >> > >   <analyzer>
> >> > >                <tokenizer class="solr.StandardTokenizerFactory"/>
> >> > >      <filter class="solr.LowerCaseFilterFactory"/>
> >> > >        <filter class="solr.StopFilterFactory"
> >> > >                ignoreCase="true"
> >> > >                words="stopwords_pl.txt"
> >> > >                enablePositionIncrements="true"
> >> > >                />
> >> > >         <filter class="solr.WordDelimiterFilterFactory"
> >> > > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> > > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >> > >
> >> > >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
> >> > > maxGramSize="15" side="front"/>
> >> > >
> >> > >  </analyzer>
> >> > >                 </fieldtype>
> >> > >
> >> > >
> >> > >    </types>
> >> > >    <fields>
> >> > >        <field name="id" type="string" indexed="true" stored="true"
> >> > > required="true" />
> >> > >        <field name="title" type="text_gen" indexed="true"
> stored="true"
> >> > > required="true" />
> >> > >        <field name="category_path" type="string" indexed="true"
> >> > > stored="true" />
> >> > >
> >> > >        <field name="producer_name" type="string" indexed="true"
> >> > > stored="false" />
> >> > >        <field name="description" type="text_gen" indexed="false"
> >> > > stored="true" />
> >> > >
> >> > >  <dynamicField name="*_text" type="text" indexed="true"
> stored="false" />
> >> > >
> >> > >  <dynamicField name="*_ascii" type="text_ascii" indexed="true"
> >> > > stored="false" />
> >> > >  <dynamicField name="*_phonetic" type="phonetic" indexed="true"
> >> > > stored="false" />
> >> > >  <dynamicField name="*_ng" type="edgengram" indexed="true"
> stored="false"
> >> > />
> >> > >
> >> > >  <dynamicField name="*_ngram" type="ngram" indexed="true"
> stored="false"
> >> > />
> >> > >
> >> > >
> >> > >    </fields>
> >> > >    <uniqueKey>id</uniqueKey>
> >> > >    <defaultSearchField>title</defaultSearchField>
> >> > >    <solrQueryParser defaultOperator="AND" />
> >> > >
> >> > >    <copyField source="title" dest="title_sort" />
> >> > >  <copyField source="title" dest="title_text" />
> >> > > <copyField source="title" dest="title_ascii" />
> >> > >    <copyField source="title" dest="title_phonetic" />
> >> > >    <copyField source="title" dest="title_ng" />
> >> > >    <copyField source="title" dest="title_ngram"/>
> >> > >
> >> > >  <copyField source="producer_name" dest="producer_name_text" />
> >> > >  <copyField source="producer_name" dest="producer_name_phonetic" />
> >> > >
> >> > >    <copyField source="category_path" dest="category_path_text" />
> >> > > <copyField source="category_path" dest="category_path_phonetic" />
> >> > >   <copyField source="description" dest="description_text" />
> >> > >
> >> > > </schema>
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Rafał "RaVbaker" Piekarski.
> >> > >
> >> > > web: http://ja.ravbaker.net
> >> > > mail: ravba...@gmail.com
> >> > > jid/xmpp/aim: ravba...@gmail.com
> >> > > mobile: +48-663-808-481
> >> > >
> >> >
> >
> >
>

Reply via email to