Re: Relevancy Scoring

Doug Turnbull Mon, 18 May 2015 17:23:03 -0700

Glad you figured things out and found splainer useful! Pull requests, bugs,
feature requests welcome!


https://github.com/o19s/splainer

Doug

On Monday, May 18, 2015, John Blythe <j...@curvolabs.com> wrote:

> Doug,
>
> very very cool tool you've made there. thanks so much for sharing!
>
> i ended up removing the shinglefilterfactory and voila! things are back in
> good, working order with some great matching. i'm not 100% certain as to
> why shingling was so ineffective. i'm guessing the stacked terms created
> lower relevancy due to IDF on the *joint* terms/tokens?
>
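
To make the shingling question concrete, here is a rough Python mimic of what a two-word shingle filter adds to a token stream, plus the idf formula from Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)). The function names, the index size and the document frequencies are invented for illustration, and this is not the ShingleFilterFactory code itself. It doesn't confirm the guess above; it only shows which extra terms end up in play and how differently they can be weighted.

```python
import math

def shingle(tokens, size=2, sep=" "):
    """Return the unigrams followed by word n-grams of `size` words,
    roughly what a shingle filter emits when unigrams are kept."""
    grams = [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return list(tokens) + grams

def classic_idf(doc_freq, num_docs):
    """idf as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

tokens = ["cannulated", "screw", "pt", "3.5x50mm"]
print(shingle(tokens))

# Hypothetical document frequencies in a hypothetical 10,000-document index.
num_docs = 10_000
for term, df in [("screw", 4000), ("cannulated screw", 300), ("pt 3.5x50mm", 2)]:
    print(f"{term!r}: idf={classic_idf(df, num_docs):.2f}")
```

The rarer a shingle is, the more it weighs when it matches. Whether that is what skewed the ranking here would have to be read off the explain output for the documents in question.
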
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | j...@curvolabs.com
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713
>
> On Mon, May 18, 2015 at 4:57 PM, John Blythe <j...@curvolabs.com> wrote:
>
> > Doug,
> >
> > A couple things quickly:
> > - I'll check in to that. How would you go about testing things, direct
> > URL? If so, how would you compose one of the examples above?
> > - yup, I used it extensively before testing scores to ensure that I was
> > getting things parsed appropriately (segmenting off the unit of measure
> > [mm] whilst still maintaining the decimal instead of breaking it up was
> > my largest concern as of late)
> > - to that point, though, it looks like one of my blunders was in the
> > synonyms file. i just referenced /analysis/ again and realized "CANN" was
> > being transposed to "cannula" instead of "cannulated" #facepalm
> > - i'll be GLAD to use that! i'd been trying to use http://explain.solr.pl/
> > previously but it kept error'ing out on me :\
> >
> > thanks again, will report back!
> >
> > On Mon, May 18, 2015 at 4:47 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> >
> >> Hey John,
> >>
> >> I think you likely do need to think about escaping the query operators.
> >> I doubt the Solr admin could tell the difference.
> >>
> >> For analysis, have you looked at the handy analysis tool in the Solr
> >> Admin UI? It's pretty indispensable for figuring out if an analyzed
> >> query matches an analyzed field.
> >>
> >> Outside of that, I can selfishly plug Splainer (http://splainer.io),
> >> which gives you more insight into the Solr relevance explain. You would
> >> paste in something like
> >> http://solr.quepid.com/solr/statedecoded/select?q=text:(deer%20hunting).
> >>
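
On composing such a URL by hand: a minimal sketch using only the Python standard library, assuming a hypothetical local host and a core named `products` (neither comes from this thread). `debugQuery=true` is the standard Solr parameter that adds the explain output to the response; the helper name `solr_select_url` is made up for illustration.

```python
from urllib.parse import urlencode, quote

def solr_select_url(base, q, **params):
    """Build a /select URL with every parameter percent-encoded."""
    query = {"q": q, **params}
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{base.rstrip('/')}/select?{urlencode(query, quote_via=quote)}"

# Hypothetical host and core name; substitute your own.
url = solr_select_url(
    "http://localhost:8983/solr/products",
    "descript2:(CANN SCREW PT 3.5X50MM)",
    debugQuery="true",
)
print(url)
```

Percent-encoding the whole `q` value keeps the colon, parentheses and spaces from being misread along the way; Solr decodes them before the query parser sees the string.
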
> >> Cheers!
> >> -Doug
> >>
> >> On Mon, May 18, 2015 at 3:02 PM, John Blythe <j...@curvolabs.com> wrote:
> >>
> >> > Thanks again for the speediness, Doug.
> >> >
> >> > Good to know on some of those things, not least of all the + indicating
> >> > a mandatory field and the parentheses. It seems like the escaping is
> >> > pretty robust in light of the product number.
> >> >
> >> > I'm thinking it has to be largely related to the analyzer. Check this
> >> > out, this time with more of a real world case for us. Searching for
> >> > "descript2: CANN SCREW PT 3.5X50MM" produces a top result that has
> >> > "Cannulated screw PT 4.0x40mm" as its description. There is a document,
> >> > though, that has the description of "Cannulated screw PT 3.5x50mm", the
> >> > exact same thing (minus lowercases) as the rendering that the analyzer
> >> > is producing (per the /analysis page). Why would 4.0x40 come up first?
> >> > The top four results have 4.0x[Something]. It's not till the fifth
> >> > result that you see a 3.5 something: "Cannulated screw PT 3.5x105mm" at
> >> > which point I'm saying WTF. So close, but then it ignores the "50" for
> >> > a "105" instead.
> >> >
> >> > Further, adding parentheses around the phrase, "descript2: (CANN SCREW
> >> > PT 3.5X50MM)", produces top results that have the correct dimensions
> >> > (3.5x50) but the wrong type. Instead of "cannulated" screws we see
> >> > "cortical." I'm convinced Solr is trolling me at this point :p
> >> >
> >> > On Mon, May 18, 2015 at 2:34 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> >> >
> >> > > You might just need some syntax help. Not sure what the Solr admin
> >> > > escapes, but much of the text in your query actually has reserved
> >> > > meaning. Also, when a term appears without a fieldName:value directly
> >> > > in front of it, I believe it's going to search the default field (it's
> >> > > no longer attached to the field). You need to use parens to attach
> >> > > multiple terms to that field for search.
> >> > >
> >> > > I'd try to see if doing any of the following helps:
> >> > >
> >> > > Add parens to group terms to the field:
> >> > >
> >> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001-029-1298)
> >> > >
> >> > > Also keep in mind "+" means mandatory, and it's an operator on just one
> >> > > field. So in the above you're requiring description and product number
> >> > > match the provided terms.
> >> > >
> >> > > Further, you may need to escape the "-" as that means "NOT". You can do
> >> > > that with the following:
> >> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001\-029\-1298)
> >> > >
> >> > > You can read more in the article on Solr query syntax
> >> > > https://wiki.apache.org/solr/SolrQuerySyntax
> >> > >
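
If the queries are built in code rather than typed into the admin UI, the grouping and escaping described above could be sketched like this. The helper names `escape_query_chars` and `field_clause` are made up for illustration, and the character list is my reading of the reserved characters in the Lucene query syntax, so treat it as an assumption and check it against the syntax page for your Solr version.

```python
import re

# Characters assumed to have reserved meaning in the Lucene/Solr query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')

def escape_query_chars(text):
    """Backslash-escape reserved characters in a raw field value."""
    return _SPECIAL.sub(r"\\\1", text)

def field_clause(field, value, required=False):
    """Attach all terms of `value` to `field` by wrapping them in parens."""
    prefix = "+" if required else ""
    return f"{prefix}{field}:({escape_query_chars(value)})"

query = " ".join([
    field_clause("mfgname2", "Ben & Jerry's"),
    field_clause("descript1", "Strawberry Shortcake Ice Cream 1.5pt", required=True),
    field_clause("productnumber", "001-029-1298", required=True),
])
print(query)
```

This only makes sense for raw values. Anything meant as an operator (a wildcard, a deliberate NOT) would be escaped along with everything else.
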
> >> > > Hope that helps, for all I know your cut and paste didn't work and I'm
> >> > > assuming you have syntax issues :)
> >> > >
> >> > > -Doug
> >> > >
> >> > > On Mon, May 18, 2015 at 2:25 PM, John Blythe <j...@curvolabs.com> wrote:
> >> > >
> >> > > > Hey Doug,
> >> > > >
> >> > > > Thanks for the quick reply.
> >> > > >
> >> > > > No edismax just yet. Planning on getting there, but have been trying
> >> > > > to fine tune the 3 primary fields we use over the last week or so
> >> > > > before jumping into edismax and its nifty toolset to help push our
> >> > > > accuracy and precision even further (aside: is this a good strategy?)
> >> > > >
> >> > > > For now I'm querying directly in the admin interface, doing something
> >> > > > like this:
> >> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt + productnumber: 001-029-1298
> >> > > >
> >> > > > versus
> >> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt
> >> > > >
> >> > > > Another interesting and likely related factor is the description's
> >> > > > lack of help. With the product number in place it gets nailed even
> >> > > > with stray zeros, 4's instead of 1's, etc.
> >> > > >
> >> > > > Without it, though, the querying just flat out sucks. For instance, I
> >> > > > just saw something akin to this:
> >> > > > mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt
> >> > > >
> >> > > > that got nowhere near what it should have. Straw would have a synonym
> >> > > > to map to strawberry and would match the document's description
> >> > > > *exactly*, yet Solr would push out all sorts of peripheral suggestions
> >> > > > that didn't match strawberry or were a different amount (.75pt, for
> >> > > > instance). I know I'm no expert, but I was thinking my analyzer was a
> >> > > > bit better than that :p
> >> > > >
> >> > > > On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> >> > > >
> >> > > > > > The maxScore is 772 when I remove the description.
> >> > > > > > I suppose the actual question, then, is if a low relevancy score
> >> > > > > > on one field hurts the rest of them / the cumulative score,
> >> > > > >
> >> > > > > This depends a lot on how you're searching over these fields. Is this
> >> > > > > a (e)dismax query? Or a lucene query? Something else?
> >> > > > >
> >> > > > > Across fields there's query normalization, which attempts to take a
> >> > > > > sum of squares of IDFs of the search terms across the fields being
> >> > > > > searched. Adding/removing a field could impact query normalization.
> >> > > > >
> >> > > > > By removing a field, you also likely remove a boolean clause. By
> >> > > > > removing the clause, there's less of a chance the coordinating factor
> >> > > > > (known as coord) would punish your relevancy score.
> >> > > > >
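
A small numeric sketch of those two factors, assuming the classic Lucene TF-IDF similarity, where queryNorm = 1 / sqrt(sumOfSquaredWeights) and coord = overlap / maxOverlap. The idf values and the three-clause query are hypothetical, chosen only to show the direction of the effect; they are not taken from the index discussed here.

```python
import math

def query_norm(term_weights, query_boost=1.0):
    """1 / sqrt(sum of squared term weights); each weight is idf * term boost."""
    sum_of_squares = query_boost ** 2 * sum(w ** 2 for w in term_weights)
    return 1.0 / math.sqrt(sum_of_squares)

def coord(overlap, max_overlap):
    """Fraction of the query's optional clauses that a document matched."""
    return overlap / max_overlap

# Hypothetical idf weights for the manufacturer, product number and description clauses.
with_description = [2.0, 5.0, 3.0]
without_description = [2.0, 5.0]

print(query_norm(with_description), query_norm(without_description))
print(coord(2, 3), coord(2, 2))
```

Dropping the third clause raises queryNorm and lets a document that matches the other two reach coord 1.0, so both factors push the absolute score up. queryNorm is the same for every document within one query, so by itself it moves numbers like maxScore without reordering the results.
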
> >> > > > > Otherwise, don't know -- perhaps you could give us more information
> >> > > > > on how you're searching your documents? Perhaps a sample Solr URL
> >> > > > > that shows how you're querying?
> >> > > > >
> >> > > > > Cheers,
> >> > > > > --
> >> > > > > *Doug Turnbull* | Search Relevance Consultant | OpenSource Connections,
> >> > > > > LLC | 240.476.9983 | http://www.opensourceconnections.com
> >> > > > > Author: Relevant Search <http://manning.com/turnbull> from Manning
> >> > > > > Publications
> >> > > > > This e-mail and all contents, including attachments, is considered to
> >> > > > > be Company Confidential unless explicitly stated otherwise, regardless
> >> > > > > of whether attachments are marked as such.
> >> > > > >
> >> > > > > On Mon, May 18, 2015 at 1:57 PM, John Blythe <j...@curvolabs.com> wrote:
> >> > > > >
> >> > > > > > Background:
> >> > > > > > I'm using Solr as a mechanism for search for users, but before even
> >> > > > > > getting to that point as a means of intelligent inference more or
> >> > > > > > less. Product data comes in and we're hoping to match it to the
> >> > > > > > correct known product without having to use the user for
> >> > > > > > confirmation/search.
> >> > > > > >
> >> > > > > > Problem:
> >> > > > > > I get a maxScore (with the correct result at the top) of 618.22626
> >> > > > > > using the manufacturer's name, the product number, and the product
> >> > > > > > description. All of these items are coming from a previous purchaser
> >> > > > > > so we have to account for manufacturer name variations, miskeying of
> >> > > > > > product numbers, and variances of descriptions. The maxScore is 772
> >> > > > > > when I remove the description.
> >> > > > > >
> >> > > > > > My initial question is regarding relevancy scoring
> >> > > > > > (https://wiki.apache.org/solr/SolrRelevancyFAQ). I get that many of
> >> > > > > > the description's tokens will be found throughout the other
> >> > > > > > documents, thus keeping the relevancy at bay per the IDF portion of
> >> > > > > > the relevancy score. I suppose the actual question, then, is if a low
> >> > > > > > relevancy score on one field hurts the rest of them / the cumulative
> >> > > > > > score, or if it simply keeps that field's contribution lower than
> >> > > > > > it'd otherwise be. I thought it was the latter, but the results I
> >> > > > > > mention above are making me think that the first scenario is actually
> >> > > > > > the case.
> >> > > > > >
> >> > > > > > Based on what I hear about the above, a follow up question may be
> >> > > > > > what in the world is wrong with my analyzer :)
> >> > > > > >
> >> > > > > > Thanks for any thoughts!
> >> > > > > >
> >> > > > > > Best,
> >> > > > > > John
> >> > > > > >


-- 
*Doug Turnbull* | Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Relevant Search <http://manning.com/turnbull> from Manning
Publications
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.
