Glad you figured things out and found splainer useful! Pull requests, bugs, feature requests welcome!
https://github.com/o19s/splainer

Doug

On Monday, May 18, 2015, John Blythe <j...@curvolabs.com> wrote:

> Doug,
>
> very very cool tool you've made there. thanks so much for sharing!
>
> i ended up removing the ShingleFilterFactory and voila! things are back in
> good, working order with some great matching. i'm not 100% certain as to
> why shingling was so ineffective. i'm guessing the stacked terms created
> lower relevancy due to IDF on the *joint* terms/token?
>
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | j...@curvolabs.com
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713
>
> On Mon, May 18, 2015 at 4:57 PM, John Blythe <j...@curvolabs.com> wrote:
>
> > Doug,
> >
> > A couple things quickly:
> > - I'll check in to that. How would you go about testing things, direct
> >   URL? If so, how would you compose one of the examples above?
> > - yup, I used it extensively before testing scores to ensure that I was
> >   getting things parsed appropriately (segmenting off the unit of measure
> >   [mm] whilst still maintaining the decimal instead of breaking it up was
> >   my largest concern as of late)
> > - to that point, though, it looks like one of my blunders was in the
> >   synonyms file. i just referenced /analysis/ again and realized "CANN"
> >   was being transposed to "cannula" instead of "cannulated" #facepalm
> > - i'll be GLAD to use that! i'd been trying to use
> >   http://explain.solr.pl/ previously but it kept error'ing out on me :\
> >
> > thanks again, will report back!
> >
> > On Mon, May 18, 2015 at 4:47 PM, Doug Turnbull
> > <dturnb...@opensourceconnections.com> wrote:
> >
> >> Hey John,
> >>
> >> I think you likely do need to think about escaping the query operators.
> >> I doubt the Solr admin could tell the difference.
> >>
> >> For analysis, have you looked at the handy analysis tool in the Solr
> >> Admin UI? It's pretty indispensable for figuring out if an analyzed
> >> query matches an analyzed field.
> >>
> >> Outside of that, I can selfishly plug Splainer (http://splainer.io),
> >> which gives you more insight into the Solr relevance explain. You would
> >> paste in something like
> >> http://solr.quepid.com/solr/statedecoded/select?q=text:(deer%20hunting)
> >>
> >> Cheers!
> >> -Doug
> >>
> >> On Mon, May 18, 2015 at 3:02 PM, John Blythe <j...@curvolabs.com> wrote:
> >>
> >> > Thanks again for the speediness, Doug.
> >> >
> >> > Good to know on some of those things, not least of all the +
> >> > indicating a mandatory field and the parentheses. It seems like the
> >> > escaping is pretty robust in light of the product number.
> >> >
> >> > I'm thinking it has to be largely related to the analyzer. Check this
> >> > out, this time with more of a real world case for us. Searching for
> >> > "descript2: CANN SCREW PT 3.5X50MM" produces a top result that has
> >> > "Cannulated screw PT 4.0x40mm" as its description. There is a
> >> > document, though, that has the description "Cannulated screw PT
> >> > 3.5x50mm", which is exactly (minus lowercasing) the rendering that the
> >> > analyzer is producing (per the /analysis page). Why would 4.0x40 come
> >> > up first? The top four results have 4.0x[Something]. It's not till the
> >> > fifth result that you see a 3.5 something: "Cannulated screw PT
> >> > 3.5x105mm" at which point I'm saying WTF. So close, but then it
> >> > ignores the "50" for a "105" instead.
> >> >
> >> > Further, adding parentheses around the phrase, "descript2: (CANN SCREW
> >> > PT 3.5X50MM)", produces top results that have the correct dimensions
> >> > (3.5x50) but the wrong type. Instead of "cannulated" screws we see
> >> > "cortical."
> >> > I'm convinced Solr is trolling me at this point :p
> >> >
> >> > On Mon, May 18, 2015 at 2:34 PM, Doug Turnbull
> >> > <dturnb...@opensourceconnections.com> wrote:
> >> >
> >> > > You might just need some syntax help. Not sure what the Solr admin
> >> > > escapes, but many of the characters in your query actually have
> >> > > reserved meanings. Also, when a term appears without a fieldName:
> >> > > prefix directly in front of it, I believe it's going to search the
> >> > > default field (it's no longer attached to the field). You need to
> >> > > use parens to attach multiple terms to that field for search.
> >> > >
> >> > > I'd try to see if doing any of the following helps:
> >> > >
> >> > > Add parens to group terms to the field:
> >> > >
> >> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001-029-1298)
> >> > >
> >> > > Also keep in mind "+" means mandatory, and it's an operator on just
> >> > > one field. So in the above you're requiring that description and
> >> > > product number match the provided terms.
> >> > >
> >> > > Further, you may need to escape the "-" as that means "NOT".
> >> > > You can do that with the following:
> >> > >
> >> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001\-029\-1298)
> >> > >
> >> > > You can read more in the article on Solr query syntax:
> >> > > https://wiki.apache.org/solr/SolrQuerySyntax
> >> > >
> >> > > Hope that helps, for all I know your cut and paste didn't work and
> >> > > I'm assuming you have syntax issues :)
> >> > >
> >> > > -Doug
> >> > >
> >> > > On Mon, May 18, 2015 at 2:25 PM, John Blythe <j...@curvolabs.com>
> >> > > wrote:
> >> > >
> >> > > > Hey Doug,
> >> > > >
> >> > > > Thanks for the quick reply.
> >> > > >
> >> > > > No edismax just yet. Planning on getting there, but have been
> >> > > > trying to fine tune the 3 primary fields we use over the last
> >> > > > week or so before jumping into edismax and its nifty toolset to
> >> > > > help push our accuracy and precision even further (aside: is this
> >> > > > a good strategy?)
> >> > > >
> >> > > > For now I'm querying directly in the admin interface, doing
> >> > > > something like this:
> >> > > >
> >> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt + productnumber: 001-029-1298
> >> > > >
> >> > > > versus
> >> > > >
> >> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt
> >> > > >
> >> > > > Another interesting and likely related factor is the
> >> > > > description's lack of help. With the product number in place it
> >> > > > gets nailed even with stray zeros, 4's instead of 1's, etc.
> >> > > >
> >> > > > Without it, though, the querying just flat out sucks. For
> >> > > > instance, I just saw something akin to this:
> >> > > >
> >> > > > mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt
> >> > > >
> >> > > > that got nowhere near what it should have.
> >> > > > Straw would have a synonym to map to strawberry and would match
> >> > > > the document's description *exactly*, yet Solr would push out all
> >> > > > sorts of peripheral suggestions that didn't match strawberry or
> >> > > > were a different amount (.75pt, for instance). I know I'm no
> >> > > > expert, but I was thinking my analyzer was a bit better than
> >> > > > that :p
> >> > > >
> >> > > > On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull
> >> > > > <dturnb...@opensourceconnections.com> wrote:
> >> > > >
> >> > > > > > The maxScore is 772 when I remove the description.
> >> > > > > > I suppose the actual question, then, is if a low relevancy
> >> > > > > > score on one field hurts the rest of them / the cumulative
> >> > > > > > score,
> >> > > > >
> >> > > > > This depends a lot on how you're searching over these fields.
> >> > > > > Is this a (e)dismax query? Or a lucene query? Something else?
> >> > > > >
> >> > > > > Across fields there's query normalization, which attempts to
> >> > > > > take a sum of squares of IDFs of the search terms across the
> >> > > > > fields being searched. Adding/removing a field could impact
> >> > > > > query normalization.
> >> > > > >
> >> > > > > By removing a field, you also likely remove a boolean clause.
> >> > > > > By removing the clause, there's less of a chance the
> >> > > > > coordinating factor (known as coord) would punish your
> >> > > > > relevancy score.
> >> > > > >
> >> > > > > Otherwise, don't know -- perhaps you could give us more
> >> > > > > information on how you're searching your documents?
> >> > > > > Perhaps a sample Solr URL that shows how you're querying?
> >> > > > >
> >> > > > > Cheers,
> >> > > > > --
> >> > > > > *Doug Turnbull* | Search Relevance Consultant | OpenSource
> >> > > > > Connections, LLC | 240.476.9983 |
> >> > > > > http://www.opensourceconnections.com
> >> > > > > Author: Relevant Search <http://manning.com/turnbull> from
> >> > > > > Manning Publications
> >> > > > > This e-mail and all contents, including attachments, is
> >> > > > > considered to be Company Confidential unless explicitly stated
> >> > > > > otherwise, regardless of whether attachments are marked as
> >> > > > > such.
> >> > > > >
> >> > > > > On Mon, May 18, 2015 at 1:57 PM, John Blythe
> >> > > > > <j...@curvolabs.com> wrote:
> >> > > > >
> >> > > > > > Background:
> >> > > > > > I'm using Solr as a mechanism for search for users, but
> >> > > > > > before even getting to that point as a means of intelligent
> >> > > > > > inference more or less. Product data comes in and we're
> >> > > > > > hoping to match it to the correct known product without
> >> > > > > > having to use the user for confirmation/search.
> >> > > > > >
> >> > > > > > Problem:
> >> > > > > > I get a maxScore (with the correct result at the top) of
> >> > > > > > 618.22626 using the manufacturer's name, the product number,
> >> > > > > > and the product description. All of these items are coming
> >> > > > > > from a previous purchaser so we have to account for
> >> > > > > > manufacturer name variations, miskeying of product numbers,
> >> > > > > > and variances of descriptions. The maxScore is 772 when I
> >> > > > > > remove the description.
> >> > > > > >
> >> > > > > > My initial question is regarding relevancy scoring
> >> > > > > > (https://wiki.apache.org/solr/SolrRelevancyFAQ).
> >> > > > > > I get that many of the description's tokens will be found
> >> > > > > > throughout the other documents, thus keeping the relevancy
> >> > > > > > at bay per the IDF portion of the relevancy score. I suppose
> >> > > > > > the actual question, then, is if a low relevancy score on
> >> > > > > > one field hurts the rest of them / the cumulative score, or
> >> > > > > > if it simply keeps that field's contribution lower than it'd
> >> > > > > > otherwise be. I thought it was the latter, but the results I
> >> > > > > > mention above are making me think that the first scenario is
> >> > > > > > actually the case.
> >> > > > > >
> >> > > > > > Based on what I hear about the above, a follow up question
> >> > > > > > may be what in the world is wrong with my analyzer :)
> >> > > > > >
> >> > > > > > Thanks for any thoughts!
> >> > > > > >
> >> > > > > > Best,
> >> > > > > > John

--
*Doug Turnbull* | Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Relevant Search <http://manning.com/turnbull> from Manning
Publications
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.
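
For anyone landing on this thread later, here is a minimal sketch of the two syntax fixes discussed above: grouping a field's terms with parens and backslash-escaping reserved characters such as "-". The helper names (escape_term, field_clause, build_query) are hypothetical and not part of Solr or any client library, and the field names are simply the ones from John's example. Escaping a lone "&" is stricter than the parser requires, but a backslash in front of it is harmless.

```python
from urllib.parse import urlencode

# Characters with reserved meaning in the Lucene/Solr standard query parser.
# "&" and "|" are only operators when doubled; escaping them singly is safe.
LUCENE_SPECIAL = set('+-!(){}[]^"~*?:\\/&|')


def escape_term(text: str) -> str:
    """Backslash-escape every reserved character in a raw input string."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in text)


def field_clause(field: str, raw: str, required: bool = False) -> str:
    """Attach all terms to one field with parens; '+' makes the clause mandatory."""
    prefix = "+" if required else ""
    return f"{prefix}{field}:({escape_term(raw)})"


def build_query(clauses) -> str:
    """Join (field, raw_text, required) tuples into one query string."""
    return " ".join(field_clause(f, raw, req) for f, raw, req in clauses)


q = build_query([
    ("mfgname2", "Ben & Jerry's", False),
    ("descript1", "Strawberry Shortcake Ice Cream 1.5pt", True),
    ("productnumber", "001-029-1298", True),
])
print(q)

# URL-encode the query for a /select request; debugQuery=true asks Solr to
# include the relevance explain output. Host and collection are left out
# here because they depend on your installation.
print("/select?" + urlencode({"q": q, "debugQuery": "true"}))
```

The resulting URL (with your own host and collection in front) is the kind of thing you could paste into Splainer to inspect the explain output.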