Doug, very cool tool you've made there. Thanks so much for sharing!
I ended up removing the ShingleFilterFactory and voila! Things are back in good, working order with some great matching. I'm not 100% certain as to why shingling was so ineffective. I'm guessing the stacked terms created lower relevancy due to IDF on the *joint* terms/tokens?

--
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, May 18, 2015 at 4:57 PM, John Blythe <j...@curvolabs.com> wrote:

> Doug,
>
> A couple things quickly:
> - I'll check into that. How would you go about testing things, direct URL? If so, how would you compose one of the examples above?
> - Yup, I used it extensively before testing scores to ensure that I was getting things parsed appropriately (segmenting off the unit of measure [mm] whilst still maintaining the decimal instead of breaking it up was my largest concern as of late).
> - To that point, though, it looks like one of my blunders was in the synonyms file. I just referenced /analysis/ again and realized "CANN" was being mapped to "cannula" instead of "cannulated" #facepalm
> - I'll be GLAD to use that! I'd been trying to use http://explain.solr.pl/ previously but it kept erroring out on me :\
>
> Thanks again, will report back!
>
> On Mon, May 18, 2015 at 4:47 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
>
>> Hey John,
>>
>> I think you likely do need to think about escaping the query operators. I doubt the Solr admin could tell the difference.
>>
>> For analysis, have you looked at the handy analysis tool in the Solr Admin UI? It's pretty indispensable for figuring out if an analyzed query matches an analyzed field.
>>
>> Outside of that, I can selfishly plug Splainer (http://splainer.io), which gives you more insight into the Solr relevance explain. You would paste in something like http://solr.quepid.com/solr/statedecoded/select?q=text:(deer%20hunting).
>>
>> Cheers!
>> -Doug
>>
>> On Mon, May 18, 2015 at 3:02 PM, John Blythe <j...@curvolabs.com> wrote:
>>
>> > Thanks again for the speediness, Doug.
>> >
>> > Good to know on some of those things, not least of all the + indicating a mandatory field and the parentheses. It seems like the escaping is pretty robust in light of the product number.
>> >
>> > I'm thinking it has to be largely related to the analyzer. Check this out, this time with more of a real-world case for us. Searching for "descript2: CANN SCREW PT 3.5X50MM" produces a top result that has "Cannulated screw PT 4.0x40mm" as its description. There is a document, though, that has the description "Cannulated screw PT 3.5x50mm", the exact same rendering (minus lowercasing) that the analyzer is producing (per the /analysis page). Why would 4.0x40 come up first? The top four results have 4.0x[Something]. It's not till the fifth result that you see a 3.5 something: "Cannulated screw PT 3.5x105mm", at which point I'm saying WTF. So close, but then it ignores the "50" for a "105" instead.
>> >
>> > Further, adding parentheses around the phrase, "descript2: (CANN SCREW PT 3.5X50MM)", produces top results that have the correct dimensions (3.5x50) but the wrong type. Instead of "cannulated" screws we see "cortical."
>> > I'm convinced Solr is trolling me at this point :p
>> >
>> > On Mon, May 18, 2015 at 2:34 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
>> >
>> > > You might just need some syntax help. Not sure what the Solr admin escapes, but many of the characters in your query actually have reserved meaning. Also, when a term appears without a fieldName: directly in front of it, I believe it's going to search the default field (it's no longer attached to the field). You need to use parens to attach multiple terms to that field for search.
>> > >
>> > > I'd try to see if doing any of the following helps:
>> > >
>> > > Add parens to group terms to the field:
>> > >
>> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001-029-1298)
>> > >
>> > > Also keep in mind "+" means mandatory, and it's an operator on just one field. So in the above you're requiring that description and product number match the provided terms.
>> > >
>> > > Further, you may need to escape the "-" as that means "NOT". You can do that with the following:
>> > >
>> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001\-029\-1298)
>> > >
>> > > You can read more in the article on Solr query syntax: https://wiki.apache.org/solr/SolrQuerySyntax
>> > >
>> > > Hope that helps. For all I know your cut and paste didn't work and I'm assuming you have syntax issues :)
>> > >
>> > > -Doug
>> > >
>> > > On Mon, May 18, 2015 at 2:25 PM, John Blythe <j...@curvolabs.com> wrote:
>> > >
>> > > > Hey Doug,
>> > > >
>> > > > Thanks for the quick reply.
>> > > >
>> > > > No edismax just yet.
>> > > > Planning on getting there, but have been trying to fine-tune the 3 primary fields we use over the last week or so before jumping into edismax and its nifty toolset to help push our accuracy and precision even further (aside: is this a good strategy?)
>> > > >
>> > > > For now I'm querying directly in the admin interface, doing something like this:
>> > > >
>> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt + productnumber: 001-029-1298
>> > > >
>> > > > versus
>> > > >
>> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt
>> > > >
>> > > > Another interesting and likely related factor is the description's lack of help. With the product number in place it gets nailed even with stray zeros, 4's instead of 1's, etc.
>> > > >
>> > > > Without it, though, the querying just flat out sucks. For instance, I just saw something akin to this:
>> > > >
>> > > > mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt
>> > > >
>> > > > that got nowhere near what it should have. Straw would have a synonym to map to strawberry and would match the document's description *exactly*, yet Solr would push out all sorts of peripheral suggestions that didn't match strawberry or were a different amount (.75pt, for instance). I know I'm no expert, but I was thinking my analyzer was a bit better than that :p
>> > > >
>> > > > On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
>> > > >
>> > > > > > The maxScore is 772 when I remove the description.
>> > > > > > I suppose the actual question, then, is if a low relevancy score on one field hurts the rest of them / the cumulative score,
>> > > > >
>> > > > > This depends a lot on how you're searching over these fields. Is this a (e)dismax query? Or a lucene query? Something else?
>> > > > >
>> > > > > Across fields there's query normalization, which attempts to take a sum of squares of IDFs of the search terms across the fields being searched. Adding/removing a field could impact query normalization.
>> > > > >
>> > > > > By removing a field, you also likely remove a boolean clause. By removing the clause, there's less of a chance the coordinating factor (known as coord) would punish your relevancy score.
>> > > > >
>> > > > > Otherwise, don't know -- perhaps you could give us more information on how you're searching your documents? Perhaps a sample Solr URL that shows how you're querying?
>> > > > >
>> > > > > Cheers,
>> > > > > --
>> > > > > *Doug Turnbull* | Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com
>> > > > > Author: Relevant Search <http://manning.com/turnbull> from Manning Publications
>> > > > > This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
>> > > > >
>> > > > > On Mon, May 18, 2015 at 1:57 PM, John Blythe <j...@curvolabs.com> wrote:
>> > > > >
>> > > > > > Background:
>> > > > > > I'm using Solr as a mechanism for search for users, but before even getting to that point as a means of intelligent inference more or less.
>> > > > > > Product data comes in and we're hoping to match it to the correct known product without having to use the user for confirmation/search.
>> > > > > >
>> > > > > > Problem:
>> > > > > > I get a maxScore (with the correct result at the top) of 618.22626 using the manufacturer's name, the product number, and the product description. All of these items are coming from a previous purchaser so we have to account for manufacturer name variations, miskeying of product numbers, and variances of descriptions. The maxScore is 772 when I remove the description.
>> > > > > >
>> > > > > > My initial question is regarding relevancy scoring (https://wiki.apache.org/solr/SolrRelevancyFAQ). I get that many of the description's tokens will be found throughout the other documents, thus keeping the relevancy at bay per the IDF portion of the relevancy score. I suppose the actual question, then, is if a low relevancy score on one field hurts the rest of them / the cumulative score, or if it simply keeps that field's contribution lower than it'd otherwise be. I thought it was the latter, but the results I mention above are making me think that the first scenario is actually the case.
>> > > > > >
>> > > > > > Based on what I hear about the above, a follow-up question may be what in the world is wrong with my analyzer :)
>> > > > > >
>> > > > > > Thanks for any thoughts!
>> > > > > >
>> > > > > > Best,
>> > > > > > John
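
For anyone finding this thread later, here is a minimal Python sketch of the grouping and escaping Doug suggests above: wrap each field's terms in parentheses, prefix required clauses with "+", and backslash-escape characters the Lucene/Solr standard query parser treats as operators. The helper names, the `products` core, and the localhost URL are assumptions made up for illustration; only the field names and sample values come from the thread. Treat it as a sketch, not a drop-in client.

```python
from urllib.parse import urlencode

# Characters with reserved meaning in the Lucene/Solr standard query parser.
# "&" and "|" are escaped individually, which also covers "&&" and "||".
RESERVED = set('+-&|!(){}[]^"~*?:\\/')


def escape_term(text):
    """Backslash-escape every reserved character in a user-supplied value."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


def field_clause(field, value, required=False):
    """Attach all terms of `value` to `field` by wrapping them in parentheses."""
    prefix = "+" if required else ""
    return f"{prefix}{field}:({escape_term(value)})"


def build_select_url(base_url, clauses):
    """Join the clauses into q= and URL-encode it.

    debugQuery=true asks Solr to include score explanations in the response.
    """
    params = {"q": " ".join(clauses), "debugQuery": "true", "wt": "json"}
    return base_url + "?" + urlencode(params)


clauses = [
    field_clause("mfgname2", "Ben & Jerry's"),
    field_clause("descript1", "Strawberry Shortcake Ice Cream 1.5pt", required=True),
    field_clause("productnumber", "001-029-1298", required=True),
]

# Hypothetical host and core name; substitute your own.
url = build_select_url("http://localhost:8983/solr/products/select", clauses)

print(clauses[2])
print(url)
```

The printed URL is the sort of thing you could open directly or paste into a tool like Splainer. Note that escaping happens before URL encoding: the backslashes are part of the query syntax, while the percent-encoding only protects the query in transit.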