I work with a similar catalog, except our data is especially bad. We've found that several things helped:
- Item-level grouping (group the same item sold by multiple vendors). Rank items with more vendors a bit higher.
- Include a boost function for other attributes, such as an original image of the product.
- Rank items a bit higher if they have data from an external catalog like IceCat.
- For relevance and performance, we have several fields that we copy data into. High-value fields get copied into a highly weighted field, while lower-value fields like description get copied into a lower-weighted field. These fields are the backbone of our qf parameter, with other fields adding additional boost.
- Play around with the tie parameter for edismax; we found that it makes quite a big difference.

Hope this helps.

On Fri, Mar 18, 2016 at 6:19 AM, Alessandro Benedetti <abenede...@apache.org> wrote:

> In a relevancy problem I would repeat what my colleagues already pointed
> out:
> Data is key. We need to understand our data first of all, before we can
> understand what is relevant and what is not.
> Then we need to establish a baseline that makes sense (your basic approach +
> proper schema configuration as suggested + a properly configured request
> handler seems a good start to me).
>
> At this point, if you are still not happy with the relevancy (i.e. you are
> not happy with the different boosts you assigned), my strongest suggestion
> is to move to machine learning.
> You need a good amount of data to feed the learner (and make it your Super
> Business Expert).
> I have recently been working with the Learn To Rank Bloomberg Plugin [1].
> In my opinion it will be key for all businesses that have many features in
> play that can help to evaluate a proper ranking.
> For that you need to be able to collect and process signals, and you need
> to carefully tune the features of interest to you.
> But the results could be surprising.
>
> [1] https://issues.apache.org/jira/browse/SOLR-8542
> [2] Learning to Rank in Solr <https://www.youtube.com/watch?v=M7BKwJoh96s>
>
> Cheers
>
> On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown <r...@intelcompute.com>
> wrote:
>
> > Thanks Scott and John,
> >
> > As luck would have it I've got a PhD graduate coming for an interview
> > today, who just happened to do her research thesis on information retrieval
> > with quantum theory and machine learning :)
> >
> > John, it sounds like you're describing my system! Shopping products from
> > multiple sources. (De-duplication is going to be fun soon).
> >
> > I already copy fields like merchant, brand, category, to string fields to
> > use them as facets/filters. I was contemplating removing the description
> > due to the spammy issue you mentioned, I didn't know about the
> > RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a huge
> > help.
> >
> > Thanks a lot,
> > Rob
> >
> >
> > On 03/17/2016 10:01 AM, John Smith wrote:
> >
> >> Hi,
> >>
> >> For once I might be of some help: I've had a similar configuration
> >> (large set of products from various sources). It's very difficult to
> >> find the right balance between all parameters and requires a lot of
> >> tweaking, most often in the dark unfortunately.
> >>
> >> What I've found is that omitNorms=true is a real breakthrough: without
> >> it results tend to favor small texts, which is not what's wanted for
> >> product names. I also added a RemoveDuplicatesTokenFilterFactory for the
> >> name as it's a common practice for spammers to repeat some key words in
> >> order to be better placed in results. Stemming and custom stop words
> >> (e.g. "cheap", "sale", ...) are other potential ideas.
> >>
> >> I've also ended up in removing the description field as it's often too
> >> broad, and name is now the only field left: brand, category and merchant
> >> (as well as other fields) are offered as additional filters using
> >> facets.
> >> Note that you'd have to re-index them as plain strings.
> >>
> >> It's more difficult to achieve but popularity boost can also be useful:
> >> you can measure it by sales or by number of clicks. I use a combination
> >> of both, and store those values using partial updates.
> >>
> >> Hope it helps,
> >> John
> >>
> >>
> >> On 17/03/16 09:36, Robert Brown wrote:
> >>
> >>> Hi,
> >>>
> >>> I currently have an index of ~50m docs representing shopping products:
> >>> name, description, brand, category, etc.
> >>>
> >>> Our "qf" is currently setup as:
> >>>
> >>> name^5
> >>> brand^2
> >>> category^3
> >>> merchant^2
> >>> description^1
> >>>
> >>> mm: 100%
> >>> ps: 5
> >>>
> >>> I'm getting complaints from the business concerning relevancy, and was
> >>> hoping to get some constructive ideas/thoughts on whether these boosts
> >>> look semi-sensible or not, I think they were put in place pretty much
> >>> at random.
> >>>
> >>> I know it's going to be a case of rounds upon rounds of testing, but
> >>> maybe there's a good starting point that will save me some time?
> >>>
> >>> My initial thoughts right now are to actually just search on the name
> >>> field, and maybe the brand (for things like "Apple Ipod").
> >>>
> >>> Has anyone got a similar setup that could share some direction?
> >>>
> >>> Many Thanks,
> >>> Rob
> >>>
> >
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
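
P.S. To make the qf and tie suggestions at the top of this message a bit more concrete, here is a rough Python sketch, not our actual configuration. The field names text_high, text_low and vendor_count are made up for illustration (they stand in for the copy-target fields and the vendor count mentioned above), and the weights are placeholders. The dismax_score function only illustrates the general disjunction-max rule (best field score plus tie times the sum of the other field scores), so you can see why tie shifts results between "best single field wins" and "all matching fields add up".

```python
from urllib.parse import urlencode


def dismax_score(field_scores, tie):
    """Combine per-field scores the way a disjunction-max query does:
    the best field's score plus `tie` times the sum of the other scores."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def build_params(user_query, tie=0.1):
    """Build edismax request parameters as a plain dict.

    All field names below are hypothetical placeholders.
    """
    return {
        "defType": "edismax",
        "q": user_query,
        # text_high / text_low: assumed copy-target fields holding
        # high-value and low-value text respectively.
        "qf": "text_high^5 text_low^1",
        "tie": str(tie),
        # Additive boost function on an assumed numeric field that counts
        # how many vendors sell the grouped item.
        "bf": "log(sum(vendor_count,1))",
    }


if __name__ == "__main__":
    # Query string that could be appended to a /select request.
    print(urlencode(build_params("apple ipod")))
    # Same per-field scores, different tie values.
    for tie in (0.0, 0.1, 1.0):
        print(tie, dismax_score([2.0, 1.0, 0.5], tie))
```

With tie=0 only the best-matching field counts; with tie=1 the field scores are simply summed. Values in between are what we ended up experimenting with.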