I work with a similar catalog, except our data is especially bad. We've
found that several things helped:

- Item-level grouping (group the same item sold by multiple vendors). Rank
items with more vendors a bit higher.
- Include a boost function for other attributes, such as an original image
of the product.
- Rank items a bit higher if they have data from an external catalog like
IceCat.
- For relevance and performance, we have several fields that we copy data
into. High value fields get copied into a high weighted field, while lower
value fields like description get copied into a lower weighted field. These
fields are the backbone of our qf parameter, with other fields adding
additional boost.
- Play around with the tie parameter for edismax; we found that it makes
quite a big difference.
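
To make the last two points concrete, here is a rough sketch in Python, not
our actual config. The field names (text_high, text_low, vendor_count,
has_original_image) and all the weights are made up for illustration; only
the edismax parameter names (defType, qf, tie, bf) are real. The
dismax_score function just illustrates the usual disjunction-max rule: the
best-matching field's score plus tie times the sum of the other fields'
scores.

```python
from urllib.parse import urlencode


def dismax_score(field_scores, tie):
    """Illustrate how tie blends per-field scores in a dismax-style query:
    best field score + tie * (sum of the remaining field scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def build_params(user_query, tie=0.1):
    """Build edismax request parameters.

    All field names and weights below are hypothetical placeholders.
    """
    return {
        "defType": "edismax",
        "q": user_query,
        # Two copy-field targets form the backbone of qf; brand adds a
        # small extra boost on top.
        "qf": "text_high^5 text_low^1 brand^2",
        "tie": str(tie),
        # Additive boost functions: more vendors and an original image
        # nudge an item up a little.
        "bf": [
            "log(sum(vendor_count,1))",
            "if(has_original_image,0.5,0)",
        ],
    }


if __name__ == "__main__":
    # Hypothetical per-field scores for a single document.
    scores = [2.0, 1.0, 0.5]
    for tie in (0.0, 0.1, 1.0):
        print(tie, dismax_score(scores, tie))
    # doseq=True emits one bf=... pair per boost function.
    print(urlencode(build_params("apple ipod"), doseq=True))
```

With tie=0 only the best-matching field counts, and with tie=1 the field
scores are simply summed, which is why small changes to it can reorder
results so much.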

Hope this helps.

On Fri, Mar 18, 2016 at 6:19 AM, Alessandro Benedetti <abenede...@apache.org> wrote:

> In a relevancy problem I would repeat what my colleagues already pointed
> out:
> Data is key. We first need to understand our data before we can
> understand what is relevant and what is not.
> Then we can establish a baseline that makes sense (your basic approach,
> plus proper schema configuration as suggested and a properly configured
> request handler, seems a good start to me).
>
> At this point, if you are still not happy with the relevancy (i.e. you are
> not happy with the different boosts you assigned), my strongest suggestion
> is to move to machine learning.
> You need a good amount of data to feed the learner (and make it your Super
> Business Expert).
> I have recently been working with the Learn To Rank Bloomberg Plugin [1].
> In my opinion it will be key for all businesses that have many features in
> play that can help to evaluate a proper ranking.
> For that you need to be able to collect and process signals, and you need
> to carefully tune the features of interest.
> But the results could be surprising.
>
> [1] https://issues.apache.org/jira/browse/SOLR-8542
> [2] Learning to Rank in Solr <https://www.youtube.com/watch?v=M7BKwJoh96s>
>
> Cheers
>
> On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown <r...@intelcompute.com>
> wrote:
>
> > Thanks Scott and John,
> >
> > As luck would have it I've got a PhD graduate coming for an interview
> > today, who just happened to do her research thesis on information
> > retrieval with quantum theory and machine learning  :)
> >
> > John, it sounds like you're describing my system!  Shopping products from
> > multiple sources.  (De-duplication is going to be fun soon).
> >
> > I already copy fields like merchant, brand, category, to string fields to
> > use them as facets/filters.  I was contemplating removing the description
> > due to the spammy issue you mentioned. I didn't know about the
> > RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a huge
> > help.
> >
> > Thanks a lot,
> > Rob
> >
> >
> >
> > On 03/17/2016 10:01 AM, John Smith wrote:
> >
> >> Hi,
> >>
> >> For once I might be of some help: I've had a similar configuration
> >> (large set of products from various sources). It's very difficult to
> >> find the right balance between all parameters and requires a lot of
> >> tweaking, most often in the dark unfortunately.
> >>
> >> What I've found is that omitNorms=true is a real breakthrough: without
> >> it results tend to favor small texts, which is not what's wanted for
> >> product names. I also added a RemoveDuplicatesTokenFilterFactory for the
> >> name as it's a common practice for spammers to repeat some key words in
> >> order to be better placed in results. Stemming and custom stop words
> >> (e.g. "cheap", "sale", ...) are other potential ideas.
> >>
> >> I've also ended up removing the description field as it's often too
> >> broad, and name is now the only field left: brand, category and merchant
> >> (as well as other fields) are offered as additional filters using
> >> facets. Note that you'd have to re-index them as plain strings.
> >>
> >> It's more difficult to achieve but popularity boost can also be useful:
> >> you can measure it by sales or by number of clicks. I use a combination
> >> of both, and store those values using partial updates.
> >>
> >> Hope it helps,
> >> John
> >>
> >>
> >> On 17/03/16 09:36, Robert Brown wrote:
> >>
> >>> Hi,
> >>>
> >>> I currently have an index of ~50m docs representing shopping products:
> >>> name, description, brand, category, etc.
> >>>
> >>> Our "qf" is currently set up as:
> >>>
> >>> name^5
> >>> brand^2
> >>> category^3
> >>> merchant^2
> >>> description^1
> >>>
> >>> mm: 100%
> >>> ps: 5
> >>>
> >>> I'm getting complaints from the business concerning relevancy, and was
> >>> hoping to get some constructive ideas/thoughts on whether these boosts
> >>> look semi-sensible or not. I think they were put in place pretty much
> >>> at random.
> >>>
> >>> I know it's going to be a case of rounds upon rounds of testing, but
> >>> maybe there's a good starting point that will save me some time?
> >>>
> >>> My initial thoughts right now are to actually just search on the name
> >>> field, and maybe the brand (for things like "Apple Ipod").
> >>>
> >>> Has anyone got a similar setup that could share some direction?
> >>>
> >>> Many Thanks,
> >>> Rob
> >>>
> >>>
> >
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>
