Yeah, that is what I had in mind as a simple solution. For examining bigger result sets I always fear the cost of loading a lot of stored fields; that's why I thought including it in the scoring might be cool. It's not possible with a plain Collector that only maintains a priority queue of docid and score, but there should be some smart way to maintain a compact representation of the top-ranking items seen so far. Anyway, that's a different story.
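Just to make explicit what I mean by "plain": the only state such a score-only collector keeps is roughly the sketch below (plain Java, not the actual Lucene Collector API; the names are made up), a bounded min-heap of (docId, score) pairs, with nothing that an intra-list distance could be computed from without fetching more per-document data.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch only: the bookkeeping a score-only top-N collector maintains.
// Not the Lucene Collector API; the names here are illustrative.
class TopNByScore {

    // A collected hit; at this level nothing but these two values exists.
    record ScoredDoc(int docId, float score) {}

    private final int n;
    // Min-heap on score: the head is the weakest of the current top N.
    private final PriorityQueue<ScoredDoc> heap =
            new PriorityQueue<>(Comparator.comparingDouble(ScoredDoc::score));

    TopNByScore(int n) {
        this.n = n;
    }

    // Called once per matching document, analogous to collect(doc).
    void collect(int docId, float score) {
        if (heap.size() < n) {
            heap.add(new ScoredDoc(docId, score));
        } else if (score > heap.peek().score()) {
            heap.poll();
            heap.add(new ScoredDoc(docId, score));
        }
    }

    // The current top N, best first.
    List<ScoredDoc> topDocs() {
        List<ScoredDoc> docs = new ArrayList<>(heap);
        docs.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
        return docs;
    }
}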
Thanks for the help so far!

On Thu, May 23, 2013 at 2:14 AM, Ted Dunning <[email protected]> wrote:

> Yes what you are describing with diversification is something that I have
> called anti-flood. It comes from the fact that we really are optimizing a
> portfolio of recommendations rather than a batch of independent
> recommendations. Doing this from first principles is very hard but there
> are very simple heuristics that do well in many practical situations. For
> instance, simply penalizing the score of items based on how many items
> ranked above them that they are excessively close to does wonders.
>
> Sent from my iPhone
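To make sure I understand the anti-flood heuristic, here is a quick sketch of how I picture it (plain Java; the Item type, the similarity function and the two constants are placeholders I made up, nothing from Mahout or Solr): every item's score is reduced by a small penalty for each higher-ranked item it is excessively close to, and the page is re-sorted on the adjusted score.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.BiFunction;

// Sketch of the anti-flood penalty described above: an item's score is
// reduced for every higher-ranked item it is excessively close to, then
// the list is re-sorted. Item, similarity and the constants are made up.
class AntiFlood {

    static final double CLOSE = 0.8;    // similarity above this counts as "excessively close" (assumed)
    static final double PENALTY = 0.1;  // score penalty per near-duplicate ranked above (assumed)

    record Item(String id, double score) {}

    // rankedByScore must already be sorted best-first; similarity returns a
    // value in [0, 1], e.g. cosine over some item representation.
    static List<Item> rerank(List<Item> rankedByScore,
                             BiFunction<Item, Item, Double> similarity) {
        List<Item> adjusted = new ArrayList<>();
        for (int i = 0; i < rankedByScore.size(); i++) {
            Item current = rankedByScore.get(i);
            // How many items ranked above this one are too similar to it?
            long nearDuplicatesAbove = rankedByScore.subList(0, i).stream()
                    .filter(above -> similarity.apply(above, current) > CLOSE)
                    .count();
            adjusted.add(new Item(current.id(),
                    current.score() - PENALTY * nearDuplicatesAbove));
        }
        adjusted.sort(Comparator.comparingDouble(Item::score).reversed());
        return adjusted;
    }
}

The similarity could be anything cheap enough to compute for one page of results, e.g. cosine over item tf/idf vectors, which is of course where my stored-field cost worry from above comes back in.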
> On May 22, 2013, at 13:04, Johannes Schulte <[email protected]> wrote:
>
> > Okay, I got it! I also always have used a basic form of dithering, but we
> > always called it shuffle since it basically was / is Collections.shuffle on
> > a bigger list of results and therefore doesn't take the rank or score into
> > account. Will try that..
> >
> > With diversification I really meant more something that guarantees that you
> > maximize the intra-item distance (dithering often does, but not on purpose).
> > The theory is, if I remember correctly, that if a user goes over the list
> > from top to bottom and he didn't like the first item (which is assumed to
> > be true if he looks at the second item, no idea if we really work that
> > way), it makes sense to make a new shot with something completely
> > different, and so on for the next items. I think it's also a topic for
> > search engine results and I remember something like "lipschitz-bandits"
> > from my googling, but that is way above what I am capable of doing. I
> > just recognized that both Amazon and Netflix present multiple
> > recommendation lists grouped by category, so in a way it's similar to
> > search engine result clustering.
> >
> > On Wed, May 22, 2013 at 8:30 AM, Ted Dunning <[email protected]> wrote:
> >
> >> On Tue, May 21, 2013 at 10:34 PM, Johannes Schulte <[email protected]> wrote:
> >>
> >>> Thanks for the list...as a non-native speaker I got problems understanding
> >>> the meaning of dithering here.
> >>
> >> Sorry about that. Your English is good enough that I hadn't noticed any
> >> deficit.
> >>
> >> Dithering is constructive mixing of the recommendation results. The idea
> >> is to reorder the top results only slightly and the deeper results more so.
> >> There are several good effects and one (slightly) bad one.
> >>
> >> The good effects are:
> >>
> >> a) there is much less of a sharp boundary at the end of the first page of
> >> results. This makes the statistics of the recommender work better and also
> >> helps the recommender not get stuck recommending just the things which
> >> already appear on the first page.
> >>
> >> b) results that are very deep in the results can still be shown
> >> occasionally. This means that if the rec engine has even a hint that
> >> something is good, it has a chance of increasing the ranking by gathering
> >> more data. This is a bit different from (a).
> >>
> >> c) (bonus benefit) users like seeing novel things. Even if they have done
> >> nothing to change their recommendations, they like seeing that you have
> >> changed something, so they keep coming back to the recommendation page.
> >>
> >> The major bad effect is that you are purposely decreasing relevance in the
> >> short term in order to get more information that will improve relevance in
> >> the long term. The improvements dramatically outweigh this small problem.
> >>
> >>> I got the feeling that somewhere between a) and d) there is also
> >>> diversification of items in the recommendation list, so increasing the
> >>> distance between the list items according to some metric like tf/idf on
> >>> item information. Never tried that, but with Lucene / Solr it should be
> >>> possible to use this information during scoring..
> >>
> >> Yes. But no.
> >>
> >> This can be done at the presentation tier entirely. I often do it by
> >> defining a score based solely on rank, typically something like log(r). I
> >> add small amounts of noise to this synthetic score, often distributed
> >> exponentially with a small mean. Then I sort the results according to this
> >> sum.
> >>
> >> Here are some simulated results computed using R:
> >>
> >>> (order((log(r) - runif(500, max=2)))[1:20])
> >> [1] 1 2 3 6 5 4 14 9 8 10 7 17 11 15 13 22 28 12 20 39
> >> [1] 1 2 5 3 4 8 6 10 9 16 24 31 20 30 13 18 7 14 36 38
> >> [1] 3 1 5 2 10 4 8 7 14 21 19 26 29 13 27 15 6 12 33 9
> >> [1] 1 2 5 3 6 17 4 20 18 7 19 9 25 8 29 21 15 27 28 12
> >> [1] 1 2 5 3 7 4 8 11 9 15 10 6 33 37 17 27 36 16 34 38
> >> [1] 1 4 2 5 9 3 14 13 12 17 22 25 7 15 18 36 16 6 20 29
> >> [1] 1 3 4 7 2 6 5 12 18 17 13 24 27 10 8 20 14 34 9 46
> >> [1] 3 1 2 6 12 8 7 5 4 19 11 26 10 15 28 35 9 20 42 25
> >>
> >> As you can see, the first four results are commonly single digits. This
> >> comes about because the uniform noise that I have subtracted from the log
> >> can only make a difference of 2 to the log, which is equivalent to changing
> >> the rank by a factor of about 7. If we were to use different noise
> >> distributions we would get somewhat different kinds of perturbation. For
> >> instance, using exponentially distributed noise gives mostly tame results
> >> with some real surprises:
> >>
> >>> (order((log(r) - 0.3*rexp(500)))[1:20])
> >> [1] 1 2 3 8 4 5 9 6 7 25 14 11 13 24 10 31 34 12 22 21
> >> [1] 1 2 5 4 3 6 7 12 8 10 9 17 13 11 14 25 64 15 47 19
> >> [1] 1 2 3 4 5 6 7 10 8 9 11 21 13 12 15 16 14 25 18 33
> >> [1] 1 2 3 10 4 5 7 14 6 8 13 9 15 25 16 11 20 12 17 54
> >> [1] 1 3 2 4 7 5 6 8 11 23 9 32 18 10 13 15 12 48 14 19
> >> [1] 1 3 2 4 5 10 12 6 9 7 8 18 16 17 11 13 25 14 15 19
> >> [1] 6 1 2 4 3 5 9 11 7 15 8 10 14 12 19 16 13 25 39 18
> >> [1] 1 2 3 4 30 5 7 6 9 8 16 11 10 15 12 13 37 14 31 23
> >> [1] 1 2 3 4 9 16 5 6 8 7 10 13 11 17 15 19 12 20 14 26
> >> [1] 1 2 3 13 5 4 7 6 8 15 12 11 9 10 36 14 24 70 19 16
> >> [1] 1 2 6 3 5 4 11 22 7 9 250 8 10 15 12 17 13 40 16 14
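For my own notes, a rough Java translation of that presentation-tier dithering (assuming r in the R snippet is simply the ranks 1..500; the class and method names below are mine, not an existing API): the synthetic score is log(rank) minus exponentially distributed noise with a small mean, and the list is re-sorted on that score.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Sketch of presentation-tier dithering as described above:
// synthetic score = log(rank) - exponential noise, then re-sort.
// Mirrors order(log(r) - 0.3*rexp(500)) from the R example.
class Dithering {

    // Returns a dithered permutation of the ranks 1..n, analogous to the
    // R output rows above. scale is the mean of the exponential noise.
    static List<Integer> ditheredOrder(int n, double scale, Random rng) {
        double[] synthetic = new double[n + 1];
        List<Integer> ranks = new ArrayList<>();
        for (int rank = 1; rank <= n; rank++) {
            // Exponential noise with mean `scale` (inverse-CDF sampling),
            // subtracted from log(rank) exactly as in the R snippet.
            double noise = -scale * Math.log(1.0 - rng.nextDouble());
            synthetic[rank] = Math.log(rank) - noise;
            ranks.add(rank);
        }
        Comparator<Integer> bySyntheticScore =
                Comparator.comparingDouble(rank -> synthetic[rank]);
        ranks.sort(bySyntheticScore);
        return ranks;
    }

    public static void main(String[] args) {
        // One dithered page: top 20 of 500 candidates, noise mean 0.3.
        System.out.println(ditheredOrder(500, 0.3, new Random()).subList(0, 20));
    }
}

With a small noise mean the first few positions stay nearly fixed while deeper items surface only occasionally, which matches the simulated output above.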
> >>> Have a nice day
> >>>
> >>> On Wed, May 22, 2013 at 2:30 AM, Ted Dunning <[email protected]> wrote:
> >>>
> >>>> I have so far just used the weights that Solr applies natively.
> >>>>
> >>>> In my experience, what makes a recommendation engine work better is, in
> >>>> order of importance,
> >>>>
> >>>> a) dithering so that you gather wider data
> >>>>
> >>>> b) using multiple sources of input
> >>>>
> >>>> c) returning results quickly and reliably
> >>>>
> >>>> d) the actual algorithm or weighting scheme
> >>>>
> >>>> If you can cover items a-c in a real business, you are very lucky. The
> >>>> search engine approach handles (b) and (c) by nature, which massively
> >>>> improves the likelihood of ever getting to examine (d).
> >>>>
> >>>> On Tue, May 21, 2013 at 1:13 AM, Johannes Schulte <[email protected]> wrote:
> >>>>
> >>>>> Thanks! Could you also add how to learn the weights you talked about,
> >>>>> or at least a hint? Learning weights for search engine query terms
> >>>>> always sounds like "learning to rank" to me, but this always seemed
> >>>>> pretty complicated and I never managed to try it out..
