I agree with Ted that users with many preferences should be down-sampled. I think that if we do go with #1, #2 and #3, then there's not much point in #4. We just have to make sure that the final size of possibleItemIDs stays under control, so that the performance bottleneck is eliminated. Another issue to take into account is not to down-sample too aggressively, so that users with 1-2 preferences still get decent results.
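To make that concrete, here is a rough sketch of the kind of per-user cap I have in mind; the class and method names are hypothetical, not Mahout's actual API. It never touches users at or below the cap, so users with 1-2 preferences keep everything:

    import java.util.Random;

    // Sketch: cap a user's preferences at maxPrefsPerUser ("N" in Ted's #1),
    // leaving low-activity users untouched.
    final class PreferenceDownSampler {

      private final int maxPrefsPerUser;
      private final Random random = new Random();

      PreferenceDownSampler(int maxPrefsPerUser) {
        this.maxPrefsPerUser = maxPrefsPerUser;
      }

      // Users with <= maxPrefsPerUser items are returned unchanged;
      // larger users get a uniform random sample via reservoir sampling.
      long[] downSample(long[] itemIDs) {
        if (itemIDs.length <= maxPrefsPerUser) {
          return itemIDs;  // users with 1-2 preferences are never down-sampled
        }
        long[] reservoir = new long[maxPrefsPerUser];
        System.arraycopy(itemIDs, 0, reservoir, 0, maxPrefsPerUser);
        for (int i = maxPrefsPerUser; i < itemIDs.length; i++) {
          int j = random.nextInt(i + 1);  // keep each item with decreasing probability
          if (j < maxPrefsPerUser) {
            reservoir[j] = itemIDs[i];
          }
        }
        return reservoir;
      }
    }

Capping at N also bounds the pair-generation cost Ted describes below: the pairs emitted for a user with n items drop from n*(n-1)/2 to at most N*(N-1)/2.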
On Sun, Dec 4, 2011 at 11:01 PM, Ted Dunning <[email protected]> wrote:

> Sean,
>
> You can also do #1. That is what I have used in the past and what I
> recommend. That achieves a large part of #2, but what is most important is
> that it *directly* addresses the key cost factor in off-line
> recommendations, since the number of item pairs emitted is proportional to
> the sum of the number of items squared for each user.
>
> Specifically, I think that each user should have at most N items, and if
> they have more, the number they have should be down-sampled to the point
> that they have N.
>
> I also think that there are some cases where strategy #2 is important even
> if #1 is implemented.
>
> If #1 and #2 are done, then it is a matter of convenience to limit the
> number of items in each row of the item-item matrix. This is #4, which I
> endorse and which Sebastian has endorsed.
>
> On Sun, Dec 4, 2011 at 5:42 AM, Sean Owen <[email protected]> wrote:
>
> > To talk about this clearly, let me go back to my example and add to it:
> >
> > ---
> > Say we're recommending for user A. User A is connected to items 1, 2, 3.
> > Those items are connected to other users X, Y, Z. And those users in turn
> > are connected to items 100, 101, 102, 103.... You can down-sample three
> > things:
> >
> > 1. The 1,2,3
> > 2. The X,Y,Z
> > 3. The 100,101,102
> > 4. ... the result of down-sampling 1-3, again
> > ---
> >
> > The current implementation samples #2. My proposal samples #2 and #3.
> > Sebastian's samples #3. Your proposal does #2 and #4. I believe that
> > doing all 4 is redundant. You probably need to do at least #2 and #3 to
> > avoid the prolific-user and prolific-item problem.
> >
> > The reason you are still seeing a fair number of IDs is that #1 is not
> > also sampled, in my implementation.
> >
> > I suggest that we still have one solution for this, since it's all
> > small variants on the same theme, and let's make it
> > SamplingCandidateItemStrategy.
> >
> > To me, the remaining question is just: which of these 4 do you want to
> > do? I suggest 2, 3, and maybe 1.
> > Follow-on question: should we make separately settable limits for each,
> > or does this get complex without much use?
> >
> > On Sun, Dec 4, 2011 at 1:04 PM, Daniel Zohar <[email protected]> wrote:
> >
> > > I assume the parameter does not affect the possibleItemIDs because of
> > > the following line:
> > >
> > > max = (int) Math.max(defaultMaxPrefsPerItemConsidered,
> > >     userItemCountMultiplier *
> > >     Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems())));
> > >
> > > On Sun, Dec 4, 2011 at 2:59 PM, Daniel Zohar <[email protected]> wrote:
> > >
> > > > Sean, your impl. is indeed better than mine, but for some reason when
> > > > I ran it for a user with a lot of interactions, I got 2023
> > > > possibleItemIDs (although I used 10,2 in the constructor).
> > > >
> > > > Sebastian, I will also try to experiment with your patch. I would
> > > > just like to add that, in my opinion, as long as 'killing items' has
> > > > to be done manually, it is not scalable by definition. I personally
> > > > would always prefer to avoid these kinds of solutions. Also, in my
> > > > case, only 3% of the users interacted with the most popular item, so
> > > > I suppose that's not exactly the case here either.
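For reference, a rough sketch of how sampling at Sean's points #2 and #3 could compose into possibleItemIDs. The Graph interface and its method names here are hypothetical stand-ins for the DataModel calls, not the real API:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    // Sketch: assemble possibleItemIDs for user A, sampling at points #2 and #3.
    final class SampledCandidateItems {

      // Hypothetical stand-in for the data model; not the real interface.
      interface Graph {
        List<Long> itemsOfUser(long userID);
        List<Long> usersOfItem(long itemID);
      }

      private final Random random = new Random();

      // Uniform sample of at most max IDs (shuffle-and-truncate, for brevity).
      private List<Long> sampleTo(List<Long> ids, int max) {
        if (ids.size() <= max) {
          return ids;
        }
        List<Long> copy = new ArrayList<>(ids);
        Collections.shuffle(copy, random);
        return copy.subList(0, max);
      }

      Set<Long> possibleItemIDs(Graph graph, long userA,
                                int maxUsersPerItem, int maxItemsPerUser) {
        Set<Long> candidates = new LinkedHashSet<>();
        for (long item : graph.itemsOfUser(userA)) {               // #1: not sampled here
          for (long user : sampleTo(graph.usersOfItem(item), maxUsersPerItem)) {    // #2
            candidates.addAll(sampleTo(graph.itemsOfUser(user), maxItemsPerUser));  // #3
          }
        }
        candidates.removeAll(graph.itemsOfUser(userA));  // exclude A's own items
        return candidates;
      }
    }

With maxUsersPerItem and maxItemsPerUser as separate parameters, Sean's follow-on question about separately settable limits comes down to how many constructor arguments we are willing to expose.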
