Hi guys,

@Sean, you are obviously right that reducing the cap limit would yield
better performance. However, I believe it would also yield worse accuracy:
the more items a user has interacted with, the smaller the percentage of
the actual possible items that the capped set covers.

@Ted, your approach is also good, and I will now think about how to
integrate it into my solution.

I just ran the fix I proposed earlier and got great results! Query time
for the 'heavy users' was reduced to about a third: it was 1-5 secs
before, and now it's 0.5-1.5 secs. The best part is that the accuracy
level should remain exactly the same. I also believe it should reduce
memory consumption, as GenericBooleanPrefDataModel.preferenceForItems gets
significantly smaller (in my case, at least).

The fix is merely adding two lines of code to one of
the GenericBooleanPrefDataModel constructors. See
http://pastebin.com/K5PB68Et; the lines I added are #11 and #22.
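
Roughly, the idea looks like this (a sketch only, not the exact pastebin
code; the helper below is illustrative and assumes Mahout's
FastByIDMap/FastIDSet API, while the real change sits inside the existing
constructor): when building the itemID -> userIDs index, simply skip users
that have only a single interaction, so they never end up in
preferenceForItems at all.

import java.util.Map;

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;

public final class SingleInteractionUserFilter {

  private SingleInteractionUserFilter() {
  }

  // Builds the itemID -> userIDs index the same way the constructor does,
  // except that single-interaction users are skipped entirely.
  public static FastByIDMap<FastIDSet> buildPreferenceForItems(
      FastByIDMap<FastIDSet> userData) {
    FastByIDMap<FastIDSet> preferenceForItems = new FastByIDMap<FastIDSet>();
    for (Map.Entry<Long, FastIDSet> entry : userData.entrySet()) {
      FastIDSet itemIDs = entry.getValue();
      if (itemIDs.size() <= 1) {
        continue; // the added guard: ignore single-interaction users here
      }
      long userID = entry.getKey();
      LongPrimitiveIterator it = itemIDs.iterator();
      while (it.hasNext()) {
        long itemID = it.nextLong();
        FastIDSet userIDs = preferenceForItems.get(itemID);
        if (userIDs == null) {
          userIDs = new FastIDSet(2);
          preferenceForItems.put(itemID, userIDs);
        }
        userIDs.add(userID);
      }
    }
    return preferenceForItems;
  }
}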

The only problem I see at the moment is that the similarity
implementations use the number of users per item in the item-item
similarity calculation. This _can_ be mitigated by creating an additional
Map in the DataModel which maps itemID to numUsers.
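
Something along these lines, again just a sketch with an illustrative
class name and the same FastByIDMap/FastIDSet assumptions; the real map
would live inside the DataModel so that the per-item user counts the
similarity implementations rely on stay correct even though
single-interaction users are gone from preferenceForItems:

import java.util.Map;

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;

public final class ItemUserCounts {

  private ItemUserCounts() {
  }

  // Counts, over ALL users (including the single-interaction ones that the
  // trimmed preferenceForItems no longer contains), how many users touched
  // each item.
  public static FastByIDMap<Integer> countUsersPerItem(
      FastByIDMap<FastIDSet> userData) {
    FastByIDMap<Integer> numUsersPerItem = new FastByIDMap<Integer>();
    for (Map.Entry<Long, FastIDSet> entry : userData.entrySet()) {
      LongPrimitiveIterator it = entry.getValue().iterator();
      while (it.hasNext()) {
        long itemID = it.nextLong();
        Integer count = numUsersPerItem.get(itemID);
        numUsersPerItem.put(itemID, count == null ? 1 : count + 1);
      }
    }
    return numUsersPerItem;
  }
}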

What do you think about the proposed solution? Perhaps I am missing some
other implications?

Thanks!


On Fri, Dec 2, 2011 at 12:51 AM, Sean Owen <[email protected]> wrote:

> (Agree, and the sampling happens at the user level now -- so if you sample
> one of these users, it slows down a lot. The spirit of the proposed change
> is to make sampling more fine-grained, at the individual item level. That
> seems to certainly fix this.)
>
> On Thu, Dec 1, 2011 at 10:46 PM, Ted Dunning <[email protected]>
> wrote:
>
> > This may or may not help much.  My guess is that the improvement will be
> > very modest.
> >
> > The most serious problem is going to be recommendations for anybody who
> > has rated one of these excessively popular items.  That item will bring
> > in a huge number of other users and thus a huge number of items to
> > consider.  If you down-sample ratings of the prolific users and kill
> > super-common items, I think you will see much more improvement than
> > simply eliminating the singleton users.
> >
> > The basic issue is that cooccurrence based algorithms have run-time
> > proportional to O(n_max^2) where n_max is the maximum number of items per
> > user.
> >
> > On Thu, Dec 1, 2011 at 2:35 PM, Daniel Zohar <[email protected]> wrote:
> >
> > > This is why I'm looking now into improving GenericBooleanPrefDataModel
> > > to not take into account users which made one interaction under the
> > > 'preferenceForItems' Map. What do you think about this approach?
> > >
> >
>
