Re: Mahout performance issues

Daniel Zohar Thu, 01 Dec 2011 07:24:54 -0800

Here is the correct snapshot - http://static.inky.ws/image/937/image.jpg
If I read this correctly, FastIDSet.intersectionSize takes most of the
time.
I really think this could be improved dramatically if the following code
in GenericBooleanPrefDataModel.getNumUsersWithPreferenceFor considered
only users who have expressed a preference for more than one item:


FastIDSet userIDs1 = preferenceForItems.get(itemID1);

.....

FastIDSet userIDs2 = preferenceForItems.get(itemID2);

.....

return userIDs1.intersectionSize(userIDs2);
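
The idea above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not Mahout code: the class
PrunedItemUserIndex and its methods are hypothetical names, and plain
java.util collections stand in for FastIDSet and FastByIDMap. The sketch
drops users with a single item when the item-to-users index is built, so the
sets being intersected are much smaller. For two different items the count is
unchanged, because a user with only one item can never appear in both sets.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Mahout): item -> users index without single-item users. */
public class PrunedItemUserIndex {

    /** Builds item -> users, keeping only users with preferences for at least two items. */
    public static Map<Long, Set<Long>> build(Map<Long, Set<Long>> itemsByUser) {
        Map<Long, Set<Long>> usersByItem = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> entry : itemsByUser.entrySet()) {
            if (entry.getValue().size() < 2) {
                // A single-item user cannot be in the intersection of two different items.
                continue;
            }
            for (Long itemID : entry.getValue()) {
                usersByItem.computeIfAbsent(itemID, k -> new HashSet<>()).add(entry.getKey());
            }
        }
        return usersByItem;
    }

    /**
     * Counts users with preferences for both items, iterating over the smaller set.
     * Only valid for itemID1 != itemID2; per-item totals need the unpruned index.
     */
    public static int numUsersWithPreferenceFor(Map<Long, Set<Long>> usersByItem,
                                                long itemID1, long itemID2) {
        Set<Long> users1 = usersByItem.get(itemID1);
        Set<Long> users2 = usersByItem.get(itemID2);
        if (users1 == null || users2 == null) {
            return 0;
        }
        Set<Long> smaller = users1.size() <= users2.size() ? users1 : users2;
        Set<Long> larger = (smaller == users1) ? users2 : users1;
        int count = 0;
        for (Long userID : smaller) {
            if (larger.contains(userID)) {
                count++;
            }
        }
        return count;
    }
}
```

One caveat with this approach: the pruned index no longer gives the true
number of users per item, so anything that needs that total (for example the
single-item variant of getNumUsersWithPreferenceFor) would still have to use
the full data.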

@Sebastian - No, it's the other way around: ~8.5M users have chosen only a
single item. The item with the most associated users has about 400k.
@Sean, yes, my solution solves the problem, but I feel that the
optimization mentioned above could really boost performance, and I think
it would benefit all of Mahout's users. What do you think?


On Thu, Dec 1, 2011 at 5:16 PM, Sebastian Schelter <[email protected]> wrote:

> If I remember correctly, you have 12M users and 18M interactions.
>
> If I interpret the plots correctly there is one single item that
> accounts for 8.5M interactions (nearly half of the overall interactions)
> and more than two thirds of the users like it?
>
> --sebastian
>
> On 01.12.2011 16:12, Sean Owen wrote:
> > You can 'tickle' the cache asynchronously if you like.
> >
> > I am still not clear on why you are doing so many item-item similarity
> > calculations. Your change ought to let you do 1, or 10, or 100 per
> > calculation if you like. That, we know, is fast. And a few hundred
> > similarities should start to give reasonable recommendations.
> >
> > What is preventing you from making this tradeoff (with your change)?
> > Yes, it is essential for reasonable performance.
> >
> > On Thu, Dec 1, 2011 at 3:06 PM, Daniel Zohar <[email protected]> wrote:
> >
> >> Hi Manuel,
> >> I haven't got to the point where CacheItemSimilarity kicks in. That is, I
> >> will have to run a lot of recommendations in order to get a real benefit
> >> from it. I would first like to optimize the 'cold start' so that it at
> >> least serves in a reasonable time. Usually a cache is used to prevent
> >> repeated calculations, but personally I don't think it's a replacement
> >> for optimized performance. Don't you agree?
> >>
> >> Also, I will try to profile the app now as you suggest and send the
> >> results asap.
> >>
> >> Thanks!
> >>
> >> On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt <
> >> [email protected]> wrote:
> >>
> >>> Hi Daniel,
> >>> actually you are running the profiler inside Tomcat. You should take a
> >>> snapshot and then drill down to the functions where the actual
> >>> recommendation takes place. The current screenshots also contain some
> >>> profiles from Tomcat threads which are sleeping a lot and therefore
> >>> appear to take a lot of time.
> >>>
> >>> Further, the screenshots do not show how often the different functions
> >>> are called.
> >>>
> >>> You have to profile multiple requests separately. The CacheItemSimilarity
> >>> gets filled, so it should get faster and faster.
> >>>
> >>> On 01.12.2011, at 15:11, Daniel Zohar wrote:
> >>>
> >>>> @Manuel thanks for the tips. I have installed VisualVM and the following
> >>>> are the results.
> >>>> I did two samplings:
> >>>> - With the optimized SamplingCandidateItemsStrategy
> >>>> (http://pastebin.com/6n9C8Pw1): http://static.inky.ws/image/934/image.jpg
> >>>> - Without the optimized SamplingCandidateItemsStrategy:
> >>>> http://static.inky.ws/image/935/image.jpg
> >>>>
> >>>
> >>> The big hot spot is the function FastIDSet.find():
> >>>
> >>> Optimized: 13,759 s
> >>> Unoptimized: 246,487 s
> >>>
> >>> So you see that your optimization already got you a performance boost
> >>> of roughly 18x.
> >>>
> >>> Did you play around with the CacheItemSimilarity cache sizes?
> >>>
> >>> /Manuel
> >>>
> >>> --
> >>> Manuel Blechschmidt
> >>> Dortustr. 57
> >>> 14467 Potsdam
> >>> Mobil: 0173/6322621
> >>> Twitter: http://twitter.com/Manuel_B
> >>>
> >>>
> >>
> >
>
>
