Here is the correct snapshot: http://static.inky.ws/image/937/image.jpg

If I read this correctly, FastIDSet.intersectionSize takes most of the time. I really think this could be improved dramatically if the following code in GenericBooleanPrefDataModel.getNumUsersWithPreferenceFor considered only users who have chosen more than one item:

    FastIDSet userIDs1 = preferenceForItems.get(itemID1);
    .....
    FastIDSet userIDs2 = preferenceForItems.get(itemID2);
    .....
    return userIDs1.intersectionSize(userIDs2);

@Sebastian - No, it's the other way around: ~8.5M users have chosen only a single item. The item's associated users number about 400k.

@Sean - Yes, my solution can solve the problem, but I feel that the optimization mentioned above could really boost performance, and I think it would benefit all of Mahout's users. What do you think?

On Thu, Dec 1, 2011 at 5:16 PM, Sebastian Schelter <[email protected]> wrote:

> If I remember correctly, you have 12M users and 18M interactions.
>
> If I interpret the plots correctly there is one single item that
> accounts for 8.5M interactions (nearly half of the overall interactions)
> and more than two thirds of the users like it?
>
> --sebastian
>
> On 01.12.2011 16:12, Sean Owen wrote:
> > You can 'tickle' the cache asynchronously if you like.
> >
> > I am still not clear on why you are doing so many item-item similarity
> > calculations. Your change ought to let you do 1, or 10, or 100 per
> > calculation if you like. That, we know, is fast. And a few hundred
> > similarities should start to give reasonable recommendations.
> >
> > What is preventing you from making this tradeoff (with your change)?
> > Yes, it is essential for reasonable performance.
> >
> > On Thu, Dec 1, 2011 at 3:06 PM, Daniel Zohar <[email protected]> wrote:
> >
> >> Hi Manuel,
> >> I haven't got to the point where CacheItemSimilarity kicks in. That is, I
> >> will have to run a lot of recommendations in order to get a real benefit
> >> from it. I would first like to optimize the 'cold start' so that it at least
> >> serves in reasonable time. Usually a cache is used to prevent repeated
> >> calculations, but personally I don't think it's a replacement for optimized
> >> performance. Don't you agree?
> >>
> >> Also, I will try to profile the app now as you suggest and send the results
> >> asap.
> >>
> >> Thanks!
> >>
> >> On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt <[email protected]> wrote:
> >>
> >>> Hi Daniel,
> >>> actually you are running the profile inside Tomcat. You should take a
> >>> snapshot and then drill down to the functions where the actual
> >>> recommendation takes place. The current screenshots also contain some
> >>> profiles from Tomcat threads which are sleeping a lot and therefore taking
> >>> a lot of time.
> >>>
> >>> Further, the screenshots do not show how often the
> >>> different functions are called.
> >>>
> >>> You have to profile multiple requests alone. The CacheItemSimilarity gets
> >>> filled, therefore it should go faster and faster.
> >>>
> >>> On 01.12.2011, at 15:11, Daniel Zohar wrote:
> >>>
> >>>> @Manuel thanks for the tips. I have installed VisualVM and the following
> >>>> are the results.
> >>>> I did two sampling runs:
> >>>> - With the optimized SamplingCandidateItemsStrategy
> >>>>   (http://pastebin.com/6n9C8Pw1): http://static.inky.ws/image/934/image.jpg
> >>>> - Without the optimized SamplingCandidateItemsStrategy:
> >>>>   http://static.inky.ws/image/935/image.jpg
> >>>
> >>> The big hot spot is the function FastIDSet.find():
> >>>
> >>> Optimized: 13,759 s
> >>> Unoptimized: 246,487 s
> >>>
> >>> So you see that your optimization already got you a performance boost of
> >>> 2000%.
> >>>
> >>> Did you play around with the CacheItemSimilarity cache sizes?
> >>>
> >>> /Manuel
> >>>
> >>> --
> >>> Manuel Blechschmidt
> >>> Dortustr. 57
> >>> 14467 Potsdam
> >>> Mobil: 0173/6322621
> >>> Twitter: http://twitter.com/Manuel_B
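
P.S. To make the proposed optimization concrete, here is a minimal sketch in plain Java that does not depend on Mahout. It assumes java.util sets as stand-ins for FastIDSet, and the class and method names (CoOccurrenceSketch, multiItemUsersByItem, intersectionSize) are hypothetical, not part of Mahout's API. The idea: a user who chose only one item can never appear in the intersection of the user sets of two different items, so such users can be left out of the sets that get intersected.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CoOccurrenceSketch {

    // Builds an item -> users index that skips users with fewer than two items.
    // Such users cannot contribute to the intersection of two distinct items.
    // (Hypothetical helper; java.util sets stand in for Mahout's FastIDSet.)
    static Map<Long, Set<Long>> multiItemUsersByItem(Map<Long, Set<Long>> itemsByUser) {
        Map<Long, Set<Long>> usersByItem = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> entry : itemsByUser.entrySet()) {
            if (entry.getValue().size() < 2) {
                continue; // single-item user: irrelevant for co-occurrence counts
            }
            for (Long itemID : entry.getValue()) {
                usersByItem.computeIfAbsent(itemID, k -> new HashSet<>()).add(entry.getKey());
            }
        }
        return usersByItem;
    }

    // Counts common users by iterating the smaller set and probing the larger one.
    static int intersectionSize(Set<Long> a, Set<Long> b) {
        if (a == null || b == null) {
            return 0;
        }
        Set<Long> small = a.size() <= b.size() ? a : b;
        Set<Long> large = (small == a) ? b : a;
        int count = 0;
        for (Long id : small) {
            if (large.contains(id)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<Long, Set<Long>> itemsByUser = new HashMap<>();
        itemsByUser.put(1L, new HashSet<>(Arrays.asList(10L, 20L)));
        itemsByUser.put(2L, new HashSet<>(Arrays.asList(10L)));       // single-item user
        itemsByUser.put(3L, new HashSet<>(Arrays.asList(10L, 20L, 30L)));

        Map<Long, Set<Long>> index = multiItemUsersByItem(itemsByUser);
        // Users 1 and 3 chose both item 10 and item 20; user 2 was dropped.
        System.out.println(intersectionSize(index.get(10L), index.get(20L))); // prints 2
    }
}
```

One caveat with this sketch: the filtered index would only be suitable for the two-item intersection. Per-item user counts, which some similarity measures also need, would still have to come from the unfiltered data.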
