OK, you're already pruning a fair bit then, in the sense that you keep the top 50 similarities (by absolute value) per item. Pruning more is probably not productive, as you're already keeping only a small fraction of all of them.
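
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes about 2M items (my reading of "a couple million") and uses the rough ~20 bytes per stored pair figure; the class and method names are made up for illustration and are not part of Mahout.

```java
/** Back-of-the-envelope sizing for a pruned item-item similarity table. */
final class PruningEstimate {

    /** Entries stored when each item keeps at most its top-N neighbours. */
    static long keptEntries(long items, int topN) {
        return items * topN;
    }

    /** Number of distinct unordered item pairs: n * (n - 1) / 2. */
    static long allPairs(long items) {
        return items * (items - 1) / 2;
    }

    /** Heap estimate in bytes for a given (assumed) per-entry cost. */
    static long heapBytes(long entries, int bytesPerEntry) {
        return entries * bytesPerEntry;
    }

    public static void main(String[] args) {
        long items = 2_000_000L;  // assumption: "a couple million" items
        int topN = 50;            // top 50 similarities kept per item
        int bytesPerEntry = 20;   // rough estimate, not a measured value

        long kept = keptEntries(items, topN);
        long all = allPairs(items);
        System.out.printf("kept entries:  %,d%n", kept);
        System.out.printf("all pairs:     %,d%n", all);
        System.out.printf("fraction kept: %.6f%%%n", 100.0 * kept / all);
        System.out.printf("heap estimate: %.2f GB%n",
                heapBytes(kept, bytesPerEntry) / 1e9);
    }
}
```

Under those assumptions the kept entries are a tiny fraction of all possible pairs, so pruning harder buys little; the per-entry overhead of whatever structure holds them matters more.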
(100M pairs at ~20 bytes per pair should fit in about 2GB of heap. That's a lot of the 4GB you have available, but it seems like it ought to about fit. Are you giving Java enough heap? Here are my general default settings for this kind of app, which are applicable here too: http://myrrix.com/documentation-serving-layer/)

You just have a load of items. Any process that scales as the square of the number of items is going to hurt when you get to millions of them. A process based on user-user similarity, when there are 40M users, is only going to be much worse.

Consider not pre-computing all these pairs. Compute them and cache them in real time. Instead, use the CandidateItemStrategy to significantly reduce the number of item-item similarities you need to look at. That may mitigate the fact that you don't have them all in memory.

You can throw more hardware at this, if you're willing to move to a completely batch-oriented Hadoop-based computation. You won't be limited by RAM, but it will be an offline process.

I am a big fan of matrix-factorization-based approaches at the moment, since you can run most of the computation offline whenever you like but still make real-time approximate updates. These sorts of things scale only linearly with the number of items and users, and not even with the size of the pref input. I think you may have to shoot for this kind of hybrid system in the end to do updates in real time.

On Thu, Jun 21, 2012 at 11:26 PM, Way Cool <[email protected]> wrote:
> Thanks guys for your quick response.
>
> We have a couple millions of items and 40 millions users (including
> anonymous users). Up to 50 items were generated per item.
>
> I will try minimum similarity. Is there any document or a parameter defined
> in itemsimilarity job?
>
> What about user-based recommendation? Any ideas how we can make that happen
> without loading everything in memory?
>
> Thanks.
>
>
> On Thu, Jun 21, 2012 at 3:29 PM, Sean Owen <[email protected]> wrote:
>
>> I would suggest pruning similarities near 0, and then treating missing
>> similarities as 0 later at runtime. It may take a bit of coding. But
>> you should be able to throw away a lot without compromising much of
>> the result.
>>
>> On Thu, Jun 21, 2012 at 10:16 PM, Way Cool <[email protected]> wrote:
>> > Hi, guys,
>> >
>> > For item-based recommendation, I pre-calculated the item similarities on
>> > Hadoop per algorithm, which generated 20m rows each. The problem now is I
>> > can't just load them into memory via MySQLJDBCInMemoryItemSimilarity with
>> > 4GB memory. I tried MySQLJDBCItemSimilarity, however it's way too slow.
>> > What are the alternatives?
>> >
>> > For user-based recommendation, I can't load 100m lines of data model from
>> > FileDataModel into memory. It ran out of memory after 20m lines. The same
>> > issue with JDBCDataModel is way too slow. Does anyone precalculate the user
>> > similarities before and recommend items to a user?
>> >
>> > Anyone had the similar issues before?
>> >
>> > Thanks,
>> >
>> > YG
>>
