4GB can fit one or two types of item similarities; however, I have a couple more based on different similarity measurements.
For user-user similarity, I don't think we can compute and cache them at run time because of the high memory consumption. As you know, the data model (preferences) alone can't fit in 4GB of memory.

I will try SVD and ALS. Are they good for both user-based and item-based recommendations?

Thanks.

On Thu, Jun 21, 2012 at 4:55 PM, Sean Owen wrote:

> OK, you're already pruning a fair bit then, in the sense that you keep
> top 50 similarities (by absolute value) per item. More is probably not
> productive as you're already keeping only a small fraction of all of
> them.
>
> (100M pairs and ~20 bytes needed per pair... should get in about 2GB
> of heap. That's a lot of the 4GB you have available but seems like it
> ought to about fit? Are you giving Java enough heap? Here are my
> general default settings for this kind of app -- applicable here too:
> http://myrrix.com/documentation-serving-layer/)
>
> You just have a load of items. Any process that scales as the square
> of the number of items is going to hurt when you get to millions of
> them. A process based on user-user similarity, when there are 40M
> users, is only going to be much worse.
>
> Consider not pre-computing all these pairs. Instead, compute them and
> cache them in real time. Use the CandidateItemStrategy to
> significantly reduce the number of item-item similarities you need to
> look at. That may mitigate the fact that you don't have them all in
> memory.
>
> You can throw more hardware at this, if you're willing to move to a
> completely batch-oriented Hadoop-based computation. You won't be
> limited by RAM but it will be an offline process.
>
> I am a big fan of matrix-factorization-based approaches at the moment,
> since you can run most of the computation offline whenever you like,
> but still make real-time approximate updates. These sorts of things
> only scale linearly with the number of items and users, and not even
> with the size of the pref input. I think you may have to shoot for
> this kind of hybrid system in the end to do updates in real time.
>
> On Thu, Jun 21, 2012 at 11:26 PM, Way Cool wrote:
> > Thanks guys for your quick response.
> >
> > We have a couple of million items and 40 million users (including
> > anonymous users). Up to 50 similar items were generated per item.
> >
> > I will try minimum similarity. Is there any document or a parameter
> > defined in the itemsimilarity job?
> >
> > What about user-based recommendation? Any ideas how we can make that
> > happen without loading everything in memory?
> >
> > Thanks.
> >
> > On Thu, Jun 21, 2012 at 3:29 PM, Sean Owen wrote:
> >
> >> I would suggest pruning similarities near 0, and then treating
> >> missing similarities as 0 later at runtime. It may take a bit of
> >> coding. But you should be able to throw away a lot without
> >> compromising much of the result.
> >>
> >> On Thu, Jun 21, 2012 at 10:16 PM, Way Cool wrote:
> >> > Hi, guys,
> >> >
> >> > For item-based recommendation, I pre-calculated the item
> >> > similarities on Hadoop per algorithm, which generated 20m rows
> >> > each. The problem now is I can't just load them into memory via
> >> > MySQLJDBCInMemoryItemSimilarity with 4GB memory. I tried
> >> > MySQLJDBCItemSimilarity, however it's way too slow. What are the
> >> > alternatives?
> >> >
> >> > For user-based recommendation, I can't load 100m lines of data
> >> > model from FileDataModel into memory. It ran out of memory after
> >> > 20m lines. JDBCDataModel has the same issue as above: it is way
> >> > too slow. Has anyone precalculated the user similarities and then
> >> > recommended items to a user?
> >> >
> >> > Has anyone had similar issues before?
> >> >
> >> > Thanks,
> >> >
> >> > YG
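
For reference, the suggestion quoted in this thread (prune similarities near 0, then treat missing similarities as 0 at runtime) could be sketched roughly as follows. This is plain Python rather than Mahout code. The function names, the 0.05 threshold and the sample pairs are all made up for illustration; only the top-50 cutoff comes from the thread.

```python
from collections import defaultdict


def prune_similarities(pairs, min_abs_similarity=0.05, top_n=50):
    """Drop near-zero similarities and keep at most top_n neighbours per item.

    pairs is an iterable of (item_a, item_b, similarity) tuples.
    The threshold value is an illustrative assumption, not a recommendation.
    """
    by_item = defaultdict(list)
    for item_a, item_b, sim in pairs:
        if abs(sim) < min_abs_similarity:
            continue  # pruned: treated as 0 at lookup time
        # Store both directions so lookups work from either item.
        by_item[item_a].append((item_b, sim))
        by_item[item_b].append((item_a, sim))

    pruned = {}
    for item, neighbours in by_item.items():
        # Rank by absolute value, as in the "top 50 per item" setup.
        neighbours.sort(key=lambda n: abs(n[1]), reverse=True)
        pruned[item] = dict(neighbours[:top_n])
    return pruned


def similarity(pruned, item_a, item_b):
    """Return the stored similarity, or 0.0 if the pair was pruned or never seen."""
    return pruned.get(item_a, {}).get(item_b, 0.0)


# Tiny made-up example: the (1, 3) pair is near zero and gets dropped.
pairs = [(1, 2, 0.9), (1, 3, 0.01), (2, 3, -0.4)]
pruned = prune_similarities(pairs, min_abs_similarity=0.05, top_n=50)
```

Note that once the per-item cutoff kicks in, a pair may survive in one item's neighbour list but not the other's, so a lookup can return 0.0 in one direction only. Whether that matters depends on how the recommender consumes the similarities.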
