4GB can fit one or two types of item similarities; however, I have a couple
more based on different similarity measures.
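
To make the numbers concrete, here is a rough back-of-envelope sketch of
that estimate. It assumes about 2M items, 50 similarities kept per item, and
the ~20 bytes per pair quoted below; these are estimates from this thread,
not measurements, and the class name HeapEstimate is only for illustration.

```java
// Back-of-envelope heap estimate for in-memory item-item similarities.
// The 20 bytes/pair figure is the rough estimate quoted in this thread,
// not a measured value. It ignores the data model and other JVM overhead.
public class HeapEstimate {

    /** Estimated bytes needed to hold the given number of similarity pairs. */
    public static long estimateBytes(long pairs, long bytesPerPair) {
        return pairs * bytesPerPair;
    }

    /** How many similarity tables of the given size fit in the given heap. */
    public static long tablesThatFit(long heapBytes, long bytesPerTable) {
        return heapBytes / bytesPerTable;
    }

    public static void main(String[] args) {
        long items = 2_000_000L;       // "a couple million" items, assumed 2M
        long perItem = 50;             // up to 50 similarities kept per item
        long pairs = items * perItem;  // 100M pairs per similarity type
        long perTable = estimateBytes(pairs, 20);
        long heap = 4L * 1024 * 1024 * 1024; // 4GB of heap

        System.out.println("pairs per table: " + pairs);
        System.out.println("bytes per table: " + perTable);
        System.out.println("tables fitting in heap: " + tablesThatFit(heap, perTable));
    }
}
```

Under these assumptions each similarity type needs about 2GB, so at most two
fit in 4GB even before the data model itself is counted.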

For user-user similarities, I don't think we can compute and cache them
at run time because of the high memory consumption. As you know, the data
model (preferences) alone can't fit in 4GB of memory.

I will try SVD and ALS. Are they good for both user-based and item-based
recommendations?

Thanks.


On Thu, Jun 21, 2012 at 4:55 PM, Sean Owen <[email protected]> wrote:

> OK, you're already pruning a fair bit then, in the sense that you keep
> top 50 similarities (by absolute value) per item. More is probably not
> productive as you're already keeping only a small fraction of all of
> them.
>
> (100M pairs at ~20 bytes per pair should fit in about 2GB of heap.
> That's a lot of the 4GB you have available, but it seems like it
> ought to about fit. Are you giving Java enough heap? Here are my
> general default settings for this kind of app, applicable here too:
> http://myrrix.com/documentation-serving-layer/)
>
> You just have a load of items. Any process that scales as the square
> of the number of items is going to hurt when you get to millions of
> them. A process based on user-user similarity, when there are 40M, is
> only going to be much worse.
>
> Consider not pre-computing all these pairs. Instead, compute them and
> cache them in real-time, and use the CandidateItemStrategy to
> significantly reduce the number of item-item similarities you need to
> look at. That may mitigate the fact that you don't have them all in
> memory.
>
>
> You can throw more hardware at this, if you're willing to move to a
> completely batch-oriented Hadoop-based computation. You won't be
> limited by RAM but it will be an offline process.
>
>
> I am a big fan of matrix-factorization-based approaches at the moment since you
> can run most of the computation offline whenever you like, but still
> make real-time approximate updates. These sorts of things only scale
> linearly with the number of items and users, and not even with the
> size of the pref input. I think you may have to shoot for this kind of
> hybrid system in the end to do updates in real-time.
>
>
> On Thu, Jun 21, 2012 at 11:26 PM, Way Cool <[email protected]> wrote:
> > Thanks guys for your quick response.
> >
> > We have a couple million items and 40 million users (including
> > anonymous users). Up to 50 similar items were generated per item.
> >
> > I will try a minimum similarity. Is there any documentation, or a
> > parameter defined in the itemsimilarity job?
> >
> > What about user-based recommendation? Any ideas how we can make that
> > happen without loading everything in memory?
> >
> > Thanks.
> >
> >
> > On Thu, Jun 21, 2012 at 3:29 PM, Sean Owen <[email protected]> wrote:
> >
> >> I would suggest pruning similarities near 0, and then treating missing
> >> similarities as 0 later at runtime. It may take a bit of coding. But
> >> you should be able to throw away a lot without compromising much of
> >> the result.
> >>
> >> On Thu, Jun 21, 2012 at 10:16 PM, Way Cool <[email protected]> wrote:
> >> > Hi, guys,
> >> >
> >> > For item-based recommendation, I pre-calculated the item similarities
> >> > on Hadoop per algorithm, which generated 20m rows each. The problem
> >> > now is I can't just load them into memory via
> >> > MySQLJDBCInMemoryItemSimilarity with 4GB memory. I tried
> >> > MySQLJDBCItemSimilarity, however it's way too slow. What are the
> >> > alternatives?
> >> >
> >> > For user-based recommendation, I can't load 100m lines of data model
> >> > from FileDataModel into memory. It ran out of memory after 20m lines.
> >> > JDBCDataModel has the same problem as above: it is way too slow. Does
> >> > anyone precalculate the user similarities beforehand and then
> >> > recommend items to a user?
> >> >
> >> > Has anyone had similar issues before?
> >> >
> >> > Thanks,
> >> >
> >> > YG
> >>
>
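
As a rough illustration of the pruning suggested in the quoted thread (drop
similarities near 0, then treat missing similarities as 0 at runtime), here
is a minimal sketch. It is plain Java and not part of Mahout: the class name
PrunedSimilarities, its methods, and the threshold value are all
hypothetical, and a real implementation would want a more compact storage
layout than nested HashMaps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sparse store for item-item similarities. Pairs whose absolute
// similarity is below a threshold are never stored; lookups for pairs that
// are not stored return 0.0.
public class PrunedSimilarities {
    private final double minAbsSimilarity;
    private final Map<Long, Map<Long, Double>> sims = new HashMap<>();
    private int storedPairs = 0;

    public PrunedSimilarities(double minAbsSimilarity) {
        this.minAbsSimilarity = minAbsSimilarity;
    }

    /** Stores the pair only if |sim| reaches the threshold; returns true if kept. */
    public boolean put(long itemA, long itemB, double sim) {
        if (Math.abs(sim) < minAbsSimilarity) {
            return false; // pruned: too close to 0 to matter much
        }
        // Store each unordered pair once, keyed by (smaller ID, larger ID).
        long lo = Math.min(itemA, itemB);
        long hi = Math.max(itemA, itemB);
        Double previous = sims.computeIfAbsent(lo, k -> new HashMap<>()).put(hi, sim);
        if (previous == null) {
            storedPairs++;
        }
        return true;
    }

    /** Returns the stored similarity, or 0.0 if the pair was pruned or never seen. */
    public double get(long itemA, long itemB) {
        long lo = Math.min(itemA, itemB);
        long hi = Math.max(itemA, itemB);
        Map<Long, Double> row = sims.get(lo);
        if (row == null) {
            return 0.0;
        }
        return row.getOrDefault(hi, 0.0);
    }

    /** Number of distinct pairs actually held in memory. */
    public int size() {
        return storedPairs;
    }

    public static void main(String[] args) {
        PrunedSimilarities store = new PrunedSimilarities(0.1); // threshold is an assumption
        store.put(1L, 2L, 0.8);
        store.put(1L, 3L, 0.02); // dropped
        System.out.println(store.get(2L, 1L));
        System.out.println(store.get(1L, 3L));
        System.out.println(store.size());
    }
}
```

The threshold trades memory for accuracy, so it would need tuning against
the actual similarity distribution.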
