I think the in-memory solution will work at that scale. You may have to increase the heap to 4GB or more (that is, you may have to find a large-ish machine). But yes, you can probably get perfectly good results by sampling even a fraction of the input, which definitely fits in memory. That's the place to start, I think.
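For what it's worth, here is a rough sketch of the sampling step, assuming the "userID,groupID" file uses numeric user IDs. The class and method names (MembershipSampler, keepUser, sample) and the bit-mixing constant are made up for illustration and are not part of Mahout. The one point that matters is to sample by user rather than by line, so that every kept user retains all of their group memberships.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: keeps roughly 1 in `rate` users.
public class MembershipSampler {

    // Decide per user, not per line, so a kept user keeps all memberships.
    static boolean keepUser(long userId, int rate) {
        // Mix the bits so sequential IDs spread evenly (arbitrary odd constant).
        long h = userId * 0x9E3779B97F4A7C15L;
        h ^= (h >>> 32);
        return Long.remainderUnsigned(h, rate) == 0;
    }

    // Filters "userID,groupID" lines down to the sampled users.
    static List<String> sample(List<String> lines, int rate) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String userField = line.substring(0, line.indexOf(',')).trim();
            if (keepUser(Long.parseLong(userField), rate)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy input mirroring the example in the original question.
        List<String> lines = Arrays.asList("1,1", "1,2", "2,2", "3,2", "3,3");
        List<String> kept = sample(lines, 256);
        System.out.println("kept " + kept.size() + " of " + lines.size() + " lines");
    }
}
```

Because the decision depends only on the user ID, re-running it gives the same sample, and the filtered lines can be written straight back out as the "userID,groupID" file.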
On Fri, Feb 18, 2011 at 6:41 PM, Radek Maciaszek <[email protected]> wrote:
> Hi Sean,
>
> Thanks for so many ideas, I will look into these. Unfortunately the amount
> of data we are dealing with is quite substantial. There are about 1000+
> groups and about 40 million users to analyse. Moreover, the business need
> is to eventually have an even bigger number of groups. Each user can belong
> to many groups, so the number of combinations is rather big. In fact this
> number of combinations is so large that I am considering sampling the users
> and only analysing 1 in about 256 users. So essentially I would have about
> 1000+ groups and about 150k users. Since one user can potentially belong to
> many dozens of groups, this will easily go into millions of records anyway,
> but perhaps will be lower than the 100M margin you mentioned.
>
> Yesterday I wasn't sure if my existing cluster is big enough for this and
> now I'm tempted to try to do this on one machine. Nice one.
>
> Cheers,
> Radek
>
> On 18 February 2011 15:13, Sean Owen <[email protected]> wrote:
>
>> This looks like a simple collaborative filtering problem, or at least
>> can be solved that way. It's not even recommendation, just an item
>> similarity problem.
>>
>> Users are users and groups are items. You are just computing item-item
>> similarity based on some metric and there are several implemented in
>> the library.
>>
>> Forget Hadoop for now as I doubt this is nearly of the scale where you
>> need it. For a quick solution, make a file of "userID,groupID" entries
>> for every membership. Create a FileDataModel on top of it. Then
>> instantiate LogLikelihoodSimilarity on top of that, for example. It
>> will score the "similarity" between any two groups based on
>> membership. The result is between 0 and 1.
>>
>> On Thu, Feb 17, 2011 at 2:34 PM, Radek Maciaszek
>> <[email protected]> wrote:
>> > Hello,
>> >
>> > I have the following problem and I am trying to figure out if using
>> > Mahout is a good idea for this or perhaps there may be a much simpler
>> > approach.
>> >
>> > Consider I have users who can belong to many groups:
>> > user1: group1, group2
>> > user2: group2
>> > user3: group2, group3
>> > ... and millions more
>> >
>> > I am trying to find similarities between the groups (not the users).
>> > Some simple similarity metric (e.g. 0-1, close to 0 for not similar at
>> > all, close to 1 for very similar) would be ideal. So essentially I need
>> > to calculate such a metric for every pair of groups.
>> >
>> > Is it something Mahout can help me with?
>> >
>> > Many thanks,
>> > Radek
