Hi Sean,

Thanks for so many ideas, I will look into these.

Unfortunately the amount of data we are dealing with is quite substantial. There are 1000+ groups and about 40 million users to analyse. Moreover, the business need is to eventually have an even bigger number of groups. Each user can belong to many groups, so the number of combinations is rather big. In fact, it is so large that I am considering sampling the users and analysing only 1 in about 256 of them. So essentially I would have 1000+ groups and about 150k users. Since one user can potentially belong to many dozens of groups, this will easily go into millions of records anyway, but perhaps it will stay below the 100M mark you mentioned.
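A minimal sketch of the 1-in-256 sampling idea, assuming memberships arrive as (userID, groupID) pairs. Hashing the user ID rather than drawing a random number per row is an assumption made here so that a sampled user keeps all of their group memberships; the names `keep_user` and `sample_memberships` are hypothetical. For scale, 40,000,000 / 256 = 156,250 users, which matches the "about 150k" estimate.

```python
import hashlib

SAMPLE_RATE = 256  # keep roughly 1 in 256 users


def keep_user(user_id: str, rate: int = SAMPLE_RATE) -> bool:
    """Decide deterministically whether a user is in the sample.

    A stable hash (MD5 here, used only for bucketing, not security)
    maps each user ID to one of `rate` buckets; bucket 0 is kept.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % rate
    return bucket == 0


def sample_memberships(rows, rate: int = SAMPLE_RATE):
    """Yield only the (user_id, group_id) rows of sampled users."""
    for user_id, group_id in rows:
        if keep_user(user_id, rate):
            yield user_id, group_id
```

Because the decision depends only on the user ID, every membership row of a kept user survives, so the overlap between groups is thinned uniformly rather than distorted row by row.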
Yesterday I wasn't sure if my existing cluster was big enough for this, and now I'm tempted to try doing it on one machine. Nice one.

Cheers,
Radek

On 18 February 2011 15:13, Sean Owen <[email protected]> wrote:
> This looks like a simple collaborative filtering problem, or at least
> can be solved that way. It's not even recommendation, just an item
> similarity problem.
>
> Users are users and groups are items. You are just computing item-item
> similarity based on some metric and there are several implemented in
> the library.
>
> Forget Hadoop for now as I doubt this is nearly of the scale where you
> need it. For a quick solution, make a file of "userID,groupID" entries
> for every membership. Create a FileDataModel on top of it. Then
> instantiate LogLikelihoodSimilarity on top of that for example. It
> will score the "similarity" between any two groups based on
> membership. The result is between 0 and 1.
>
> On Thu, Feb 17, 2011 at 2:34 PM, Radek Maciaszek
> <[email protected]> wrote:
> > Hello,
> >
> > I have the following problem and I am trying to figure out if using
> > Mahout is a good idea for this or perhaps there may be a much simpler
> > approach.
> >
> > Consider I have users who can belong to many groups:
> > user1: group1, group2
> > user2: group2
> > user3: group2, group3
> > ... and millions more
> >
> > I am trying to find similarities between the groups (not the users).
> > Some simple similarity metric (e.g. 0-1, close to 0 for not similar at
> > all, close to 1 for very similar) would be ideal. So essentially I need
> > to calculate such a metric for every pair of groups.
> >
> > Is it something Mahout can help me with?
> >
> > Many thanks,
> > Radek
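The pairwise group similarity discussed in this thread can be sketched on a single machine as follows. This is not Mahout and not its log-likelihood measure: as a simpler stand-in it uses the Jaccard (Tanimoto) coefficient, which also falls between 0 and 1. The function name `group_similarities` is hypothetical, and the input is assumed to be the same (userID, groupID) pairs as the "userID,groupID" file described above.

```python
from collections import defaultdict
from itertools import combinations


def group_similarities(memberships):
    """Return {(group_a, group_b): Jaccard similarity} for every pair.

    Jaccard = |members in both groups| / |members in either group|,
    so 0 means no shared users and 1 means identical membership.
    """
    members = defaultdict(set)
    for user_id, group_id in memberships:
        members[group_id].add(user_id)

    sims = {}
    for a, b in combinations(sorted(members), 2):
        shared = len(members[a] & members[b])
        either = len(members[a] | members[b])
        sims[(a, b)] = shared / either
    return sims


# The three users from the original question.
example = [
    ("user1", "group1"), ("user1", "group2"),
    ("user2", "group2"),
    ("user3", "group2"), ("user3", "group3"),
]

for pair, score in group_similarities(example).items():
    print(pair, round(score, 3))
```

With 1000 groups this loop visits about 500,000 pairs, which is why an in-memory approach on sampled data looks feasible before reaching for a cluster.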
