A better way to sample is to find groups with a very large number of users and downsample each of those groups to a maximum of about 1000 users (or even 200 if you want to be more aggressive). Do the same for users who belong to a very large number of groups.
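A minimal sketch of that capping step, assuming the data is a list of (user, group) membership pairs and reading the per-user step as capping the number of groups kept for any one user. The function names, the default caps, and the in-memory approach are illustrative assumptions, not anything from Mahout; at real scale the same idea would run as a distributed job.

```python
import random
from collections import defaultdict


def cap_per_key(pairs, key_index, max_per_key, rng):
    """Keep at most max_per_key pairs for each distinct value at key_index."""
    buckets = defaultdict(list)
    for pair in pairs:
        buckets[pair[key_index]].append(pair)

    kept = []
    for members in buckets.values():
        if len(members) > max_per_key:
            # Uniform random sample without replacement from the oversized bucket.
            members = rng.sample(members, max_per_key)
        kept.extend(members)
    return kept


def downsample(pairs, max_users_per_group=1000, max_groups_per_user=1000, seed=0):
    """Cap users per group, then cap groups per user (hypothetical helper).

    pairs is a list of (user, group) tuples; small groups and ordinary
    users pass through untouched.
    """
    rng = random.Random(seed)
    pairs = cap_per_key(pairs, 1, max_users_per_group, rng)  # index 1 = group
    pairs = cap_per_key(pairs, 0, max_groups_per_user, rng)  # index 0 = user
    return pairs
```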
That won't remove a whole lot of data volume, but it will make most recommendation algorithms go much faster. The idea is that once you have 200 or more users in a group, you aren't learning anything new anyway.

On Fri, Feb 18, 2011 at 7:41 AM, Radek Maciaszek <[email protected]> wrote:

> Each user can belong to many groups so the number of combinations is
> rather big. In fact this number of combinations is so large I am
> considering to sample the users and only analyse 1 in about 256 users.
> So essentially I would have about 1000+ groups and about 150k users.
> Since one user can potentially belong to many dozens of groups this
> will easily go into millions of records anyway but perhaps will be
> lower than 100M margin you mentioned.
