Reservoir sampling lets you build a good fixed-size sample set for each user. This issue has code demonstrating the approach:
https://issues.apache.org/jira/browse/MAHOUT-676

How to do this in an efficient way? No idea.

On Sat, Jul 2, 2011 at 9:18 AM, Ted Dunning <[email protected]> wrote:
> Don't sample at a constant rate.
>
> Either downsample user ratings so that no user has more than a reasonable
> number of ratings or downsample users so that no thing has more than a
> reasonable number of users rating it.
>
> I generally prefer the former, but either should be fine.
>
> On Sat, Jul 2, 2011 at 3:47 AM, Radek Maciaszek <[email protected]> wrote:
>
>> Hello,
>>
>> This project was put on hold for a while so I only had time to look into
>> it recently. I was thinking about the idea of down-sampling and different
>> sampling strategies.
>>
>> What would be the minimum rate of sampling the users? Right now I sample
>> 1 in 256 users. But if there are only 400 users in a group I will not get
>> as good an estimate as if there were 10k users. I am trying to find a
>> strategy for downsampling here.
>>
>> I was hoping there should be some statistical way of estimating the
>> sampling ratio?
>>
>> Cheers,
>> Radek
>>
>> On 18 February 2011 18:04, Sebastian Schelter <[email protected]> wrote:
>>
>> > This shouldn't be too difficult and would maybe make a good newcomer or
>> > student project.
>> >
>> > --sebastian
>> >
>> > On 18.02.2011 18:19, Ted Dunning wrote:
>> > > A better way to sample is to find groups with a very large number of
>> > > users and downsample the number of users to a maximum of about 1000
>> > > (or even 200 if you want to be more aggressive). Do the same with
>> > > users.
>> > >
>> > > That won't delete a whole lot of data volume, but it will make most
>> > > recommendation algorithms go much faster. The idea is that after you
>> > > have 200 or more users in a group, you aren't learning anything new
>> > > anyway.
>> > >
>> > > On Fri, Feb 18, 2011 at 7:41 AM, Radek Maciaszek
>> > > <[email protected]> wrote:
>> > >
>> > >> Each user can belong to many groups so the number of combinations
>> > >> is rather big. In fact this number of combinations is so large I am
>> > >> considering sampling the users and only analysing 1 in about 256
>> > >> users. So essentially I would have about 1000+ groups and about
>> > >> 150k users. Since one user can potentially belong to many dozens of
>> > >> groups this will easily go into millions of records anyway but
>> > >> perhaps will be lower than the 100M margin you mentioned.
>> > >>
>> > >
>> >
>> >

--
Lance Norskog
[email protected]
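
For readers who want to see the idea in miniature, here is a rough Python sketch of per-user reservoir sampling (the classic "Algorithm R"). It is not the code attached to MAHOUT-676. The function name cap_ratings_per_user, its parameters, and the (user, item, value) tuple layout are all assumptions made for illustration. It does one pass over the ratings and keeps at most max_per_user ratings per user, each chosen with equal probability.

```python
import random
from collections import defaultdict


def cap_ratings_per_user(ratings, max_per_user, seed=None):
    """Keep at most max_per_user ratings per user, sampled uniformly.

    ratings: iterable of (user_id, item_id, value) tuples, in any order.
    Returns a dict mapping user_id -> list of (item_id, value).
    Hypothetical helper for illustration; not part of Mahout.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # user_id -> ratings kept so far
    seen = defaultdict(int)         # user_id -> ratings seen so far

    for user_id, item_id, value in ratings:
        seen[user_id] += 1
        reservoir = reservoirs[user_id]
        if len(reservoir) < max_per_user:
            # Reservoir not full yet: keep every rating.
            reservoir.append((item_id, value))
        else:
            # Reservoir full: the n-th rating replaces a random slot
            # with probability max_per_user / n.
            j = rng.randrange(seen[user_id])
            if j < max_per_user:
                reservoir[j] = (item_id, value)
    return dict(reservoirs)


if __name__ == "__main__":
    # One heavy user with 1000 ratings and one light user with 3.
    data = [("heavy", i, 1.0) for i in range(1000)]
    data += [("light", i, 1.0) for i in range(3)]
    sample = cap_ratings_per_user(data, max_per_user=200, seed=1)
    print({user: len(kept) for user, kept in sample.items()})
```

Users under the cap keep all of their ratings, so only the heavy users lose data. Keying the reservoirs on item_id (or group) instead of user_id gives the other option Ted describes, capping the number of users per thing. Note that this sketch holds every reservoir in memory, which may not suit a distributed job.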
