Don't sample at a constant rate. Either downsample user ratings so that no user has more than a reasonable number of ratings or downsample users so that no thing has more than a reasonable number of users rating it.
I generally prefer the former, but either should be fine.

On Sat, Jul 2, 2011 at 3:47 AM, Radek Maciaszek <[email protected]> wrote:

> Hello,
>
> This project was put on hold for a while so I only had time to look into
> it recently. I was thinking about the idea of down-sampling and different
> sampling strategies.
>
> What would be the minimum rate of sampling the users? Right now I sample 1
> in 256 users. But if there are only 400 users in a group I will not get
> as good an estimate as if there were 10k users. I am trying to find out
> here the strategy for downsampling.
>
> I was hoping there should be some statistical way of estimating the sampling
> ratio?
>
> Cheers,
> Radek
>
> On 18 February 2011 18:04, Sebastian Schelter <[email protected]> wrote:
>
> > This shouldn't be too difficult and would maybe make a good newcomer or
> > student project.
> >
> > --sebastian
> >
> > On 18.02.2011 18:19, Ted Dunning wrote:
> > > A better way to sample is to find groups with a very large number of
> > > users and downsample the number of users to a maximum of about 1000
> > > (or even 200 if you want to be more aggressive). Do the same with users.
> > >
> > > That won't delete a whole lot of data volume, but it will make most
> > > recommendation algorithms go much faster. The idea is that after you
> > > have 200 or more users in a group, you aren't learning anything new
> > > anyway.
> > >
> > > On Fri, Feb 18, 2011 at 7:41 AM, Radek Maciaszek
> > > <[email protected]> wrote:
> > >
> > >> Each user can belong to
> > >> many groups so the number of combinations is rather big. In fact this
> > >> number of combinations is so large I am considering sampling the users
> > >> and only analysing 1 in about 256 users. So essentially I would have
> > >> about 1000+ groups and about 150k users. Since one user can potentially
> > >> belong to many dozens of groups this will easily go into millions of
> > >> records anyway but perhaps will be lower than the 100M margin you
> > >> mentioned.
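
For illustration, here is a minimal sketch of the capped sampling suggested at the top of this message, assuming the ratings are held in memory as (user, item) pairs. The function name cap_per_key, the seed parameter and the toy data are hypothetical and not taken from the thread. The cap of 200 is the more aggressive figure mentioned in the quoted discussion. Note the contrast with a constant rate: sampling 1 in 256 from a group of 400 users leaves only one or two users, while a cap leaves small groups untouched and trims only the oversized ones.

```python
import random
from collections import defaultdict


def cap_per_key(pairs, key, max_per_key, seed=0):
    """Keep at most max_per_key randomly chosen pairs for each key value.

    pairs: iterable of (user, item) tuples (assumed layout).
    key:   0 caps the number of ratings per user,
           1 caps the number of users per item.
    """
    rng = random.Random(seed)

    # Group the pairs by the chosen column.
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair[key]].append(pair)

    kept = []
    for members in groups.values():
        # Only oversized groups are downsampled; small ones pass through whole.
        if len(members) > max_per_key:
            members = rng.sample(members, max_per_key)
        kept.extend(members)
    return kept


# Toy data: one heavy user with 500 ratings, 50 light users with 3 each.
pairs = [("u0", f"i{n}") for n in range(500)]
pairs += [(f"u{u}", f"i{n}") for u in range(1, 51) for n in range(3)]

# Cap ratings per user at 200; use key=1 to cap users per item instead.
capped = cap_per_key(pairs, key=0, max_per_key=200)
```

This version groups everything in memory, which is fine for a sketch but would need a streaming or distributed variant for the data sizes discussed in the thread.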
