Hello,

This project was put on hold for a while, so I only had time to look into it recently. I have been thinking about the idea of down-sampling and different sampling strategies.

What would be the minimum rate at which to sample users? Right now I sample 1 in 256 users, but if a group has only 400 users I will not get as good an estimate as if it had 10k users. I am trying to work out a downsampling strategy here, and I was hoping there would be some statistical way of estimating the sampling ratio?
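One textbook approach would be the normal-approximation sample size for estimating a proportion, combined with a finite-population correction for small groups. Here is a quick Java sketch of what I mean (the class name and the 95% confidence / +/-5% error figures are just placeholders I picked to illustrate the idea, not anything from Mahout):

/**
 * Rough sketch: how many users per group do we actually need?
 * Uses the normal-approximation sample size for a proportion with
 * worst-case variance (p = 0.5), plus a finite-population correction.
 */
public class GroupSampleSize {

  // Base sample size: n0 = z^2 * p * (1 - p) / e^2, with p = 0.5 (worst case).
  static long baseSampleSize(double z, double e) {
    return (long) Math.ceil((z * z * 0.25) / (e * e));
  }

  // Finite-population correction: n = n0 / (1 + (n0 - 1) / N).
  // Small groups need proportionally fewer sampled users.
  static long correctedSampleSize(long n0, long groupSize) {
    return (long) Math.ceil(n0 / (1.0 + (n0 - 1) / (double) groupSize));
  }

  public static void main(String[] args) {
    double z = 1.96;  // ~95% confidence (placeholder)
    double e = 0.05;  // +/- 5% margin of error (placeholder)
    long n0 = baseSampleSize(z, e);  // about 385, independent of group size
    for (long groupSize : new long[] {400, 10000, 1000000}) {
      long n = correctedSampleSize(n0, groupSize);
      System.out.printf("group=%d -> sample %d users (1 in %.1f)%n",
          groupSize, n, groupSize / (double) n);
    }
  }
}

With those numbers the base sample size comes out at about 385 users regardless of how large the group is, and the correction shrinks it to roughly 200 for a 400-user group. That seems consistent with Ted's point below that beyond a few hundred users per group you aren't learning much new, and it suggests capping each group at a fixed sample size rather than using a fixed 1-in-256 ratio everywhere.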
Cheers,
Radek

On 18 February 2011 18:04, Sebastian Schelter <[email protected]> wrote:
> This shouldn't be too difficult and would maybe make a good newcomer or
> student project.
>
> --sebastian
>
> On 18.02.2011 18:19, Ted Dunning wrote:
> > A better way to sample is to find groups with a very large number of
> > users and downsample the number of users to a maximum of about 1000
> > (or even 200 if you want to be more aggressive). Do the same with
> > users.
> >
> > That won't delete a whole lot of data volume, but it will make most
> > recommendation algorithms go much faster. The idea is that after you
> > have 200 or more users in a group, you aren't learning anything new
> > anyway.
> >
> > On Fri, Feb 18, 2011 at 7:41 AM, Radek Maciaszek <[email protected]> wrote:
> >
> >> Each user can belong to many groups so the number of combinations is
> >> rather big. In fact this number of combinations is so large I am
> >> considering sampling the users and only analysing 1 in about 256
> >> users. So essentially I would have about 1000+ groups and about 150k
> >> users. Since one user can potentially belong to many dozens of groups
> >> this will easily go into millions of records anyway, but perhaps it
> >> will be lower than the 100M margin you mentioned.
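P.S. To make the capping idea from Ted's mail above concrete for myself, I sketched it as plain reservoir sampling, which keeps a uniform random sample of at most maxUsers from each group in a single pass (the class and method names are mine, not anything from Mahout):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Downsamples one group's users to a fixed maximum via reservoir sampling. */
public class GroupDownsampler {

  static List<Long> downsample(Iterable<Long> userIds, int maxUsers, Random rng) {
    List<Long> reservoir = new ArrayList<Long>(maxUsers);
    long seen = 0;
    for (long userId : userIds) {
      seen++;
      if (reservoir.size() < maxUsers) {
        reservoir.add(userId);  // fill the reservoir with the first maxUsers ids
      } else {
        // Keep the new id with probability maxUsers / seen, evicting a random
        // element; this leaves every id equally likely to be in the sample.
        long j = (long) (rng.nextDouble() * seen);
        if (j < maxUsers) {
          reservoir.set((int) j, userId);
        }
      }
    }
    return reservoir;
  }
}

Run once per group with maxUsers = 1000 (or 200 for the aggressive version), this leaves groups below the cap untouched and only trims the very large ones, which is presumably where most of the speed-up comes from.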
