This reminds me of something I had in mind some weeks ago: we should invest
some work to give ItemSimilarityJob and RecommenderJob the ability to
use pluggable, customizable "sampling strategies".

This shouldn't be too difficult and would maybe make a good newcomer or
student project.
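
To make the idea concrete, here is a rough sketch in plain Python (not Mahout
code) of the kind of downsampling Ted describes below. Everything in it is an
assumption for illustration: the function names `cap_per_key` and `downsample`,
the (user, group) tuple layout, and the reading of "do the same with users" as
"also cap the number of groups kept per user". A pluggable strategy could be
any function with the same shape as `downsample`.

```python
import random
from collections import defaultdict


def cap_per_key(pairs, key_index, max_per_key, rng):
    """Keep at most max_per_key pairs for each distinct value at key_index."""
    buckets = defaultdict(list)
    for pair in pairs:
        buckets[pair[key_index]].append(pair)

    kept = []
    for members in buckets.values():
        if len(members) > max_per_key:
            # Uniform random sample without replacement from the oversized bucket.
            members = rng.sample(members, max_per_key)
        kept.extend(members)
    return kept


def downsample(memberships, max_users_per_group=1000,
               max_groups_per_user=1000, seed=42):
    """Hypothetical sampling strategy over (user, group) pairs.

    First caps the number of users kept per group, then caps the number of
    groups kept per user. Small groups and light users pass through unchanged.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same sample
    capped = cap_per_key(memberships, 1, max_users_per_group, rng)
    return cap_per_key(capped, 0, max_groups_per_user, rng)
```

Only the oversized groups lose records, so the total data volume shrinks
little while the most expensive groups become much cheaper to process.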

--sebastian

On 18.02.2011 18:19, Ted Dunning wrote:
> A better way to sample is to find groups with a very large number of users
> and downsample the number of users to a maximum of about 1000 (or even 200
> if you want to be more aggressive).  Do the same with users.
> 
> That won't delete a whole lot of data volume, but it will make most
> recommendation algorithms go much faster.  The idea is that after you have
> 200 or more users in a group, you aren't learning anything new anyway.
> 
> On Fri, Feb 18, 2011 at 7:41 AM, Radek Maciaszek
> <[email protected]> wrote:
> 
>>  Each user can belong to many groups so the number of combinations is
>> rather big. In fact this number of combinations is so large I am
>> considering sampling the users and only analysing 1 in about 256 users.
>> So essentially I would have about 1000+ groups and about 150k users.
>> Since one user can potentially belong to many dozens of groups this will
>> easily go into millions of records anyway, but perhaps will be lower than
>> the 100M margin you mentioned.
>>
> 
