Mathematically speaking, random sampling is just fine. Stratifying on various criteria can help avoid loss of accuracy, so if you had several clusters, down-sampling the heavily represented clusters might work. The catch is that accurately defining those clusters is a harder problem than the cooccurrence analysis you are trying to facilitate in the first place.
Down-sampling each user separately is actually a form of this stratification strategy. The difference is that we already know which users have a large number of data points, so we can do the stratification accurately and very cheaply. (A small sketch of this per-user capping is below the quoted message.)

On Wed, Aug 31, 2011 at 1:31 AM, Lance Norskog <[email protected]> wrote:
> "If you have a document (user) and a word (item), then you
> have a joint probability that any given interaction will be between this
> document and word. We pretend in this case that each interaction is
> independent of every other which is patently not true, but very helpful."
>
> So if you subsample randomly to trim the data sizes, is it worth
> deliberately breaking correlations? Let's say that in a user/item dataset,
> almost all of the user/rating rows are in clusters of two or three
> according to the timestamp. Instead of just a random subsample, is it
> worth removing 1 rating from each cluster? This would strengthen the
> Bayesian assumption.
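To make the per-user capping concrete, here is a minimal sketch in plain Java (class and method names are hypothetical, not anything from Mahout): each user's interaction list is capped at a fixed maximum by keeping a uniform random subset, and lightly represented users are left untouched.

import java.util.*;

// Sketch of per-user down-sampling as stratification by user:
// cap each user's interactions at maxPerUser by uniform random selection.
public class PerUserDownSampler {
  public static Map<Long, List<Long>> downSample(Map<Long, List<Long>> interactionsByUser,
                                                 int maxPerUser, Random rng) {
    Map<Long, List<Long>> result = new HashMap<>();
    for (Map.Entry<Long, List<Long>> e : interactionsByUser.entrySet()) {
      List<Long> items = new ArrayList<>(e.getValue());
      if (items.size() > maxPerUser) {
        // Heavily represented user: shuffle and keep only the first maxPerUser items.
        Collections.shuffle(items, rng);
        items = new ArrayList<>(items.subList(0, maxPerUser));
      }
      result.put(e.getKey(), items);
    }
    return result;
  }
}

Because the cap is applied per user, the stratification comes for free from the data layout: no clustering step is needed, just a count of how many interactions each user has.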
