Mathematically speaking, random sampling is just fine.  Stratifying based on
various criteria can help avoid loss of accuracy, so if you had several
clusters then down-sampling the heavily represented clusters might work, but
accurately defining those clusters is harder than the cooccurrence analysis
that you are trying to facilitate.

Down-sampling each user separately is actually a form of this stratification
strategy.  The difference is that we already know which users have a large
number of data points, so we can do the stratification accurately and very
cheaply.
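
Here is a minimal sketch of the per-user down sampling I mean, assuming the
interactions are already grouped by user id.  The class and parameter names
here are just illustrative, not anything in Mahout:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch, not Mahout API: cap each user's interaction list at
// maxPerUser by keeping a uniform random subset.  Users under the cap are
// kept whole, so only the heavy users get down-sampled.
public final class PerUserDownSampler {
  public static <T> Map<Long, List<T>> downSample(Map<Long, List<T>> byUser,
                                                  int maxPerUser, Random rng) {
    Map<Long, List<T>> out = new HashMap<Long, List<T>>();
    for (Map.Entry<Long, List<T>> e : byUser.entrySet()) {
      List<T> items = new ArrayList<T>(e.getValue());
      if (items.size() > maxPerUser) {
        Collections.shuffle(items, rng);                  // uniform random order
        items = new ArrayList<T>(items.subList(0, maxPerUser));
      }
      out.put(e.getKey(), items);
    }
    return out;
  }
}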

On Wed, Aug 31, 2011 at 1:31 AM, Lance Norskog <[email protected]> wrote:

> "If you have a document (user) and a word (item), then you
> have a joint probability that any given interaction will be between this
> document and word.  We pretend in this case that each interaction is
> independent of every other which is patently not true, but very helpful."
>
> So if you subsample randomly to trim the data sizes, is it worth
> deliberately breaking correlations? Let's say that in a user/item dataset,
> almost all of the user/rating rows are in clusters of two or three
> according to the timestamp. Instead of just a random subsample, is it worth
> removing 1 rating from each cluster? This would strengthen the Bayesian
> assumption.
>
>
