If you are using rootLLR, then a threshold of 10 represents (roughly) 10 standard deviations. This is a big threshold.
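To make the scale concrete, here is a minimal sketch of LLR and root LLR for a 2x2 co-occurrence table, using the usual entropy-based form of the G² statistic. The function names are mine, for illustration only, and are not any particular library's API. k11 is the count of A and B together, k12 is A without B, k21 is B without A, and k22 is neither.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy: N*log(N) - sum(k*log(k)), where N = sum(counts)
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb floating-point round-off
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    # Signed square root of LLR; negative when A and B co-occur
    # less often than independence would predict
    root = math.sqrt(llr(k11, k12, k21, k22))
    # Cross-multiplied form of k11/(k11+k12) < k21/(k21+k22),
    # which avoids dividing by zero on empty rows
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        root = -root
    return root
```

Under independence, LLR is asymptotically chi-squared with one degree of freedom, so root LLR behaves roughly like a standard normal score. A threshold of 10 on raw LLR is therefore only about 3.2 on the root LLR scale, while a threshold of 10 on root LLR corresponds to a raw LLR of 100.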
What I generally do is threshold to a level that either leaves the resulting pairs composed of a weak majority of plausible terms (if I understand the domain) or simply drives the data to a chosen level of sparsity. Both seem to work about the same.

I also pretty much always down-sample the number of items per user. This has two motivations. One is simply pragmatics. The other is that it decreases the influence of bots and other pathological users.

On Mon, Feb 11, 2013 at 2:57 AM, Johannes Schulte <[email protected]> wrote:

> I am also thresholding the counts with LLR. Every time I do this I take a
> threshold of 10 since I loosely remember it being about the 99% margin of
> confidence in the chi square distribution. I got no clue however if anybody
> wants something like 99% for recommendations or if 50% might be a better
> value. What's your experience on that?
>
> And do you apply a limit on the total number of docs per term, since there
> could be big boolean queries tearing down the performance?
>
> Thanks for all the input!
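The per-user down-sampling mentioned in the reply could look something like the sketch below. The reply does not say how the sampling is done, so uniform random sampling, the function name, and the cap value are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def downsample_per_user(interactions, max_items, seed=0):
    """Keep at most max_items (user, item) pairs per user.

    Uniform random sampling is an assumption here; the cap itself
    is a tuning choice and not a value taken from the thread.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)

    sampled = []
    for user, items in by_user.items():
        if len(items) > max_items:
            # Sample without replacement down to the cap
            items = rng.sample(items, max_items)
        sampled.extend((user, item) for item in items)
    return sampled
```

A user with thousands of interactions (a bot, say) then contributes no more pairs than the cap allows, while ordinary users are left untouched.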
