Yes, this is a known bug. Grant, I had an open question for you on this one -- what do you think about the fix?

On Wed, Jun 27, 2012 at 3:11 PM, Yuval Feinstein <[email protected]> wrote:
> I believe I found the problem.
> As this contradicts Mahout's documentation, it might be a bug in
> Mahout 0.6 or the documentation should be changed.
> The max document frequency percentage parameter in seq2sparse (-x or
> --maxDFPercent) does not behave like a percent,
> but rather like an absolute df count.
> The default value is still 99.
> For example, assume we have 1000 documents overall, with the three
> following terms:
> pen - whose df=80
> pencil - whose df=120
> feather - whose df=999.
> Assume that the -x parameter has the default value of 99.
> According to the documentation, only "feather" should be filtered out.
> But my experiments show that both "pencil" and "feather" are filtered out.
> If we treat the -x parameter as an absolute number, and set it to 990,
> we get the expected behavior of filtering only "feather".
> So I suggest that you set the parameter to the total number of
> documents in your collection, and see if that helps.
> Cheers,
> Yuval
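
To make the difference concrete, here is a rough sketch of the two interpretations applied to Yuval's numbers. This is not Mahout code: the helper names are hypothetical, and the inclusive comparison (df <= threshold) is my assumption, since the thread does not say how the boundary case is handled.

```python
# Illustrative sketch only; these helpers are hypothetical, not Mahout's API.

def keep_by_percent(doc_freq, num_docs, max_df_percent):
    """Documented behavior: keep terms whose df is at most
    max_df_percent percent of the total document count."""
    threshold = num_docs * max_df_percent / 100
    return {term for term, df in doc_freq.items() if df <= threshold}


def keep_by_absolute(doc_freq, max_df):
    """Reported behavior: the same parameter compared directly
    against the raw df count."""
    return {term for term, df in doc_freq.items() if df <= max_df}


# Numbers from the example in the quoted message.
doc_freq = {"pen": 80, "pencil": 120, "feather": 999}
num_docs = 1000

documented = keep_by_percent(doc_freq, num_docs, 99)  # threshold is 990 docs
observed = keep_by_absolute(doc_freq, 99)             # threshold is 99 docs
workaround = keep_by_absolute(doc_freq, 990)          # -x set to 990

print("documented:", sorted(documented))
print("observed:  ", sorted(observed))
print("workaround:", sorted(workaround))
```

Under these assumptions the percent reading drops only "feather", the absolute reading with 99 drops both "pencil" and "feather", and the absolute reading with 990 matches the documented result again, which is the workaround described above.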
