Yes, this is a known bug. Grant, I had an open question for you on this
one: what do you think about the fix?

On Wed, Jun 27, 2012 at 3:11 PM, Yuval Feinstein <[email protected]> wrote:
> I believe I found the problem.
> As this contradicts Mahout's documentation, it might be a bug in
> Mahout 0.6 or the documentation should be changed.
> The max document frequency percentage parameter in seq2sparse (-x or
> --maxDFPercent) does not behave like a percent,
> but rather like an absolute df count.
> The default value is still 99.
> For example, assume we have 1000 documents overall, with the three
> following terms:
> pen - whose df=80
> pencil - whose df=120
> feather - whose df=999.
> Assume that the -x parameter has the default value of 99.
> According to the documentation, only "feather" should be filtered out.
> But my experiments show that both "pencil" and "feather" are filtered out.
> If we treat the -x parameter as an absolute number, and set it to 990,
> we get the expected behavior of filtering only "feather".
> So I suggest that you set the parameter to the total number of
> documents in your collection, and see if that helps.
> Cheers,
> Yuval
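
To make Yuval's example concrete, here is a minimal sketch of the two
interpretations of -x. This is not Mahout's code: the function names are
hypothetical, and the inclusive comparison (df <= threshold) is an
assumption that makes no difference for these numbers.

```python
# Illustration only: hypothetical helpers, not Mahout's implementation.

def keep_by_percent(df, num_docs, max_df_percent):
    """Documented behavior: keep terms whose df is at most
    max_df_percent percent of the total number of documents."""
    max_df = num_docs * max_df_percent / 100.0
    return {term for term, count in df.items() if count <= max_df}

def keep_by_count(df, max_df):
    """Observed behavior: the value is compared against the raw df count."""
    return {term for term, count in df.items() if count <= max_df}

# The numbers from the example above.
df = {"pen": 80, "pencil": 120, "feather": 999}
NUM_DOCS = 1000

documented = keep_by_percent(df, NUM_DOCS, 99)  # threshold is 990 documents
observed = keep_by_count(df, 99)                # threshold is 99 documents
workaround = keep_by_count(df, 990)             # -x set to an absolute count

print(sorted(documented))
print(sorted(observed))
print(sorted(workaround))
```

With -x 99 read as a percentage, only "feather" is dropped. Read as a
count, "pencil" is dropped as well, which matches what Yuval saw. Passing
990 as a count gives the same result as the documented 99 percent.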
