I believe I found the problem.
Since the behavior contradicts Mahout's documentation, either this is a
bug in Mahout 0.6 or the documentation should be changed.
The max document frequency percentage parameter in seq2sparse (-x or
--maxDFPercent) does not behave like a percent,
but rather like an absolute df count.
The default value is still 99.
For example, assume we have 1000 documents overall, with the three
following terms:
pen - whose df=80
pencil - whose df=120
feather - whose df=999
Assume that the -x parameter has the default value of 99.
According to the documentation, only "feather" should be filtered out.
But my experiments show that both "pencil" and "feather" are filtered out.
If we treat the -x parameter as an absolute number, and set it to 990,
we get the expected behavior of filtering only "feather".
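
To make the difference concrete, here is a small Python sketch of the two
interpretations, using the numbers from the example above. It is not
Mahout's code: the function names are made up for illustration, and the
strict "greater than" comparison is my assumption about how the cutoff
is applied.

```python
# Toy model of the two possible readings of the -x / --maxDFPercent value.
# This is NOT Mahout's implementation; names and the ">" comparison are
# assumptions made for illustration only.

NUM_DOCS = 1000
DF = {"pen": 80, "pencil": 120, "feather": 999}


def filtered_as_percent(df, num_docs, max_df_percent):
    """Terms removed if the value is a percentage of the collection size."""
    threshold = num_docs * max_df_percent / 100.0
    return sorted(term for term, count in df.items() if count > threshold)


def filtered_as_absolute(df, max_df):
    """Terms removed if the value is compared directly with the raw df."""
    return sorted(term for term, count in df.items() if count > max_df)


if __name__ == "__main__":
    # Documented behavior with the default of 99.
    print(filtered_as_percent(DF, NUM_DOCS, 99))   # ['feather']
    # Behavior I observed with the default of 99.
    print(filtered_as_absolute(DF, 99))            # ['feather', 'pencil']
    # Treating the value as an absolute count and setting it to 990.
    print(filtered_as_absolute(DF, 990))           # ['feather']
```

The second call reproduces what I see in my experiments, and the third
one matches the result I get when passing 990.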
So I suggest that you set the parameter to the total number of
documents in your collection, and see if that helps.
Cheers,
Yuval
