This sounds a lot like a bug that was fixed by a patch some time ago. Grant, I think it was something I had wanted you to double-check; I'm not sure whether you had a look. If it's the same issue, I think it has been fixed.
On Thu, Aug 2, 2012 at 8:44 AM, Abramov Pavel <[email protected]> wrote:

> Thanks for this idea.
>
> Looks like a bug:
> 1) Setting --maxDFPercent to 100 has no effect
> 2) Setting --maxDFPercent to 1 000 000 000 makes TFIDF vectors Ok.
>
> seq2sparse cuts terms with DF > maxDFPercent. So maxDFPercent is not a
> percentage. maxDFPercent is absolute value.
>
>
> Pavel
>
>
> 01.08.12 20:46, "Robin Anil" <[email protected]> wrote:
>
> >Tfidf job is where the document frequency pruning is applied. Try
> >increasing maxDFPercent to 100 %
> >
> >On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel
> ><[email protected]> wrote:
> >
> >> Hello!
> >>
> >> I have trouble running the example "seq2sparse" with TFIDF weights. My TF
> >> vectors are Ok, while TFIDF vectors are 10 times smaller. Looks like
> >> seq2sparse cuts my terms during TFxIDF step. Document1 in TF vector has 20
> >> terms, while Document1 in TFIDF vector
> >> has only 2 terms. What is wrong? I spent 2 days finding the answer and
> >> configuring seq2sparse parameters ((
> >>
> >> Thanks in advance!
> >>
> >> mahout seq2sparse -ow \
> >> -chunk 512 \
> >> --maxDFPercent 90 \
> >> --maxNGramSize 1 \
> >> --numReducers 128 \
> >> --minSupport 150 \
> >> -i --- \
> >> -o --- \
> >> -wt tfidf \
> >> --namedVector \
> >> -a org.apache.lucene.analysis.WhitespaceAnalyzer
> >>
> >> Pavel
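To make the behaviour Pavel describes concrete, here is a minimal sketch of the two ways a document-frequency cutoff could be read. This is not Mahout's code; the function names and the toy corpus are made up for illustration, and the only assumption taken from the thread is that terms with DF above the threshold are dropped.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # a term counts once per document
    return df


def prune_by_percent(docs, max_df_percent):
    """Hypothetical 'percentage' reading: drop terms that occur in more
    than max_df_percent percent of the documents."""
    df = document_frequencies(docs)
    max_df = len(docs) * max_df_percent / 100.0
    return [[t for t in terms if df[t] <= max_df] for terms in docs]


def prune_by_absolute(docs, max_df):
    """Hypothetical 'absolute' reading: drop terms whose raw document
    count exceeds max_df, regardless of corpus size."""
    df = document_frequencies(docs)
    return [[t for t in terms if df[t] <= max_df] for terms in docs]


if __name__ == "__main__":
    # Toy corpus: 200 documents, each with one shared and one unique term.
    docs = [["common", f"rare{i}"] for i in range(200)]
    print(prune_by_percent(docs, 100)[0])
    print(prune_by_absolute(docs, 100)[0])
    print(prune_by_absolute(docs, 1_000_000_000)[0])
```

Under the percentage reading, a value of 100 keeps every term. Under the absolute reading, 100 still removes any term that appears in more than 100 documents, and only a very large value keeps everything, which would match the symptoms reported above.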
