Thanks for the idea. This looks like a bug:

1) Setting --maxDFPercent to 100 has no effect.
2) Setting --maxDFPercent to 1 000 000 000 makes the TFIDF vectors OK.
seq2sparse cuts terms with DF > maxDFPercent. So maxDFPercent is not a percentage; it behaves as an absolute DF threshold.

Pavel

01.08.12 20:46, "Robin Anil" <[email protected]> wrote:

>Tfidf job is where the document frequency pruning is applied. Try
>increasing maxDFPercent to 100 %
>
>On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel <[email protected]> wrote:
>
>> Hello!
>>
>> I have trouble running the example "seq2sparse" with TFIDF weights. My TF
>> vectors are Ok, while TFIDF vectors are 10 times smaller. Looks like
>> seq2sparse cuts my terms during TFxIDF step. Document1 in TF vector has 20
>> terms, while Document1 in TFIDF vector has only 2 terms. What is wrong?
>> I spent 2 days finding the answer and configuring seq2sparse parameters ((
>>
>> Thanks in advance!
>>
>> mahout seq2sparse -ow \
>>   -chunk 512 \
>>   --maxDFPercent 90 \
>>   --maxNGramSize 1 \
>>   --numReducers 128 \
>>   --minSupport 150 \
>>   -i --- \
>>   -o --- \
>>   -wt tfidf \
>>   --namedVector \
>>   -a org.apache.lucene.analysis.WhitespaceAnalyzer
>>
>> Pavel
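
To illustrate the behaviour reported at the top of this message, here is a rough Python sketch of the two ways the threshold could be interpreted. It is not Mahout's code: the function names, the toy document frequencies, and the corpus size of 1000 are all made up for illustration. It only shows why a value of 90 or 100 would remove most terms if it were compared directly against raw DF counts, and why a huge value such as 1 000 000 000 would keep everything.

```python
# Toy illustration only; names and numbers are hypothetical, not taken from Mahout.

def prune_by_percent(df, num_docs, max_df_percent):
    """Keep terms whose DF is at most max_df_percent percent of num_docs."""
    limit = num_docs * max_df_percent / 100.0
    return {term for term, count in df.items() if count <= limit}


def prune_by_absolute(df, max_df):
    """Keep terms whose raw DF count is at most max_df (DF > max_df is cut)."""
    return {term for term, count in df.items() if count <= max_df}


# Hypothetical document frequencies in a corpus of 1000 documents.
df = {"the": 950, "mahout": 400, "vector": 120, "tfidf": 40}
num_docs = 1000

# Intended reading: 90 means "90 % of documents", so only "the" is dropped.
print(sorted(prune_by_percent(df, num_docs, 90)))

# Observed reading: 90 is compared with raw counts, so almost everything is dropped.
print(sorted(prune_by_absolute(df, 90)))

# Raising the value to 100 changes nothing under the absolute reading.
print(sorted(prune_by_absolute(df, 100)))

# A very large value keeps every term.
print(sorted(prune_by_absolute(df, 1_000_000_000)))
```

Under the absolute reading, the results for 90 and 100 are identical, which matches point 1 above, and the last call keeps all terms, which matches point 2.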
