Thanks for this idea.

This looks like a bug:
1) Setting --maxDFPercent to 100 has no effect.
2) Setting --maxDFPercent to 1 000 000 000 makes the TFIDF vectors OK.

seq2sparse cuts terms with DF > maxDFPercent. So maxDFPercent is not
treated as a percentage; it is applied as an absolute DF value.
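
To illustrate the difference between the two readings of the threshold,
here is a toy sketch in Python. It is not Mahout code: the function names,
the sample DF table and the corpus size are all made up for illustration.

```python
# Toy illustration only, not Mahout's implementation.
# df maps each term to its document frequency (hypothetical numbers).

def prune_by_percent(df, num_docs, max_df_percent):
    """Keep terms that occur in at most max_df_percent % of the documents."""
    limit = num_docs * max_df_percent / 100.0
    return {term for term, count in df.items() if count <= limit}

def prune_by_absolute(df, max_df):
    """Keep terms whose raw document frequency is at most max_df."""
    return {term for term, count in df.items() if count <= max_df}

df = {"the": 950, "mahout": 120, "tfidf": 40, "rare": 3}
num_docs = 1000  # assumed corpus size

# Read as a percentage, 90 drops only terms present in more than 900 documents.
print(sorted(prune_by_percent(df, num_docs, 90)))  # ['mahout', 'rare', 'tfidf']

# Read as an absolute count, 90 drops every term present in more than 90 documents.
print(sorted(prune_by_absolute(df, 90)))           # ['rare', 'tfidf']
```

With the absolute reading, raising the value from 90 to 100 changes almost
nothing, while a huge value such as 1 000 000 000 keeps every term. That
matches what I see.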


Pavel




On 01.08.12 20:46, "Robin Anil" <[email protected]> wrote:

>The TFIDF job is where the document frequency pruning is applied. Try
>increasing maxDFPercent to 100%.
>
>On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel
><[email protected]> wrote:
>
>> Hello!
>>
>> I have trouble running the example "seq2sparse" with TFIDF weights. My TF
>> vectors are OK, while the TFIDF vectors are 10 times smaller. It looks like
>> seq2sparse cuts my terms during the TFxIDF step. Document1 in the TF vector
>> has 20 terms, while Document1 in the TFIDF vector has only 2 terms. What is
>> wrong? I spent 2 days looking for the answer and configuring seq2sparse
>> parameters ((
>>
>> Thanks in advance!
>>
>> mahout seq2sparse -ow  \
>> -chunk 512 \
>> --maxDFPercent 90 \
>> --maxNGramSize 1 \
>> --numReducers 128 \
>> --minSupport 150 \
>> -i --- \
>> -o --- \
>> -wt tfidf \
>> --namedVector \
>> -a org.apache.lucene.analysis.WhitespaceAnalyzer
>>
>> Pavel
>>
>>
