Re: RowSimilarity error

Pat Ferrel Wed, 11 Jul 2012 10:36:10 -0700

I have a custom Lucene stemming analyzer that filters out stop words, and I use the following seq2sparse command. The -x 40 is the only other thing that affects tossing frequent terms and, as I understand things, it tosses any term that appears in over 40% of the docs.


mahout seq2sparse \
    -i b2/seqfiles/ \
    -o b2/vectors/ \
    -ow \
    -chunk 2000 \
    -x 40 \
    -seq \
    -n 2 \
    -nv \
    -a com.finderbots.analyzers.LuceneStemmingAnalyzer
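
To make my reading of -x 40 concrete, here is a minimal Python sketch of pruning by maximum document frequency. It is not Mahout's code: the function name prune_by_max_df and the toy documents are made up for illustration, and treating the cutoff as strictly "more than 40%" is my assumption about the boundary.

```python
from collections import Counter


def prune_by_max_df(docs, max_df_percent):
    """Drop terms that occur in more than max_df_percent of the documents.

    Hypothetical helper for illustration only; the strict '>' comparison
    is an assumption about how the cutoff is applied.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing the term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    too_frequent = {
        term for term, count in df.items()
        if 100.0 * count / n_docs > max_df_percent
    }
    pruned = [[term for term in doc if term not in too_frequent] for doc in docs]
    return pruned, too_frequent


# Toy corpus (made up): "mahout" appears in 3 of 5 docs (60%),
# every other term appears in at most 2 of 5 docs (40%).
docs = [
    ["mahout", "vector", "job"],
    ["mahout", "lucene", "analyzer"],
    ["mahout", "timeout", "job"],
    ["hadoop", "timeout", "reducer"],
    ["hadoop", "lucene", "stemming"],
]
pruned, dropped = prune_by_max_df(docs, 40)
print(sorted(dropped))
```

With this toy data only "mahout" crosses the 40% line, so terms sitting at or just under the cutoff all survive.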



On 7/11/12 9:18 AM, Sebastian Schelter wrote:

Hi Pat,

have you removed highly frequent terms before launching the rowsimilarity job?
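
A rough sketch of why frequent terms matter here, assuming the cooccurrence step pairs up every two documents that share a term, so the work per term grows quadratically with its document frequency. The helper doc_pairs is hypothetical and only does the arithmetic; the 34,000 figure is the corpus size mentioned in the original message.

```python
def doc_pairs(doc_freq):
    """Number of distinct document pairs that share a term found in doc_freq docs."""
    return doc_freq * (doc_freq - 1) // 2


total_docs = 34_000  # corpus size from the original message
for percent in (1, 10, 40):
    doc_freq = total_docs * percent // 100
    print(f"{percent:>2}% of docs -> {doc_freq:>6} docs -> {doc_pairs(doc_freq):>12,} pairs")
```

Under that assumption, a single term found in 40% of the docs accounts for roughly 92 million pairs, against roughly 58 thousand for a term found in 1% of them.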

On 11.07.2012 18:14, Pat Ferrel wrote:

I've been trying to get a rowsimilarity job to complete. It keeps
timing out on a RowSimilarityJob-CooccurrencesMapper-Reducer task, so I've
upped the timeout to 30 minutes now. There are no errors in the logs
that I can see, and no other task I've tried is acting like this. Is this
expected? Shouldn't the task check in more often?

It's doing 34,000 docs with 40 similar docs each on 8 cores, so it is a bit
slow anyway. Still, I shouldn't have to turn up the timeout so high,
should I?
