The counters from the last failed attempt are below. The current attempt
is at 100% map where it usually times out. There are no counters
available on that job yet.
Counters for attempt_201207111741_0003_m_000000_0
------------------------------------------------------------------------
*FileSystemCounters*
FILE_BYTES_READ 122,632,977,161
HDFS_BYTES_READ 65,633,316
FILE_BYTES_WRITTEN 185,546,210,152
*File Input Format Counters*
Bytes Read 65,633,183
*org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters*
PRUNED_COOCCURRENCES 0
COOCCURRENCES 7,377,108,545
*Map-Reduce Framework*
Map output materialized bytes 0
Combine output records 4,434,644
Map input records 76,150
Physical memory (bytes) snapshot 183,169,024
Spilled Records 13,303,899
Map output bytes 79,238,005,171
CPU time spent (ms) 30,767,320
Total committed heap usage (bytes) 65,732,608
Virtual memory (bytes) snapshot 3,817,648,128
Combine input records 6,112,626
Map output records 6,111,463
SPLIT_RAW_BYTES 119
On 7/12/12 8:04 AM, Sebastian Schelter wrote:
Sorry, I overlooked that it's more than one machine. Could you provide
the values for the counters from RowSimilarityJob (ROWS,
COOCCURRENCES, PRUNED_COOCCURRENCES)?
Best,
Sebastian
2012/7/12 Pat Ferrel <[email protected]>:
Thanks, actually there are two machines. I am testing before spending on
AWS. It's failing this test case.
BTW I ran the same setup with 150,000 docs and 250,000 terms with a much
lower timeout (30000000) and all worked fine. I was using 0.6 at the time,
and I'm not sure 0.8 has ever completed a rowsimilarity of any size. Small
runs work fine on my laptop.
I suspect some problem other than simple performance. In any case, in a
perfect world, isn't the code supposed to check in often enough that the
cluster config doesn't need to be tweaked for a specific job?
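On the checking-in point: Hadoop only kills a task that neither emits output
nor reports progress within mapred.task.timeout ms, so a long-running mapper
is expected to heartbeat from inside its loops. A minimal sketch of that
pattern (this is not Mahout's actual code; the Context class below is a
stand-in for Hadoop's Mapper.Context, and the row size and heartbeat
interval are made-up numbers):

```java
// Sketch of progress heartbeating in a long-running map() body.
public class ProgressSketch {
    // Minimal stand-in for org.apache.hadoop.mapreduce.Mapper.Context;
    // real code would call the Hadoop context's progress() method.
    static class Context {
        int progressCalls = 0;
        void progress() { progressCalls++; }  // resets the task's timeout clock
    }

    // One huge input row: report progress periodically so the TaskTracker
    // doesn't conclude the task is hung and kill it.
    static void processRow(Context ctx, int nonZeroEntries) {
        for (int i = 0; i < nonZeroEntries; i++) {
            // ... emit cooccurrences for entry i ...
            if (i % 10_000 == 0) {
                ctx.progress();  // heartbeat every 10k entries
            }
        }
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        processRow(ctx, 100_000);
        System.out.println(ctx.progressCalls);  // 10 heartbeats for 100k entries
    }
}
```

If a single dense row of cooccurrences takes longer than the timeout to
process without any such call, the task dies at "100% map" exactly as
described above.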
It may be some problem of mine, of course. I see no obvious Hadoop or Mahout
errors, but there are many places to look.
With a 100 minute timeout I am currently at the pause between map and
reduce. If it fails would you like any specific logs?
On 7/11/12 4:00 PM, Sebastian Schelter wrote:
To be honest, I don't think it makes a lot of sense to test a Hadoop
job on a single machine. It's pretty obvious that you will get
terrible performance.
2012/7/12 Pat Ferrel <[email protected]>:
BTW the timeout is 1800 but the task in total runs over 9 hours before each
failure. This causes the job to take (after three tries) 27 hrs to
completely fail. Oh, bother...
The timeout seems to hit during the last map, when the mappers have reached
100% but are still running. Maybe some kind of cleanup is happening?
The first reducer is still "pending". The reducer never gets a chance to
start.
12/07/11 11:09:45 INFO mapred.JobClient: map 92% reduce 0%
12/07/11 11:11:06 INFO mapred.JobClient: map 93% reduce 0%
12/07/11 11:12:51 INFO mapred.JobClient: map 94% reduce 0%
12/07/11 11:15:22 INFO mapred.JobClient: map 95% reduce 0%
12/07/11 11:18:43 INFO mapred.JobClient: map 96% reduce 0%
12/07/11 11:24:32 INFO mapred.JobClient: map 97% reduce 0%
12/07/11 11:27:40 INFO mapred.JobClient: map 98% reduce 0%
12/07/11 11:30:53 INFO mapred.JobClient: map 99% reduce 0%
12/07/11 11:36:35 INFO mapred.JobClient: map 100% reduce 0%
---after a very long wait (9hrs or so) insert fail here--->
2-machine cluster, 8 cores and 8G RAM per machine; 32,000 docs, 76,000 terms.
Any other info you need please ask.
I'm about to try cranking the timeout up to a couple of hours, but I
suspect there is something else going on here.
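For reference, the timeout can be raised for a single job instead of
cluster-wide. A sketch, assuming this Hadoop version honors the
mapred.task.timeout property (in milliseconds) and that the job is launched
through the mahout driver; the value and the trailing options are
placeholders, not taken from this thread:

```shell
# Raise the per-task timeout to two hours (7,200,000 ms) for this run only,
# rather than editing mapred-site.xml on every node.
# Generic -D properties must come before the job's own options.
mahout rowsimilarity \
  -Dmapred.task.timeout=7200000 \
  ... # the job's usual -i/-o options go here
```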
On 7/11/12 10:35 AM, Pat Ferrel wrote:
I have a custom Lucene stemming analyzer that filters out stop words, and I
use the following seq2sparse. The -x 40 is the only other thing that
affects tossing frequent terms; as I understand it, it tosses any term
that appears in over 40% of the docs.
mahout seq2sparse \
-i b2/seqfiles/ \
-o b2/vectors/ \
-ow \
-chunk 2000 \
-x 40 \
-seq \
-n 2 \
-nv \
-a com.finderbots.analyzers.LuceneStemmingAnalyzer
On 7/11/12 9:18 AM, Sebastian Schelter wrote:
Hi Pat,
have you removed highly frequent terms before launching the rowsimilarity
job?
On 11.07.2012 18:14, Pat Ferrel wrote:
I've been trying to get a rowsimilarity job to complete. It continues to
time out on a RowSimilarityJob-CooccurrencesMapper-Reducer task, so I've
upped the timeout to 30 minutes now. There are no errors in the logs that
I can see, and no other task I've tried is acting like this. Is this
expected? Shouldn't the task check in more often?
It's doing 34,000 docs with 40 sim docs each on 8 cores, so it is a bit
slow anyway; still, I shouldn't have to turn up the timeout so high,
should I?