Re: MergeSegments - map reduce thread death

2009-11-05 Thread fadzi
I tried this once, but before I knew it my log file was approaching a gig
within an hour or so!


 I suggest maybe turning the debug logs on for Hadoop before you do the
 next crawl... you can do this by editing log4j.properties
 and changing the rootLogger from INFO to DEBUG.
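
 For reference, a minimal sketch of that change in conf/log4j.properties,
 assuming the stock Nutch 1.0 file where the root logger uses the DRFA
 appender (the appender name may differ in your copy):

   # conf/log4j.properties
   # root logger raised from INFO to DEBUG
   log4j.rootLogger=DEBUG,DRFA
   # or, to keep the noise down, raise only the Hadoop packages:
   log4j.logger.org.apache.hadoop=DEBUG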

 On Thu, Nov 5, 2009 at 12:37 AM, Andrzej Bialecki a...@getopt.org wrote:
 fa...@butterflycluster.net wrote:

 Hi there,

 Seems I have some serious problems with Hadoop during map-reduce for
 MergeSegments.

 I am out of ideas on this. Any suggestions will be quite welcome.

 Here is my set up:

 RAM: 4G
 JVM HEAP: 2G
 mapred.child.java.opts = 1024M
 hadoop-0.19.1-core.jar
 nutch-1.0
 Xen VPS.
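
 For reference, on Hadoop 0.19 that mapred.child.java.opts value is normally
 set in conf/hadoop-site.xml, roughly as sketched below. Note that under the
 LocalJobRunner seen in the trace further down, tasks run inside the client
 JVM, so this setting does not apply there and the client's own heap is what
 matters.

   <!-- conf/hadoop-site.xml (Hadoop 0.19): heap for each map/reduce child JVM -->
   <property>
     <name>mapred.child.java.opts</name>
     <value>-Xmx1024m</value>
   </property>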

 After running a recrawl a few times, I end up with one segment that is
 relatively large compared to the new ones last generated. Here is my
 segments structure when things blow up after a (5th) recrawl:

 segment1 = 674Megs (after several recrawls)
 segment2 = 580k (last recrawl)
 segment3 = 568k (last recrawl)
 segment4 = 584k (last recrawl)
 ..
 segment8 = 560k (last recrawl)
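
 For context, mergeSegments here is Nutch's SegmentMerger; a typical
 Nutch 1.0 invocation looks roughly like this (directory names are
 examples, not the poster's actual paths):

   # merge all per-round segments into a single new segment
   bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
   # then swap the merged segment in place of the old ones
   rm -rf crawl/segments
   mv crawl/MERGEDsegments crawl/segments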

 When I run mergeSegments, everything goes well until we get up to 90% of
 the map-reduce and then we get a thread death; here is a stack trace:

 2009-11-05 10:54:16,874 INFO  [org.apache.hadoop.mapred.LocalJobRunner] reduce  reduce
 2009-11-05 10:54:29,794 INFO  [org.apache.hadoop.mapred.LocalJobRunner] reduce  reduce
 2009-11-05 10:54:55,194 INFO  [org.apache.hadoop.mapred.LocalJobRunner] reduce  reduce
 2009-11-05 10:57:25,844 WARN  [org.apache.hadoop.mapred.LocalJobRunner] job_local_0001
 java.lang.ThreadDeath
        at java.lang.Thread.stop(Thread.java:715)
        at org.apache.hadoop.mapred.LocalJobRunner.killJob(LocalJobRunner.java:310)
        at org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:315)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1239)
        at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:620)
        at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:665)

 Any suggestions, please?

 This is a high-level exception that doesn't indicate the nature of the
 original problem. Is there any other information in hadoop.log or in the
 task logs (logs/userlogs)?

 In my experience this sort of thing happens rarely for the relatively
 small dataset that you have, so you are lucky ;) It could be related to a
 number of issues, like running this under Xen, which imposes some limits
 and slowdowns, or you may have a low number of file descriptors
 (ulimit -n), or faulty RAM, or an overheated CPU ...
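
 A quick sketch of checking (and, for the current shell, raising) the
 descriptor limit mentioned above:

   # show the per-process open-file limit for the user running Nutch
   ulimit -n
   # raise it for this shell; make it permanent via /etc/security/limits.conf
   ulimit -n 65536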

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com







Re: MergeSegments - map reduce thread death

2009-11-05 Thread fadzi
Hi there,

We tried a few things around this; one suggestion was to run it on a
local machine, so I pulled one of our decent servers and got to work...
but surprisingly we got the same error on a local machine!

So it seems the hardware (VPS/local) wasn't the culprit... probably the
data, or the code.

So we decided to discard the db and generate a new one - things seem to be
working normally so far... but let's see when the db becomes larger.

Having said that - there were a few things we found out, and we need
clarification on whether they were a cause of problems or not.

Here is the scenario, in sequence of execution:

Step 1 setup:

* first crawl was done using bin/nutch crawl (see the sketch below)
- urls = 1500
- depth = 10
- topN = 500
(so it should do all by round 3, right? what happens at rounds 4 to 10?)
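
For reference, the step 1 crawl above corresponds to an invocation roughly
like this (the urls and crawl directory names are examples):

  # one-shot crawl: inject, then generate/fetch/update for 10 rounds, 500 URLs per round
  bin/nutch crawl urls -dir crawl -depth 10 -topN 500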

Steps 2 to 5 setup:

* recrawl (repeat)
- topN = 1
- depth = 10
- db.default.fetch.interval = 30 (doesn't seem to do anything)
- generate.update.crawldb = false (same fetchlist was being generated)
- injected seed urls again (bad! we didn't realise this was happening, but
what's the effect of doing this?)
- fetch
- update db
(this step above was an effort to get an incremental crawl; see the config
sketch below)
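
For reference, the two properties mentioned above go in conf/nutch-site.xml,
roughly as sketched here. Worth double-checking: in Nutch 1.0 the active
interval property is db.fetch.interval.default (in seconds), while
db.default.fetch.interval is the older name, which may be why it appeared to
do nothing; and generate.update.crawldb must be true for generated URLs to be
marked so the next generate skips them.

  <!-- conf/nutch-site.xml -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value> <!-- seconds between re-fetches; 2592000 = 30 days -->
  </property>
  <property>
    <name>generate.update.crawldb</name>
    <value>true</value> <!-- mark generated URLs so later generate runs skip them -->
  </property>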

Step 6:
* merge segments, invertlinks, indexes...
- at this stage map-reduce just died during MergeSegments, with an
out-of-heap-memory exception (see the heap sketch below).
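
As a sketch of one way to get more heap for that step: assuming the stock
bin/nutch script, which honours a NUTCH_HEAPSIZE environment variable (in MB),
and given that the LocalJobRunner runs the reduce inside the client JVM,
something like this may help (paths are examples):

  # give the client JVM (which runs the local map-reduce tasks) a bigger heap
  export NUTCH_HEAPSIZE=2000
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments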

The assumption was that with a seed URL list of 1500, Nutch would generate
more NEW urls from the crawldb based on the outlinks it found - is this
true? Because it did not seem to be the case.

Also, what is the effect of running a recrawl using a topN larger than
what Nutch can actually generate?
