Hi Sybille,

The threads spawned by the parser should be reclaimed once a page has been parsed. The parsing itself is not multi-threaded, so either something is preventing the threads from being deleted, or maybe, as the error suggests, you are running out of memory.
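To illustrate what "something preventing the threads from being deleted" can look like in practice, here is a hypothetical sketch (not the actual Nutch ParseUtil code) of the classic pattern: a fresh single-thread executor per document that is never shut down after a parse timeout, so every timed-out document leaves one live worker thread behind.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Hypothetical sketch of the suspected leak pattern -- NOT the actual
 * Nutch source. A fresh single-thread executor is created per document;
 * if it is never shut down after a parse timeout, each timed-out document
 * leaves one live worker thread behind, until the JVM eventually fails
 * with "unable to create new native thread".
 */
public class ParseLeakSketch {

    // Daemon threads only so this demo JVM can still exit; the leak is
    // visible in the thread count either way.
    static final ThreadFactory DAEMON = r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    };

    /** Leaky variant: the executor is abandoned after a timeout. */
    static void parseWithLeak(Runnable parser) {
        ExecutorService pool = Executors.newSingleThreadExecutor(DAEMON);
        Future<?> task = pool.submit(parser);
        try {
            task.get(50, TimeUnit.MILLISECONDS); // stand-in for parser.timeout
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            task.cancel(true);
            // BUG: missing pool.shutdownNow() -- the worker thread stays
            // alive as an idle core thread of the abandoned pool.
        }
    }

    /** Fixed variant: the pool is always shut down, reclaiming its thread. */
    static void parseFixed(Runnable parser) {
        ExecutorService pool = Executors.newSingleThreadExecutor(DAEMON);
        try {
            pool.submit(parser).get(50, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            // parse failed or timed out; nothing else to do in this sketch
        } finally {
            pool.shutdownNow(); // interrupts the worker and lets it die
        }
    }

    /** Runs `docs` fake parses and reports how many threads were left behind. */
    static long leakedThreads(boolean fixed, int docs) throws Exception {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Runnable hangingParser = () -> {
            try { Thread.sleep(60_000); } catch (InterruptedException e) { /* aborted */ }
        };
        long before = mx.getThreadCount();
        for (int i = 0; i < docs; i++) {
            if (fixed) parseFixed(hangingParser); else parseWithLeak(hangingParser);
        }
        Thread.sleep(500); // give interrupted workers time to terminate
        return mx.getThreadCount() - before;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("threads left behind without shutdown: " + leakedThreads(false, 20));
        System.out.println("threads left behind with shutdown:    " + leakedThreads(true, 20));
    }
}
```

With 20 simulated slow documents the leaky variant leaves roughly 20 extra threads alive, while the fixed variant returns to the baseline thread count, which matches the symptom of the thread count climbing with the number of parsed documents.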
Do you specify parser.timeout in nutch-site.xml? Are you using any custom HTMLParsingFilter? The number of docs should not affect the memory: the parser runs on one document after the other, so this would indicate a leak. There was a related issue not very long ago: https://issues.apache.org/jira/browse/NUTCH-1640. Can you patch your code accordingly, or use the trunk? I never got to the bottom of it, but I am wondering whether this would fix the issue.

Thanks

Julien

On 18 October 2013 14:32, Sybille Peters <[email protected]> wrote:

> Hello,
>
> Using the default crawl script (runtime/local/bin/crawl), the parser will
> crash trying to create a new thread after parsing slightly more than 5000
> documents.
>
> This only happens if the number of documents to crawl (generate -topN) is
> set to > 5000.
>
> Monitoring the number of threads created by the nutch java process: it
> increases to about 5700 before the crash occurs.
>
> I thought that the parser would not create that many threads in the first
> place. Is this a bug/misconfiguration? Is there any way to limit the
> number of threads explicitly for parsing?
>
> I found this thread, where it is recommended to decrease the number of urls
> (topN):
> http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
>
> Is this the only possible solution? Older nutch versions did not have this
> problem.
>
> Parameters:
> ---------------
> numSlaves=1
> numTasks=`expr $numSlaves \* 2`
> commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
> skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
>
> $bin/nutch parse $commonOptions $skipRecordsOptions $CRAWL_PATH/segments/$SEGMENT
>
> hadoop.log
> ----------------
>
> 2013-10-18 14:57:28,294 INFO parse.ParseSegment - Parsed (0ms): http://www....
> 2013-10-18 14:57:28,301 WARN mapred.LocalJobRunner - job_local613646134_0001
> java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:640)
>     at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
>     at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
>     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> -----------------
>
> Any help (especially information) is appreciated.
>
> Sybille

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
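For reference, the parser.timeout property Julien asks about is a standard Nutch setting, overridden in conf/nutch-site.xml. The value is in seconds, and -1 disables the time limit; the value 30 below is just an illustrative choice, not a recommendation. A minimal override would look like:

```
<!-- conf/nutch-site.xml: per-document parse timeout in seconds; -1 disables it -->
<property>
  <name>parser.timeout</name>
  <value>30</value>
</property>
```

Disabling the timeout (-1) can be a useful diagnostic here: if the thread count then stops growing, the leak is tied to the timeout path, consistent with NUTCH-1640.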

