Hello,

Using the default crawl script (runtime/local/bin/crawl), the parser crashes while trying to create a new thread after parsing slightly more than 5000 documents.

This only happens if the number of documents to crawl (generate -topN) is set to > 5000.

Monitoring the number of threads created by the Nutch Java process shows it climbing to about 5700 before the crash occurs.
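For reference, this is roughly how I watched the thread count (the pgrep pattern is just what happened to match the job on my machine, so adjust as needed):

    PID=$(pgrep -f org.apache.nutch | head -n 1)
    while kill -0 "$PID" 2>/dev/null; do
        echo "$(date +%T) threads=$(ps -o nlwp= -p "$PID")"
        sleep 5
    done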

I thought that the parser would not create that many threads in the first place. Is this a bug or a misconfiguration? Is there any way to explicitly limit the number of threads used for parsing?
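From the stack trace below it looks like ParseSegment.map goes through ParseUtil.parse/runParser, which submits every document to a thread pool (I assume to enforce parser.timeout). If that is right, maybe disabling the timeout avoids those worker threads; this is only a guess on my side, not a verified fix:

    # untested experiment: disable the parse timeout that seems to drive
    # the per-document worker threads
    $bin/nutch parse $commonOptions $skipRecordsOptions \
        -D parser.timeout=-1 \
        $CRAWL_PATH/segments/$SEGMENT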

I found this thread, where it is recommended to decrease the number of URLs (topN): http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html

Is this the only possible solution? Older Nutch versions did not have this problem.
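Since "unable to create new native thread" usually points to an OS-level limit rather than an exhausted heap, I also checked the per-user limit and considered shrinking the thread stacks; both are only guesses from my side:

    ulimit -u    # max user processes/threads; a low value here can trigger this error
    # possibly fit more threads by giving each a smaller stack, e.g.:
    # commonOptions="... -D mapred.child.java.opts='-Xmx1000m -Xss512k' ..."

But even if that buys some headroom, it would not explain why the thread count grows in the first place.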

Parameters:
---------------
numSlaves=1
numTasks=`expr $numSlaves \* 2`
commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"

$bin/nutch parse $commonOptions $skipRecordsOptions $CRAWL_PATH/segments/$SEGMENT

hadoop.log
----------------

2013-10-18 14:57:28,294 INFO parse.ParseSegment - Parsed (0ms):http://www....
2013-10-18 14:57:28,301 WARN mapred.LocalJobRunner - job_local613646134_0001
java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:640)
    at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
    at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
-----------------

Any help, especially background information on why this happens, is appreciated.

Sybille

