Hello,
when I use the default crawl script (runtime/local/bin/crawl), the parser
crashes while trying to create a new thread after parsing slightly more
than 5000 documents.
This only happens if the number of documents to crawl (generate -topN)
is set to > 5000.
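For reference, this is where topN gets set in my copy of the script
(Nutch 1.7; flags and defaults may differ in other versions):
---------------
# excerpt from runtime/local/bin/crawl; sizeFetchlist ends up as -topN
sizeFetchlist=`expr $numSlaves \* 50000`
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -topN $sizeFetchlist -numFetchers $numSlaves -noFilter
---------------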
Monitoring the number of threads created by the Nutch Java process shows
it climbing to about 5700 before the crash occurs.
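(For completeness, I sampled the thread count like this on Linux,
assuming the parse JVM is the only process matching ParseSegment:)
---------------
# print the native thread count (nlwp) of the parse JVM once per second
watch -n 1 'ps -o nlwp= -p $(pgrep -f org.apache.nutch.parse.ParseSegment)'
---------------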
I thought that the parser would not create that many threads in the
first place. Is this a bug or a misconfiguration on my side? Is there
any way to explicitly limit the number of threads used for parsing?
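The closest thing to a parser thread knob I could find is the
parser.timeout property (default 30 seconds in nutch-default.xml; -1
disables it). If I read ParseUtil.parse() correctly, the extra thread
exists only to enforce that timeout, so disabling it should bypass the
thread pool entirely. Untested idea:
---------------
# untested: disable the parse timeout so ParseUtil.parse() calls the
# parser directly instead of submitting it to its thread pool
# (at the price of losing the per-document timeout)
$bin/nutch parse $commonOptions $skipRecordsOptions -D parser.timeout=-1 $CRAWL_PATH/segments/$SEGMENT
---------------
Would that be a sane workaround, or does it just hide the underlying
problem?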
I found the following thread, where decreasing the number of URLs
(topN) is recommended:
http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
Is decreasing topN really the only possible solution? Older Nutch
versions did not have this problem.
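The only stopgaps I can think of are at the OS/JVM level, e.g. raising
the per-user thread limit or shrinking the per-thread stack size
(bin/nutch appears to pass NUTCH_OPTS to the JVM, if I read it
correctly):
---------------
# stopgaps only: they postpone the OOM, they do not stop the thread growth
ulimit -u 16384                # allow more processes/threads for this user
export NUTCH_OPTS="-Xss256k"   # smaller thread stacks, so more threads fit
---------------
But these only buy time, hence the question above.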
Parameters:
---------------
numSlaves=1
numTasks=`expr $numSlaves \* 2`
commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
$bin/nutch parse $commonOptions $skipRecordsOptions $CRAWL_PATH/segments/$SEGMENT
hadoop.log
----------------
2013-10-18 14:57:28,294 INFO parse.ParseSegment - Parsed (0ms):http://www....
2013-10-18 14:57:28,301 WARN mapred.LocalJobRunner - job_local613646134_0001
java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:640)
    at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
    at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
-----------------
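My current hypothesis, from reading the 1.7 sources (please correct me
if I am wrong): ParseSegment.map() constructs a new ParseUtil, and with
it a new cached thread pool, for every single document, and never shuts
the pool down. A fresh pool has no idle worker to reuse, so every
document spawns one new thread, which then lingers for the pool's
60-second keep-alive. This toy program (not Nutch code, just the same
pattern) reproduces the unbounded growth:
---------------
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy reproduction of the suspected pattern: a fresh, never-shut-down
// cached thread pool per "document". A fresh pool has no idle worker,
// so each task creates a brand-new thread that then idles for 60s.
public class ThreadGrowth {
  public static void main(String[] args) throws Exception {
    for (int doc = 0; ; doc++) {
      ExecutorService pool = Executors.newCachedThreadPool(); // one pool per doc
      pool.submit(new Callable<String>() {      // like ParseUtil.runParser()
        public String call() { return "parsed"; }
      }).get();
      // pool.shutdown() is never called, so its worker lives on for 60s
      if (doc % 500 == 0)
        System.out.println(doc + " docs -> " + Thread.activeCount() + " threads");
    }
  }
}
---------------
If that reading is right, then at roughly 100 documents per second
(plausible given the 0ms parse times above) the steady state would be
around 6000 lingering threads, which matches the ~5700 I observe before
the OOM. Can anyone confirm?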
Any help (especially background information) is appreciated.
Sybille