Hi Julien,

thanks for the hints. I have a reproducible test case that fails every time. Applying the ParseSegment patch did not help, unfortunately. parser.timeout is set to the default of 30 seconds; I reduced this value, but it does not really help. The threads are created very fast (the parsing output shows a parse time of 0 ms for most documents), and the thread count of over 5000 is reached in about 50 seconds. It seems the threads are never shut down at all.
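
To make sure we mean the same mechanism, here is a minimal sketch of the pattern I suspect (my own illustration, not the actual ParseUtil code): each document is parsed on a worker thread, the caller waits with a timeout, and nothing interrupts or shuts down a worker that hangs.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutParseSketch {

  // one fresh executor (and thus one fresh thread) per document
  static String parseWithTimeout(final String doc, long timeoutSeconds)
      throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<String> result = executor.submit(new Callable<String>() {
      public String call() {
        return parse(doc); // a hung parse filter never returns
      }
    });
    try {
      return result.get(timeoutSeconds, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      // the caller gives up and moves on, but without
      // executor.shutdownNow() the stuck worker thread stays alive,
      // so every hanging document leaks one native thread
      return null;
    }
  }

  static String parse(String doc) {
    while (true) { } // simulate an infinite loop in a parse filter
  }

  public static void main(String[] args) throws Exception {
    for (int i = 0; i < 10000; i++) {
      parseWithTimeout("doc-" + i, 1); // thread count climbs by one per hang
    }
  }
}

If that is roughly what happens, lowering parser.timeout only makes the caller give up sooner; it does not reclaim the stuck threads, which would match what I am seeing.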

I already commented out most custom and extra plugins.

nutch-site.xml:
 <name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr||query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

Even if there is some bug in a parse filter (infinite loop), shouldn't the parsing stop instead of creating threads like crazy?

I cannot completely rule out some misconfiguration or error on my end. It might be interesting to try to reproduce this with a fresh, unmodified Nutch 1.7.

Sybille



On 18.10.2013 15:50, Julien Nioche wrote:
Hi Sybille

The threads spawned by the parser should be reclaimed once a page has been
parsed. The parsing itself is not multi-threaded, so it would mean that
something is preventing the threads from being reclaimed, or maybe, as the
error suggests, you are running out of memory.

Do you specify parser.timeout in nutch-site.xml? Are you using any custom
HTMLParsingFilter?

The number of docs should not affect the memory. The parser runs on one
document after the other so that would indicate a leak. There was a related
issue not very long ago https://issues.apache.org/jira/browse/NUTCH-1640.
Can you patch your code accordingly or use the trunk? I never got to the
bottom of it but I am wondering whether this would fix the issue.

Thanks

Julien


On 18 October 2013 14:32, Sybille Peters <[email protected]> wrote:

Hello,

Using the default crawl script (runtime/local/bin/crawl), the parser crashes
trying to create a new thread after parsing slightly more than 5000
documents.

This only happens if the number of documents to crawl (generate -topN) is
set to > 5000.

Monitoring the number of threads created by the Nutch Java process shows
that it increases to about 5700 before the crash occurs.
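
(For what it's worth, I watched the thread count from outside the JVM; the same number can also be read in-process via JMX. A tiny throwaway probe, not anything provided by Nutch:)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountProbe {
  public static void main(String[] args) throws InterruptedException {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    while (true) {
      // live and peak thread counts of this JVM
      System.out.println("live: " + threads.getThreadCount()
          + "  peak: " + threads.getPeakThreadCount());
      Thread.sleep(5000); // sample every 5 seconds
    }
  }
}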

I thought that the parser would not create that many threads in the first
place. Is this a bug/misconfiguration? Is there any way to limit the
number of threads explicitly for parsing?
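
I could not find a setting for that. The general idea of capping the parser's worker threads with a bounded pool would look something like this (purely illustrative, not an existing Nutch option):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedPoolSketch {
  public static void main(String[] args) throws InterruptedException {
    // at most 4 worker threads, no matter how many documents are submitted;
    // a stuck document would then block a slot instead of leaking a thread
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (int i = 0; i < 10000; i++) {
      final int doc = i;
      pool.submit(new Runnable() {
        public void run() {
          System.out.println("parsing doc " + doc); // parse would happen here
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}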

I found this thread, in which it is recommended to decrease the number of
URLs (topN):
http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html

Is this the only possible solution? Older Nutch versions did not have this
problem.

Parameters:
---------------
numSlaves=1
numTasks=`expr $numSlaves \* 2`
commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"

$bin/nutch parse $commonOptions $skipRecordsOptions $CRAWL_PATH/segments/$SEGMENT

hadoop.log
----------------

2013-10-18 14:57:28,294 INFO  parse.ParseSegment - Parsed (0ms): http://www....
2013-10-18 14:57:28,301 WARN  mapred.LocalJobRunner - job_local613646134_0001
java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
     at java.lang.Thread.start0(Native Method)
     at java.lang.Thread.start(Thread.java:640)
     at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
     at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
     at java.lang.Thread.run(Thread.java:662)
-----------------

Any help (especially information) is appreciated.

Sybille






--
Diplom-Informatikerin (FH) Sybille Peters
Leibniz Universität IT Services (ehemals RRZN)
Schloßwender Straße 5, 30159 Hannover
Tel.: +49 511 762 793280
Email: [email protected]
http://www.rrzn.uni-hannover.de

TYPO3@RRZN
TYPO3-Team Leibniz Universität IT Services (ehemals RRZN)
Email: [email protected]
http://www.t3luh.rrzn.uni-hannover.de
