Hi Julien,
thanks for the hints. I have a reproducible test case that fails every
time. Applying the ParseSegment patch did not help, unfortunately. The
parser.timeout is set to the default of 30 seconds; I reduced this
value, but it does not really help. The threads are created very quickly
(the parsing output shows a parse time of 0ms for most documents), and the
thread count of over 5000 is reached in about 50 seconds. It seems the
threads are not shut down at all.
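For what it is worth, one way to override the timeout for a single run,
assuming the parse job picks up -D properties the same way as the other
options in the crawl script, would be something like this (the value 10 is
only an example):

$bin/nutch parse -D parser.timeout=10 $commonOptions $skipRecordsOptions $CRAWL_PATH/segments/$SEGMENT

(If I read nutch-default.xml correctly, the value is in seconds and -1
disables the timeout.)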
I already commented out most custom and extra plugins.
nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr||query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Even if there is some bug in a parse filter (infinite loop), shouldn't
the parsing stop instead of creating threads like crazy?
I cannot completely rule out some misconfiguration or error on my end.
It might be interesting to try to reproduce this with a fresh, unmodified
version of Nutch 1.7.
Sybille
On 18.10.2013 15:50, Julien Nioche wrote:
Hi Sybille
The threads spawned by the parser should be reclaimed once a page has been
parsed. The parsing itself is not multi-threaded, so it would mean that
something is preventing the threads from being released, or maybe, as the
error suggests, you are simply running out of memory.
Do you specify parser.timeout in nutch-site.xml? Are you using any custom
HtmlParseFilter?
The number of docs should not affect the memory. The parser processes one
document after the other, so that would indicate a leak. There was a related
issue not very long ago: https://issues.apache.org/jira/browse/NUTCH-1640.
Can you patch your code accordingly or use the trunk? I never got to the
bottom of it, but I am wondering whether this would fix the issue.
Thanks
Julien
On 18 October 2013 14:32, Sybille Peters <[email protected]> wrote:
Hello,
Using the default crawl script (runtime/local/bin/crawl), the parser crashes
trying to create a new thread after parsing slightly more than 5000
documents.
This only happens if the number of documents to crawl (generate -topN) is
set to > 5000.
Monitoring the number of threads created by the Nutch Java process shows
that it increases to about 5700 before the crash occurs.
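For reference, the thread count can be watched with something along these
lines, where <pid> is the Java process of the local job runner (any
equivalent tool does the job):

watch -n 5 'ps -o nlwp= -p <pid>'    # nlwp = number of threads in the process
ls /proc/<pid>/task | wc -l          # alternative: count the task entries directly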
I thought that the parser would not create that many threads in the first
place. Is this a bug or a misconfiguration? Is there any way to limit the
number of threads explicitly for parsing?
I found this thread, where it is recommended to decrease the number of URLs
(topN):
http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
Is this the only possible solution? Older Nutch versions did not have this
problem.
Parameters:
---------------
numSlaves=1
numTasks=`expr $numSlaves \* 2`
commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-
**Xmx1000m -D mapred.reduce.tasks.**speculative.execution=false -D
mapred.map.tasks.speculative.**execution=false -D
mapred.compress.map.output=**true"
skipRecordsOptions="-D mapred.skip.attempts.to.start.**skipping=2 -D
mapred.skip.map.max.skip.**records=1"
$bin/nutch parse $commonOptions $skipRecordsOptions
$CRAWL_PATH/segments/$SEGMENT
hadoop.log
----------------
2013-10-18 14:57:28,294 INFO parse.ParseSegment - Parsed (0ms): http://www....
2013-10-18 14:57:28,301 WARN mapred.LocalJobRunner - job_local613646134_0001
java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:640)
        at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
        at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
        at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
-----------------
Any help (especially information) is appreciated.
Sybille
--
Diplom-Informatikerin (FH) Sybille Peters
Leibniz Universität IT Services (formerly RRZN)
Schloßwender Straße 5, 30159 Hannover
Tel.: +49 511 762 793280
Email: [email protected]
http://www.rrzn.uni-hannover.de
TYPO3@RRZN
TYPO3 Team, Leibniz Universität IT Services (formerly RRZN)
Email: [email protected]
http://www.t3luh.rrzn.uni-hannover.de