Hi Sybille,

> thanks for the hints. I have a reproducible testcase that fails every time.
> Applying the ParseSegment patch did not help, unfortunately. The
> parser.timeout is set to the default of 30 seconds. I reduced this value,
> but it does not really help.

> The threads are created very fast (parsing output shows a parse time of 0ms
> for most). The thread count of over 5000 is reached in about 50 seconds. It
> seems the threads are not closed down at all.


> I already commented out most custom and extra plugins.
>
> nutch-site.xml:
>  <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|
> indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|
> scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> Even if there is some bug in a parse filter (infinite loop), shouldn't the
> parsing stop instead of creating threads like crazy?
>

The purpose of using threads was to prevent the parsing of an entire
segment from failing (and possibly taking a long time to reparse) by
skipping the culprit document. When a parse hits the timeout, the thread
is not reclaimed (threads cannot be forcibly stopped in Java) but is left
hanging. That's fine in most cases, as there would be only a few timeouts
even on a large segment.
If many threads are being created, it probably means that something is
wrong with your parsing and that all your documents are triggering the
timeout.
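
For illustration, here is a minimal sketch of that pattern (simplified,
not the actual ParseUtil code; the endless loop stands in for a parse
filter that never returns):

    import java.util.concurrent.*;

    public class ParseTimeoutSketch {
      public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newCachedThreadPool();
        // stand-in for a parse task stuck in an infinite loop
        Future<String> task = executor.submit(new Callable<String>() {
          public String call() {
            while (true) { }
          }
        });
        try {
          // wait at most parser.timeout (default 30) seconds for the result
          String parse = task.get(30, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
          // cancel(true) merely interrupts the worker; a task that never
          // checks the interrupt flag keeps running, so the thread leaks.
          // Note that the JVM will not even exit here, as the leaked
          // thread is still busy.
          task.cancel(true);
        }
      }
    }

Note that reducing parser.timeout does not reclaim anything; it only
makes each leak happen sooner, which matches what you observed.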

One of the reasons why we marked the old crawl command as deprecated is
that the crawl cycles were running in the same JVM, so parse failures
(and their hanging threads) could accumulate over the lifetime of the
crawl. This is not the case when using the crawl script, where each step
runs in its own JVM and any leaked threads die with that process.
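
To make the accumulation concrete, here is an artificial sketch (not
Nutch code; the threads are marked as daemons only so the demo can exit):

    import java.lang.management.ManagementFactory;

    public class ThreadLeakDemo {
      public static void main(String[] args) {
        for (int cycle = 1; cycle <= 5; cycle++) {
          // simulate one hung parse per cycle: the thread outlives the cycle
          Thread hung = new Thread(new Runnable() {
            public void run() {
              while (true) { }
            }
          });
          hung.setDaemon(true);
          hung.start();
          System.out.println("live threads after cycle " + cycle + ": "
              + ManagementFactory.getThreadMXBean().getThreadCount());
        }
      }
    }

Within a single long-running JVM the count only ever grows; a fresh JVM
per step starts again from the baseline.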


>
> I cannot completely rule out some misconfiguration or error on my end.
> Might be interesting to try to reproduce this with a fresh, unmodified
> version of nutch 1.7.
>

Judging by your nutch-site.xml you must be using a very old version of
Nutch. Could you try to parse the segment which is giving you trouble with
the current trunk? If it happens there too, please open a JIRA and attach
a zip of the segment so that we can reproduce the issue.

Thanks

Julien




>
> Sybille
>
>
>
>
> On 18.10.2013 15:50, Julien Nioche wrote:
>
>> Hi Sybille
>>
>> The threads spawned by the parser should be reclaimed once a page has been
>> parsed. The parsing itself is not multi-threaded, so it would mean that
>> something is preventing the threads from being reclaimed, or maybe, as the
>> error suggests, you are running out of memory.
>>
>> Do you specify parser.timeout in nutch-site.xml? Are you using any custom
>> HtmlParseFilter?
>>
>> The number of docs should not affect the memory. The parser processes one
>> document after another, so that would indicate a leak. There was a related
>> issue not very long ago: https://issues.apache.org/jira/browse/NUTCH-1640.
>> Can you patch your code accordingly or use the trunk? I never got to the
>> bottom of it but I am wondering whether this would fix the issue.
>>
>> Thanks
>>
>> Julien
>>
>>
>> On 18 October 2013 14:32, Sybille Peters <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> using the default crawl script (runtime/local/bin/crawl) the parser will
>>> crash trying to create a new thread after parsing slightly more than 5000
>>> documents.
>>>
>>> This only happens if the number of documents to crawl (generate -topN) is
>>> set to > 5000.
>>>
>>> Monitoring the number of threads created by the nutch java process: it
>>> increases to about 5700 before the crash occurs.
>>>
>>> I thought that the parser would not create that many threads in the first
>>> place. Is this a bug/misconfiguration? Is there any way to limit the
>>> number of threads explicitly for parsing?
>>>
>>> I found this thread, where it is recommended to decrease the number of
>>> urls (topN):
>>> http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
>>>
>>>
>>> Is this the only possible solution? Older nutch versions did not have
>>> this
>>> problem.
>>>
>>> Parameters:
>>> ---------------
>>> numSlaves=1
>>> numTasks=`expr $numSlaves \* 2`
>>> commonOptions="-D mapred.reduce.tasks=$numTasks -D
>>> mapred.child.java.opts=-Xmx1000m -D
>>> mapred.reduce.tasks.speculative.execution=false -D
>>> mapred.map.tasks.speculative.execution=false -D
>>> mapred.compress.map.output=true"
>>> skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D
>>> mapred.skip.map.max.skip.records=1"
>>>
>>>
>>> $bin/nutch parse $commonOptions $skipRecordsOptions
>>> $CRAWL_PATH/segments/$SEGMENT
>>>
>>> hadoop.log
>>> ----------------
>>>
>>> 2013-10-18 14:57:28,294 INFO  parse.ParseSegment - Parsed (0ms):
>>> http://www....
>>> 2013-10-18 14:57:28,301 WARN  mapred.LocalJobRunner - job_local613646134_0001
>>> java.lang.Exception: java.lang.OutOfMemoryError: unable to create new
>>> native thread
>>>      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
>>> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>>>      at java.lang.Thread.start0(Native Method)
>>>      at java.lang.Thread.start(Thread.java:640)
>>>      at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
>>>      at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
>>>      at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
>>>      at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
>>>      at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
>>>      at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>>>      at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>>>      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>>>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
>>>      at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
>>>      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>>      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>      at java.lang.Thread.run(Thread.java:662)
>>>
>>> -----------------
>>>
>>> Any help (especially information) is appreciated.
>>>
>>> Sybille
>>>
>>>
>>>
>>>
>>
>
> --
> Diplom-Informatikerin (FH) Sybille Peters
> Leibniz Universität IT Services (ehemals RRZN)
> Schloßwender Straße 5, 30159 Hannover
> Tel.: +49 511 762 793280
> Email: [email protected]
> http://www.rrzn.uni-hannover.de
>
> TYPO3@RRZN
> TYPO3-Team Leibniz Universität IT Services (ehemals RRZN)
> Email: [email protected]
> http://www.t3luh.rrzn.uni-hannover.de
>
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
