This is a known issue with the all-in-one crawl command: the parsing threads
cannot be killed and will live on. It is recommended to use the shell
script in trunk as a replacement for the crawl command; since the various
Nutch steps are called separately there, these parsing threads will not
survive the end of the parse step.
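As a rough sketch, running the steps individually looks like the following. This is not the trunk script itself, just a minimal illustration; the `crawl/` paths, the seed directory `urls`, and the loop depth of 3 are placeholder values to adapt to your setup.

```shell
# Run the Nutch phases as separate commands instead of the all-in-one
# "crawl" command, so each phase runs in its own JVM.
bin/nutch inject crawl/crawldb urls
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $SEGMENT
  # Any parser thread that outlives its timeout dies when this JVM exits.
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
done
```

The key point is the separate `parse` invocation: a hung Tika thread can only survive as long as the parse JVM, instead of lingering inside a long-lived crawl process.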

On 2 August 2012 11:33, [email protected] <[email protected]> wrote:

> Hi,
> I'm crawling a set of URLs. When I reach the parse phase, the parsing of
> most of the URLs finishes quickly, but some large URLs take longer.
> I also have the parser.timeout parameter set to 180 seconds.
> I see in the log that the parsing of some URLs fails with a
> ParseTimeout, and I'm fine with that; my intention is to deal with these
> files later.
> The problem is that the map process handling such a URL does not seem
> to be killed.
> I see a map process still running after the parsing has allegedly finished
> its work.
> The process takes a significant amount of memory and CPU.
> When I run jstack on the process, I see that it has a thread in state=IN_VM
> with a Tika stack trace.
>
>
> After running many parse jobs on different segments, I find myself with a
> cluster full of hung parsing processes.
>
>
> Is this a known issue?
>
> thanks.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/parse-hangs-when-trying-to-parse-large-files-tp3998771.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
