parse hangs when trying to parse large files

[email protected] Thu, 02 Aug 2012 03:34:24 -0700

Hi,
I'm crawling a set of URLs. When I reach the parse phase, most of the URLs
are parsed quickly, but some URLs that point to large files take much longer.
I also have the parser.timeout parameter set to 180 seconds.
I see in the log that parsing of some of the URLs fails with a
ParseTimeout, and I'm fine with that. My intention is to deal with these
files later.
The problem is that the map process that handled such a URL does not seem
to have been killed.
I still see a map process running after the parse job has allegedly finished
its work.
The process uses a significant amount of memory and CPU.
When I run jstack on the process, I see a thread with state=IN_VM and a
Tika stack trace.
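
To illustrate what I think may be happening, here is a minimal Java sketch.
It assumes the parse timeout is enforced by waiting on a Future with a time
limit and then cancelling it; I have not confirmed that this is how Nutch
applies parser.timeout. The class name, method name and the busy loop that
stands in for a stuck Tika parse are all hypothetical. The point is only
that Future.cancel(true) sends an interrupt, and code that never checks for
it keeps running after the caller has given up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; not part of Nutch or Tika.
public class ParseTimeoutSketch {

    // Returns true if the task body is still executing after the caller
    // timed out and cancelled the Future.
    static boolean bodyStillRunningAfterCancel(long timeoutMillis) {
        final AtomicBoolean release = new AtomicBoolean(false); // lets the demo clean up
        final AtomicBoolean running = new AtomicBoolean(false);
        final CountDownLatch started = new CountDownLatch(1);

        // Daemon thread so a stuck worker cannot keep this demo JVM alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fake-parser");
            t.setDaemon(true);
            return t;
        });
        try {
            Future<String> task = pool.submit(() -> {
                running.set(true);
                started.countDown();
                try {
                    // Stand-in for a parser stuck on a large file:
                    // it spins and never checks the interrupt flag.
                    long spins = 0;
                    while (!release.get()) {
                        spins++;
                    }
                    return "parsed after " + spins + " spins";
                } finally {
                    running.set(false);
                }
            });
            started.await(); // make sure the body has actually begun
            try {
                task.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return false; // finished in time, nothing to show
            } catch (TimeoutException e) {
                task.cancel(true);    // only delivers an interrupt
                Thread.sleep(200);    // give the worker time to react
                return running.get(); // is the body still executing?
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            release.set(true); // stop the fake parser
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("body still running after cancel: "
                + bodyStillRunningAfterCancel(100));
    }
}
```

If the real timeout works along these lines, a ParseTimeout in the log would
only mean that the caller stopped waiting, not that the parsing thread
stopped.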



After running many parse jobs on different segments, I end up with a
cluster full of hung parsing processes.


Is this a known issue?

Thanks.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/parse-hangs-when-trying-to-parse-large-files-tp3998771.html
Sent from the Nutch - User mailing list archive at Nabble.com.
