This is a known issue with the all-in-one "crawl" command: the parsing threads cannot be killed and live on. It is recommended to use the shell script in trunk as a replacement for the crawl command, since it calls the various Nutch steps separately, which means these parsing threads will not survive the end of the parse process.
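As a rough sketch, running the steps as separate commands could look like the following. This is not the actual trunk script, just an illustration of the idea; directory names are placeholders and the exact options vary by Nutch version:

```shell
#!/bin/sh
# Run each Nutch phase as its own process, so any parser thread left
# hanging after a ParseTimeout dies when the parse step's JVM exits.
# CRAWL_DIR and SEED_DIR are placeholder paths, not Nutch defaults.

CRAWL_DIR=crawl
SEED_DIR=urls

bin/nutch inject $CRAWL_DIR/crawldb $SEED_DIR
bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments
SEGMENT=`ls -d $CRAWL_DIR/segments/* | tail -1`
bin/nutch fetch $SEGMENT -noParsing
bin/nutch parse $SEGMENT      # hung Tika threads end with this process
bin/nutch updatedb $CRAWL_DIR/crawldb $SEGMENT
```

The point is that the parse step runs in its own JVM, so whatever Tika leaves behind is cleaned up by the OS when that JVM exits, instead of accumulating inside a long-lived process.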
On 2 August 2012 11:33, [email protected] <[email protected]> wrote:

> Hi,
> I'm crawling a set of URLs. When I reach the parse phase, the parsing of
> most of the URLs is done fast, but some large URLs take longer.
> I also have the parser.timeout parameter set to 180 secs.
> I see in the log that parsing of some of the URLs fails because of a
> ParseTimeout, and I'm fine with it. My intention is to deal with these
> files later.
> The problem is that the map process which handles such a URL does not
> seem to be killed: I see a map process still running after the parsing
> has allegedly finished its work. The process takes a significant amount
> of memory and CPU. When I run jstack on the process, I see that it has a
> thread in state=IN_VM with a Tika stack trace.
>
> After running many parse jobs on different segments, I find myself with
> a cluster full of hung parsing processes.
>
> Is this a known issue?
>
> Thanks.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/parse-hangs-when-trying-to-parse-large-files-tp3998771.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

