I'm not using the all-in-one crawl command. I've changed Nutch's code a bit so that it lets me run each step in a loop. I'm using Nutch to index a filesystem that is served by a web server (IIS). That's why it seemed more efficient to first collect all the file paths in the filesystem, convert them into a URL list, and run Nutch on that list, where every URL is actually a file. Once I had this URL list, I injected it and created a new crawldb. Then I ran the generator to produce X segments with Y URLs in each segment. Then I loop over all segments to fetch the URLs, and finally I loop over all segments again to parse them.
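For clarity, the step-by-step pipeline above can be sketched roughly as the following script. The paths, and the X/Y values, are placeholders; `NUTCH` is set to `echo bin/nutch` so the sketch just prints the commands it would run (drop the `echo` to execute them for real):

```shell
#!/bin/sh
# Sketch of the per-step pipeline (inject -> generate -> fetch -> parse).
# NUTCH dry-runs via echo; paths below are assumed, not Nutch defaults.
NUTCH="echo bin/nutch"

CRAWLDB=crawl/crawldb        # assumed crawldb location
SEGMENTS=crawl/segments      # assumed segments directory
URLS=urls                    # directory holding the generated URL list

# 1. Inject the URL list (one URL per file) into a fresh crawldb.
$NUTCH inject $CRAWLDB $URLS

# 2. Generate X segments with at most Y URLs each (-topN caps per segment).
X=3
Y=1000
i=0
while [ $i -lt $X ]; do
  $NUTCH generate $CRAWLDB $SEGMENTS -topN $Y
  i=$((i + 1))
done

# 3. Fetch every segment, then parse every segment, in separate loops.
for seg in $SEGMENTS/*; do
  $NUTCH fetch $seg
done
for seg in $SEGMENTS/*; do
  $NUTCH parse $seg
done
```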
But that's all just background. I hit the problem even when I run a single parse command on a single segment: if one of the parsing tasks times out, the process keeps running and I have to kill it manually, or it keeps eating memory and CPU. Killing it manually is of course not possible when these parse tasks run nightly, one after another. Is there any way to solve this?

--
View this message in context: http://lucene.472066.n3.nabble.com/parse-hangs-when-trying-to-parse-large-files-tp3998771p3998785.html
Sent from the Nutch - User mailing list archive at Nabble.com.
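One possible external workaround (not a fix inside Nutch itself) is to run each parse under a watchdog such as GNU coreutils `timeout`, so a hung parse is killed instead of running all night. The segment path and the time limit below are assumptions:

```shell
#!/bin/sh
# Workaround sketch: kill any parse that exceeds a hard wall-clock limit.
# GNU `timeout` exits with status 124 when it had to kill the command.
PARSE_LIMIT=1800   # seconds allowed per segment (assumed value)

for seg in crawl/segments/*; do
  timeout "$PARSE_LIMIT" bin/nutch parse "$seg"
  if [ $? -eq 124 ]; then
    # The parse hung and was killed; log it and continue with the next segment.
    echo "parse of $seg exceeded ${PARSE_LIMIT}s, skipped" >&2
  fi
done
```

This does not address why the parser hangs on large files, but it keeps a nightly batch moving instead of requiring a manual kill.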

