I'm not using the all-in-one crawl command. I've changed Nutch's code a bit so that it lets me run each step in a loop. I'm using Nutch to index a filesystem that is served by a web server (IIS). That's why it seemed more efficient to first collect all the file paths in the filesystem, convert them into a URL list, and run Nutch on that list, where every URL is actually a file. Once I had this URL list, I injected it and created a new crawldb. Then I ran the generator to produce X segments with Y URLs in each segment. Then I loop over all segments to fetch the URLs, and finally I loop over all segments again to parse them.
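For clarity, the step-by-step pipeline above can be sketched roughly as the following script. The paths, and the X/Y values, are placeholders; `NUTCH` is set to `echo bin/nutch` so the sketch just prints the commands it would run (drop the `echo` to execute them for real):

```shell
#!/bin/sh
# Sketch of the per-step pipeline (inject -> generate -> fetch -> parse).
# NUTCH dry-runs via echo; paths below are assumed, not Nutch defaults.
NUTCH="echo bin/nutch"

CRAWLDB=crawl/crawldb        # assumed crawldb location
SEGMENTS=crawl/segments      # assumed segments directory
URLS=urls                    # directory holding the generated URL list

# 1. Inject the URL list (one URL per file) into a fresh crawldb.
$NUTCH inject $CRAWLDB $URLS

# 2. Generate X segments with at most Y URLs each (-topN caps per segment).
X=3
Y=1000
i=0
while [ $i -lt $X ]; do
  $NUTCH generate $CRAWLDB $SEGMENTS -topN $Y
  i=$((i + 1))
done

# 3. Fetch every segment, then parse every segment, in separate loops.
for seg in $SEGMENTS/*; do
  $NUTCH fetch $seg
done
for seg in $SEGMENTS/*; do
  $NUTCH parse $seg
done
```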
But that's all just background. I hit the problem even when I run a single parse command on a single segment: if one of the parsing tasks times out, the process keeps running and I have to kill it manually, or it keeps eating memory and CPU. Killing it manually is of course not possible when these parse tasks run nightly, one after another. Is there any way to solve this?

--
View this message in context: http://lucene.472066.n3.nabble.com/parse-hangs-when-trying-to-parse-large-files-tp3998771p3998785.html
Sent from the Nutch - User mailing list archive at Nabble.com.
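One possible external workaround (not a fix inside Nutch itself) is to run each parse under a watchdog such as GNU coreutils `timeout`, so a hung parse is killed instead of running all night. The segment path and the time limit below are assumptions:

```shell
#!/bin/sh
# Workaround sketch: kill any parse that exceeds a hard wall-clock limit.
# GNU `timeout` exits with status 124 when it had to kill the command.
PARSE_LIMIT=1800   # seconds allowed per segment (assumed value)

for seg in crawl/segments/*; do
  timeout "$PARSE_LIMIT" bin/nutch parse "$seg"
  if [ $? -eq 124 ]; then
    # The parse hung and was killed; log it and continue with the next segment.
    echo "parse of $seg exceeded ${PARSE_LIMIT}s, skipped" >&2
  fi
done
```

This does not address why the parser hangs on large files, but it keeps a nightly batch moving instead of requiring a manual kill.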

