I am attempting to crawl a very large intranet file system using Nutch and I am having some issues. At one point in the crawl cycle I get a Java heap space error during fetching. I think it is related to the number of URLs listed in the segment to be fetched. I do want to crawl/index EVERYTHING on this share drive, but I think the sheer number of folders and files listed in some directories is hosing things up.
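For reference, this is roughly how I am setting the heap at the moment when the fetch blows up (the 4096 value and the mapred-site.xml snippet are just what I am experimenting with in local mode, not a recommendation):

    # Local bin/nutch picks up NUTCH_HEAPSIZE (in MB) and turns it into -Xmx<n>m.
    # 4096 is just the value I happen to be trying.
    export NUTCH_HEAPSIZE=4096

    # If I understand correctly, in (pseudo-)distributed mode the fetcher runs in
    # Hadoop child JVMs instead, so the heap would come from something like this
    # in mapred-site.xml rather than from NUTCH_HEAPSIZE:
    #   <property>
    #     <name>mapred.child.java.opts</name>
    #     <value>-Xmx2048m</value>
    #   </property>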
So my question is: will changing topN to a small number allow me to eventually get all the URLs on this shared drive (after many, many generate -> fetch -> parse -> updatedb -> invertlinks -> solrindex cycles)? I recently upgraded to 1.4 and I don't see the depth parameter any more. If it is still around, would that be a possible way to shorten each cycle and keep the memory usage down? Anything else I am missing? Thanks!
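In case it helps to see what I mean, this is roughly the per-cycle script I have in mind with a small topN. The paths, Solr URL, topN and thread counts are just placeholders from my setup, and the solrindex arguments are as I understand them for 1.4, so please correct me if I have that wrong:

    #!/bin/bash
    # One generate -> fetch -> parse -> updatedb -> invertlinks -> solrindex pass,
    # meant to be re-run until generate produces no new segment.
    CRAWLDB=crawl/crawldb
    LINKDB=crawl/linkdb
    SEGMENTS=crawl/segments
    SOLR=http://localhost:8983/solr/

    # Keep each segment small so a single fetch doesn't blow the heap.
    bin/nutch generate $CRAWLDB $SEGMENTS -topN 1000

    # Pick up the segment that generate just created (newest entry).
    SEGMENT=$SEGMENTS/`ls -t $SEGMENTS | head -1`

    bin/nutch fetch $SEGMENT -threads 10
    bin/nutch parse $SEGMENT
    bin/nutch updatedb $CRAWLDB $SEGMENT
    bin/nutch invertlinks $LINKDB $SEGMENT
    bin/nutch solrindex $SOLR $CRAWLDB $LINKDB $SEGMENT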