I am attempting to crawl a very large intranet file system using Nutch and I
am running into some issues.  At one point in the crawl cycle I get a Java
heap space error during fetching.  I think it is related to the number of
URLs listed in the segment to be fetched.  I do want to crawl/index
EVERYTHING on this share drive, but I think the sheer number of folders and
files listed in some directories is hosing things up.
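
Is raising the fetcher heap the right stopgap in the meantime?  This is the
kind of thing I mean for a local (non-distributed) run; the 4096 MB value is
just a number I picked, not something I know is enough:

  # bin/nutch reads NUTCH_HEAPSIZE (in MB) before launching the JVM,
  # so this should bump the heap for a local fetch.
  export NUTCH_HEAPSIZE=4096

  # If running on Hadoop instead, my understanding is the per-task heap
  # comes from mapred.child.java.opts in the job config, e.g.:
  #   <property>
  #     <name>mapred.child.java.opts</name>
  #     <value>-Xmx4096m</value>
  #   </property>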

So my question is: will changing topN to a small number allow me to
eventually get all the URLs on this shared drive (after many, many generate
->fetch ->parse ->updatedb ->invertlinks ->solrindex cycles)?  I recently
upgraded to 1.4 and I don't see the depth parameter any more.  If it is still
around, would that be a possible way to shorten the cycle and keep the
memory usage down during each pass?
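
For reference, here is roughly what one cycle looks like on my end with a
small topN (the paths, the topN value, and the Solr URL are just examples):

  # Generate a segment capped at topN URLs, hoping a smaller segment
  # keeps fetcher memory under control.
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000

  # Pick up the segment that generate just created (newest directory).
  SEGMENT=crawl/segments/`ls crawl/segments | sort | tail -1`

  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments

  # I may not have the exact 1.4 solrindex arguments right -- I go by
  # what 'bin/nutch solrindex' prints with no arguments.
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $SEGMENT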

Anything else I am missing? 

Thanks!
