Is there a guide to optimizing Nutch/Hadoop for crawling intranet sites? Most of what I need to crawl are large stores of data (databases exposed through HTML), share drive content, etc. I have a very small number of "sites" to crawl (two databases and one share drive). The file share crawling is painfully slow. I am reading the code as we speak, trying to figure out why the protocol-file plugin is so slow. Based on the following entry in the wiki, I don't think I am going to be able to improve fetching speed, because I am crawling just a few sites.
From http://wiki.apache.org/nutch/OptimizingCrawls :

> "Fetching a lot of pages from a single site or a lot of pages from a few sites will slow down fetching dramatically. For full web crawls you want an even distribution so all fetching threads can be active. Setting generate.max.per.host to a value > 0 will limit the number of pages from a single host/domain to fetch."

Could code changes or property changes help speed things up? If so, could someone give me a hint?

Thanks!
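For concreteness, this is the kind of nutch-site.xml override I have been considering. The property names come from conf/nutch-default.xml, but the values are guesses on my part and I am not sure these are the right knobs for an intranet/file-share crawl (I believe fetcher.threads.per.host was renamed fetcher.threads.per.queue in more recent releases):

  <?xml version="1.0"?>
  <configuration>
    <!-- More fetcher threads overall (default is 10). -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>20</value>
    </property>
    <!-- Allow several threads to hit the same host at once; the default of 1
         serializes everything when there are only a couple of hosts.
         (fetcher.threads.per.queue in newer Nutch versions.) -->
    <property>
      <name>fetcher.threads.per.host</name>
      <value>10</value>
    </property>
    <!-- Shrink the politeness delay between requests to the same host
         (default 5.0 seconds). -->
    <property>
      <name>fetcher.server.delay</name>
      <value>0.5</value>
    </property>
    <!-- I understand this delay is the one used when more than one thread
         per host is allowed, so lower it as well. -->
    <property>
      <name>fetcher.server.min.delay</name>
      <value>0.0</value>
    </property>
  </configuration>

If loosening the per-host politeness settings like this is the wrong approach for file:// and intranet database crawls, a pointer to the right one would be much appreciated.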