Is there a guide to optimizing Nutch/Hadoop for crawling intranet sites?

Most of what I need to crawl are large stores of data (databases exposed
through HTML), share drive content, etc.  I have a very small number of
"sites" to crawl (two databases and one share drive).  The file share
crawling is painfully slow.  I am reading the code as we speak, trying to
figure out why the protocol-file plugin is so slow.  Based on the following
entry in the wiki, I don't think I am going to be able to increase the
fetching rate, because I am crawling just a few sites.

From http://wiki.apache.org/nutch/OptimizingCrawls:

"Fetching a lot of pages from a single site or a lot of pages from a few
sites will slow down fetching dramatically. For full web crawls you want an
even distribution so all fetching threads can be active. Setting
generate.max.per.host to a value > 0 will limit the number of pages from a
single host/domain to fetch."
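
If I understand that correctly, the property would go into conf/nutch-site.xml
along these lines (the name is from the wiki page; the value is just an
example, and I assume the default of -1 means unlimited):

  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
    <description>Maximum number of URLs per host in a single fetch list;
    -1 for no limit (if I read the docs right).</description>
  </property>

Since I actually want every page from my two or three hosts, I would
presumably leave this unlimited, which is why I suspect the fetcher-side
politeness settings matter more in my case.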

Could code changes or property changes help speed things up? If so, could
someone give me a hint?
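
For example, I suspect the per-host politeness settings are what hurt most
here, since all of my URLs resolve to only a couple of hosts.  Is tuning
something like the following in conf/nutch-site.xml the right direction
(property names as I understand them from nutch-default.xml; values are just
illustrative), or is the protocol-file plugin itself the bottleneck?

  <property>
    <name>fetcher.threads.per.host</name>
    <value>4</value>
    <description>Allow more than one simultaneous request to the same
    host.</description>
  </property>

  <property>
    <name>fetcher.server.delay</name>
    <value>0.0</value>
    <description>Politeness delay between requests to the same host;
    presumably safe to drop for my own shares.</description>
  </property>

  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.0</value>
    <description>As I understand it, this applies instead of
    fetcher.server.delay when fetcher.threads.per.host is greater
    than 1.</description>
  </property>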

Thanks!

