What is a reasonable number of fetcher threads? What about memory? Where is the
best place to set those: in the nutch script, or in one of the config files?
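For what it's worth, a sketch of where these knobs usually live (values here are just placeholders, not recommendations): thread count goes in conf/nutch-site.xml, and per-task heap can be raised via the standard Hadoop child-JVM option in the same file:

```xml
<!-- conf/nutch-site.xml (overrides nutch-default.xml) -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>  <!-- number of fetcher threads; 10 is the shipped default -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>  <!-- heap for each map/reduce child JVM; pick to fit your boxes -->
</property>
```

For the client-side JVM that bin/nutch launches, the script honors the NUTCH_HEAPSIZE environment variable (in MB, defaulting to 1000), so exporting a larger value before running is the usual alternative to editing the script itself.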

I abandoned distributed mode (10 slaves): it was taking way too long to crawl
the web and share drives in my enterprise. I'm also running entirely on
Windows, and I think Hadoop is having some issues on the namenode (it shuts
down after running for a few hours).

I get an OOM error during the fetch cycle:

java.lang.OutOfMemoryError: Java heap space
   at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
   ....

This happens after several 404 errors on files (some directories and files are
locked down, hence the 404s), as well as several exceptions like:

java.lang.IllegalArgumentException:
URLDecoder: Illegal Hex characters in escape (%) pattern - For input string:
" G"


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Large-Shared-Drive-Crawl-tp3781917p3783800.html
Sent from the Nutch - User mailing list archive at Nabble.com.
