What is a reasonable number of threads? What about memory? Where is the best place to set those: in the nutch script, or in one of the config files?
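For context, the fetcher thread count normally lives in conf/nutch-site.xml (overriding nutch-default.xml) rather than in the bin/nutch script. A minimal sketch, assuming a recent Nutch 1.x; the value here is a placeholder, not a recommendation:

```xml
<!-- conf/nutch-site.xml: overrides defaults from nutch-default.xml -->
<property>
  <name>fetcher.threads.fetch</name>
  <!-- total number of fetcher threads; placeholder value, tune for your setup -->
  <value>10</value>
</property>
```

Heap size, by contrast, is usually an environment setting picked up by the bin/nutch script itself (e.g. `NUTCH_HEAPSIZE`, in MB), so memory and threads end up being configured in different places.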
I abandoned distributed mode (10 slaves); it was taking way too long to crawl the web and shared drives in my enterprise. I am also running entirely on a Windows platform, and I think Hadoop is having some issues on the namenode (it shuts down after running for a few hours).

I get an OOM error during the fetch cycle:

    java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
        ...

This happens after several file 404 errors (some directories and files are locked down, hence the 404s), as well as several of these:

    java.lang.IllegalArgumentException: URLDecoder: Illegal Hex characters in escape (%) pattern - For input string: " G"

--
View this message in context: http://lucene.472066.n3.nabble.com/Large-Shared-Drive-Crawl-tp3781917p3783800.html
Sent from the Nutch - User mailing list archive at Nabble.com.
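Since the OOM above occurs inside a MapReduce task (IFile$Reader runs in the child task JVM, not the namenode), the relevant heap setting would likely be the child JVM options rather than the namenode's. A hedged sketch, assuming classic Hadoop 0.20/1.x property names; the heap size is a placeholder:

```xml
<!-- conf/mapred-site.xml (or nutch-site.xml when running locally) -->
<property>
  <name>mapred.child.java.opts</name>
  <!-- heap for each map/reduce child task JVM; default is only -Xmx200m -->
  <value>-Xmx1024m</value>
</property>
```

Raising this would not fix the URLDecoder exceptions (those come from unescaped "%" characters in file names on the share), but it is the setting that governs the heap available to the fetch cycle's tasks.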