Hi

> * Given my current setup, is the crawl that I'm performing taking
> roughly the same time that others might expect?
> * If this crawl is taking much longer than you might expect what would
> you suggest trying to decrease the crawl time?
>

The fetch step is likely to take most of the time, and how long it takes is
mostly a matter of the distribution of hosts/IPs/domains in your fetchlist.
Search the wiki for details on performance tips.
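
To illustrate why the host distribution matters, here is a minimal Python
sketch. It is not part of Nutch: the function names are hypothetical, and the
5-second per-host politeness delay is an assumption you should replace with
whatever your own configuration uses. It counts URLs per host and gives a rough
lower bound on fetch time, ignoring download time and assuming requests to the
same host are serialized with a fixed delay between them.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_counts(urls):
    """Count how many URLs in a fetchlist belong to each host."""
    hosts = (urlsplit(u).hostname for u in urls)
    return Counter(h for h in hosts if h)


def min_fetch_seconds(urls, delay_per_host=5.0):
    """Rough lower bound on fetch time, in seconds.

    Assumes requests to one host are serialized with `delay_per_host`
    seconds between them (an assumed value, not read from any config),
    so the busiest host dictates the minimum duration.
    """
    counts = host_counts(urls)
    if not counts:
        return 0.0
    busiest = max(counts.values())
    return (busiest - 1) * delay_per_host


if __name__ == "__main__":
    sample = [
        "http://a.example/1",
        "http://a.example/2",
        "http://a.example/3",
        "http://b.example/x",
    ]
    print(host_counts(sample).most_common(5))
    print(min_fetch_seconds(sample))
```

If a handful of hosts account for most of the fetchlist, adding fetcher threads
will not help much, since the busiest host sets the floor.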


> * Should I be moving away from 'bin/nutch crawl -depth 8 -topN 10000'
> and calling each step individually? Why?
>

This has been discussed several times on the mailing list: you get more
control with a script, and the all-in-one crawl command can have issues with
runaway parsing threads, etc.


> * Are there more recent/improved versions of
> http://wiki.apache.org/nutch/Crawl scripts that are written for Nutch
> 2.x?
>

Yes, see the patch in https://issues.apache.org/jira/browse/NUTCH-1087


> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
>

Redirections, perhaps? That does sound like quite a lot, though.

HTH

J



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
