Hi
> * Given my current setup, is the crawl that I'm performing taking
> roughly the same time that others might expect?
> * If this crawl is taking much longer than you might expect what would
> you suggest trying to decrease the crawl time?
>

The fetch step is likely to take most of the time, and how long it takes is mostly a matter of the distribution of hosts/IPs/domains in your fetchlist. Search the wiki for details on performance tips.

> * Should I be moving away from 'bin/nutch crawl -depth 8 -topN 10000'
> and calling each step individually? Why?
>

This has been discussed several times on the mailing list: you get more control with a script, and the all-in-one crawl command can have issues with runaway parsing threads, etc.

> * Are there more recent/improved versions of
> http://wiki.apache.org/nutch/Crawl scripts that are written for Nutch
> 2.x?
>

Yes, see the patch in https://issues.apache.org/jira/browse/NUTCH-1087

> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
>

Redirections? That sounds like quite a lot though.

HTH

J

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
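PS: to make the point about host distribution concrete, here is a rough back-of-the-envelope sketch. It assumes the fetcher sends one request at a time per host, waits a fixed politeness delay between them (5 seconds here is an assumed value, check what is configured in your own setup), and fetches different hosts in parallel. The function name and the sample URLs are hypothetical, and download time is ignored, so treat the result as a lower bound rather than a prediction.

```python
from collections import Counter
from urllib.parse import urlparse


def estimate_fetch_seconds(urls, delay_per_host=5.0):
    """Rough lower bound on fetch time under per-host politeness.

    Assumes one request at a time per host, separated by delay_per_host
    seconds, with all hosts fetched in parallel. Download time is ignored.
    """
    per_host = Counter(urlparse(u).netloc for u in urls)
    if not per_host:
        return 0.0
    # The busiest host sets the floor: its URLs are fetched one by one.
    return max(per_host.values()) * delay_per_host


# Two hypothetical fetchlists of 1,000 URLs each.
skewed = [f"http://example.com/page{i}" for i in range(1000)]
spread = [f"http://host{i % 100}.example.org/page{i}" for i in range(1000)]

print(estimate_fetch_seconds(skewed))  # 5000.0 (everything on one host)
print(estimate_fetch_seconds(spread))  # 50.0 (10 URLs on each of 100 hosts)
```

Same number of URLs, but the single-host list takes about 100 times longer under these assumptions, which is why the makeup of the fetchlist matters more than its size.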

