Julien,

> the fetch step is likely to take most of the time and the time it takes is
> mostly a matter of the distribution of hosts/IP/domains in your fetchlist.
> Search the WIKI for details on performance tips

Thanks. Most of the URLs that I'm fetching are each on their own IP/host and unique server.

>> * Should I be moving away from 'bin/nutch crawl -depth 8 -topN 10000'
>> and calling each step individually? Why?
>
> This has been discussed several times on the mailing list: you get more
> control with a script + all in one crawl command can have issues with
> runaway parsing threads, etc...

Understood.

>> * Are there more recent/improved versions of
>> http://wiki.apache.org/nutch/Crawl scripts that are written for Nutch
>> 2.x?
>
> yes, see patch in https://issues.apache.org/jira/browse/NUTCH-1087

Thanks. I'll review that.

>> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
>
> redirections? sounds quite a lot though

Any thoughts on how I would identify which are redirects?

> HTH
>
> J
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
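
P.S. To check the point above about the distribution of hosts in the fetchlist, here is a minimal sketch of how I could count URLs per host. It is plain Python and does not use any Nutch API; the function name `host_distribution` and the sample URLs are my own assumptions for illustration, and in practice the input would be the seed list or a text dump of the URLs.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_distribution(urls):
    """Count how many URLs in an iterable map to each host name."""
    counts = Counter()
    for url in urls:
        # hostname is None for blank lines or entries without a scheme/host
        host = urlsplit(url.strip()).hostname
        if host:
            counts[host] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real crawl.
    sample = [
        "http://a.example/page1",
        "http://a.example/page2",
        "http://b.example/",
    ]
    # Most heavily represented hosts first.
    for host, n in host_distribution(sample).most_common():
        print(f"{n}\t{host}")
```

If nearly every host shows a count of 1, the fetchlist is already spread widely. A few hosts with very large counts would suggest the fetch time is dominated by those hosts.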

