Hi Matt

> > the fetch step is likely to take most of the time and the time it takes is
> > mostly a matter of the distribution of hosts/IP/domains in your fetchlist.
> > Search the WIKI for details on performance tips
>
> Thanks. Most of the urls that I'm fetching are each on their own
> IP/hosts and unique servers.

Ok, you might want to use a large number of threads then
(fetcher.threads.fetch)

[...]

> >> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
> >
> > redirections? sounds quite a lot though
>
> Thoughts for how I would identify which are redirects?

try using 'nutch readdb' to dump the content of the webtable and inspect
the URLs

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
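
PS: once you have the URLs out of the dump, one quick way to see which ones never made it into the index is a plain set difference. Below is a rough sketch in Python, not Nutch code. It assumes you have already extracted the URLs into two plain text files, one URL per line: one taken from the webtable dump and one exported from the ElasticSearch index. The function names and the two input files are made up for illustration.

```python
import sys


def load_urls(path):
    """Read one URL per line from a text file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def missing_from_index(webtable_urls, indexed_urls):
    """Return, sorted, the URLs present in the webtable but absent from the index."""
    return sorted(set(webtable_urls) - set(indexed_urls))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical inputs: argv[1] holds the URLs taken from the webtable dump,
    # argv[2] holds the URLs exported from the ElasticSearch index.
    for url in missing_from_index(load_urls(sys.argv[1]), load_urls(sys.argv[2])):
        print(url)
```

The URLs it prints are only candidates; you would still need to look at their entries in the dump to tell whether they are redirects or something else.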

