Hi Matt

> > The fetch step is likely to take most of the time, and the time it takes
> > is mostly a matter of the distribution of hosts/IPs/domains in your
> > fetchlist. Search the wiki for details on performance tips.
>
> Thanks. Most of the URLs that I'm fetching are each on their own
> IP/host and unique server.
>

Ok, in that case per-host politeness limits shouldn't be the bottleneck, so
you might want to use a large number of threads (fetcher.threads.fetch).
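
As a rough sketch, the property could be overridden in conf/nutch-site.xml
along these lines. The value 50 is only an illustrative assumption, not a
recommendation; the right number depends on your bandwidth and hardware.

```xml
<!-- conf/nutch-site.xml: raise the total number of fetcher threads. -->
<!-- The value below is an assumed example; tune it for your setup. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
</property>
```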

[...]


> >> * Why would HBase show 64,000 documents but ElasticSearch only 50,000?
> >
> > Redirections? That sounds like quite a lot, though.
>
> Any thoughts on how I would identify which are redirects?
>

Try using 'nutch readdb' to dump the content of the webtable and inspect
the URLs.
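
Once you have the dump, one way to narrow things down might be to diff the
URLs in the webtable against the ones in the index. Below is a minimal sketch
in Python. It assumes you have already extracted both URL lists yourself (no
particular dump format is assumed), and the function name and sample URLs are
hypothetical.

```python
def missing_from_index(webtable_urls, indexed_urls):
    """Return the URLs found in the webtable but not in the index, sorted."""
    indexed = set(indexed_urls)
    # De-duplicate the webtable side too, so each URL is reported once.
    return sorted(url for url in set(webtable_urls) if url not in indexed)


# Hypothetical sample data, for illustration only.
webtable = ["http://a.example/", "http://b.example/old", "http://b.example/new"]
index = ["http://a.example/", "http://b.example/new"]
print(missing_from_index(webtable, index))
```

The URLs that only show up on the webtable side are the ones worth checking
for redirects (or for other reasons they were never indexed).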

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
