Hello Sachin,
You might want to check out the fetcher.* settings in your configuration. They
control the total number of fetcher threads, how URLs are queued, the delay
between successive fetches to the same host, how many threads serve each queue,
and so on.
Keep in mind: if you do not own the server or do not have explicit permission, it
is best to stay with the polite defaults.
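If you do want to adjust them, these properties are overridden in conf/nutch-site.xml. A minimal sketch (the property names are standard Nutch fetcher settings; the values shown are just the shipped defaults, not recommendations):

```xml
<!-- Illustrative overrides in conf/nutch-site.xml; values are examples only. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>Total number of fetcher threads.</description>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>Threads allowed to fetch from the same queue (host) at once.</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between successive fetches to the same server.</description>
</property>
```

Raising these beyond the defaults makes the crawl less polite, so check the target site's robots.txt and your permission to crawl before changing anything.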
OK understood.
I am using the Nutch defaults, and they are set optimally, especially for polite
crawling.
I am indeed crawling just one host right now, and given the defaults the
throughput is what it should be.
Yes, one need not be aggressive here; just be patient.
I think nowhere in the near future.
Hi Sachin,
> What I have observed is that it usually fetches, parses and indexes
> 1800 web pages.
This means 10 pages per minute.
How are the 1800 pages distributed over hosts?
The default delay between successive fetches to the same host is
5 seconds. If all pages belong to the same host, that caps the crawl at
60 / 5 = 12 pages per minute,
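The arithmetic behind this can be sketched as follows (the 5-second delay is Nutch's documented fetcher.server.delay default; the ~10 pages/minute figure comes from the message above):

```python
# Back-of-the-envelope check of single-host crawl throughput.
delay_s = 5.0                        # default delay between fetches to one host
max_per_minute = 60 / delay_s        # hard ceiling: 12 pages/minute per host
observed_per_minute = 10             # reported rate from the thread

print(max_per_minute)                # 12.0
# The observed rate sits just under the ceiling, so the crawler is
# behaving as expected rather than being slow.
print(observed_per_minute <= max_per_minute)  # True
```

So for a single-host crawl with default politeness, ten pages a minute is close to the best one can expect.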