OK understood.
I am using the Nutch defaults, and they are set optimally, especially for
polite crawling.
Right now I am indeed crawling just one host, and given the defaults the
throughput is what it should be.
Yes, one need not be aggressive here; it is better to just be patient.
I think nowhere in the near future
Hi Sachin,
> What I have observed is that it usually fetches, parses and indexes
> 1800 web pages.
This means 10 pages per minute.
How are the 1800 pages distributed over hosts?
The default delay between successive fetches to the same host is
5 seconds. If all pages belong to the same host,
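(A back-of-the-envelope check, assuming everything does come from a single
host and the stock 5-second fetcher.server.delay; the numbers are only an
estimate:

  60 s / 5 s delay            = at most 12 fetches per minute from one host
  12 pages/min * 180 minutes  = roughly 2160 pages per fetch cycle

so seeing about 1800 pages fetched in a 180-minute fetch round is close to
what the politeness delay allows.)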
Hello Sachin,
You might want to check out the fetcher.* settings in your configuration. They
control how many threads run in total, how URLs are queued, what the delay
between successive fetches is, how many threads per queue, etc.
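As a sketch (the property names are from conf/nutch-default.xml; the values
shown are, as far as I remember, the stock defaults, so treat them as
illustration rather than recommendation), these can be overridden in
nutch-site.xml:

  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>        <!-- total number of fetcher threads -->
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>         <!-- threads allowed on one queue (host) at a time -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>       <!-- seconds between successive requests to the same host -->
  </property>
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>    <!-- group URLs into queues byHost, byDomain or byIP -->
  </property>

Raising the threads per queue or lowering the delay makes the crawl noticeably
less polite, which is exactly the point below.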
Keep in mind, if you do not own the server or have no explicit permission, it
Hi,
I understand the point.
I would also like to run Nutch on my local machine.
So far I have been running in standalone mode with the default crawl script,
where the fetch time limit is 180 minutes.
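(For reference, and if I read the stock bin/crawl script correctly, that
180-minute cap is passed to the fetcher as fetcher.timelimit.mins, so it can
also be changed or disabled through nutch-site.xml; the value below is just an
example:

  <property>
    <name>fetcher.timelimit.mins</name>
    <value>180</value>   <!-- stop the fetch phase after this many minutes; -1 means no limit -->
  </property>
)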
What I have observed is that it usually fetches, parses and indexes 1800
web pages.
I am basically fetching the
Hello Sachin,
Nutch can run on Amazon AWS without trouble, and probably on any Hadoop-based
provider. This is the most expensive option you have.
Cheaper would be to rent some servers and install Hadoop yourself; getting it
up and running by hand on some servers will take the better part of a
Hi,
I have been running Nutch in local mode, and so far I have a good
understanding of how it all works.
I wanted to start with distributed crawling using some public cloud
provider.
I just wanted to know if fellow users have any experience in setting up
Nutch for distributed crawling.