Hi,
I want to politely crawl a site with 1-2 million pages. With the speed of
about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop,
and can I coordinate the crawlers so as not to cause a DOS attack?
I know that URLs from one domain are assigned to one fetch segment, and
2009/11/16 Mark Kerzner markkerz...@gmail.com:
Hi,
I want to politely crawl a site with 1-2 million pages. With the speed of
about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop,
and can I coordinate the crawlers so as not to cause a DOS attack?
Nutch basically uses
Alex,
Thank you for the answer. As for your last question - no, I don't own that
site. I am looking for a specific type of information, and that is the first site
I want to crawl.
Mark
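
As a rough check on the "it will take weeks" estimate in the original question, the arithmetic can be sketched as below. This is only a back-of-the-envelope calculation, assuming pages are fetched one at a time from the single host at a fixed interval, with no retries or failures; the function name `crawl_days` is made up for illustration and is not part of Nutch or Hadoop.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(pages, seconds_per_fetch):
    """Days needed to fetch `pages` one after another at a fixed interval.

    Assumption: a single polite fetch stream against one host, no retries.
    """
    return pages * seconds_per_fetch / SECONDS_PER_DAY


# Bounds taken from the question: 1-2 million pages, 1-2 seconds per fetch.
best = crawl_days(1_000_000, 1.0)
worst = crawl_days(2_000_000, 2.0)

print(f"best case:  {best:.1f} days")   # best case:  11.6 days
print(f"worst case: {worst:.1f} days")  # worst case: 46.3 days
```

So under these assumptions the crawl takes roughly two to seven weeks, which matches the estimate in the question.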
On Mon, Nov 16, 2009 at 1:54 PM, Alex McLintock alex.mclint...@gmail.com wrote:
2009/11/16 Mark Kerzner
Mark Kerzner wrote:
Hi,
I want to politely crawl a site with 1-2 million pages. With the speed of
about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop,
and can I coordinate the crawlers so as not to cause a DOS attack?
Your Hadoop cluster does not increase the
ROFL
Thank you very much, Andrzej
On Mon, Nov 16, 2009 at 2:07 PM, Andrzej Bialecki a...@getopt.org wrote:
Mark Kerzner wrote:
Hi,
I want to politely crawl a site with 1-2 million pages. With the speed of
about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on
Hadoop,
and