Scalability for one site

2009-11-16 Thread Mark Kerzner
Hi, I want to politely crawl a site with 1-2 million pages. With the speed of about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and can I coordinate the crawlers so as not to cause a DOS attack? I know that URLs from one domain are assigned to one fetch segment, and
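The "weeks" estimate in the question can be checked with a quick back-of-the-envelope calculation. The sketch below assumes fetches against the single host are strictly sequential, which is what a polite crawl implies; the function name `crawl_days` is hypothetical and is not part of Nutch or Hadoop.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(pages: int, seconds_per_fetch: float) -> float:
    """Days needed to fetch `pages` one after another from a single host."""
    return pages * seconds_per_fetch / SECONDS_PER_DAY


# Bounds taken from the figures in the question: 1-2 million pages, 1-2 s per fetch.
best = crawl_days(1_000_000, 1.0)
worst = crawl_days(2_000_000, 2.0)

print(f"best case:  {best:.1f} days")   # best case:  11.6 days
print(f"worst case: {worst:.1f} days")  # worst case: 46.3 days
```

Under these assumptions the crawl takes roughly two to seven weeks, and the total is set by the per-host delay rather than by how many machines run the crawl.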

Re: Scalability for one site

2009-11-16 Thread Alex McLintock
2009/11/16 Mark Kerzner markkerz...@gmail.com: Hi, I want to politely crawl a site with 1-2 million pages. With the speed of about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and can I coordinate the crawlers so as not to cause a DOS attack? Nutch basically uses

Re: Scalability for one site

2009-11-16 Thread Mark Kerzner
Alex, Thank you for the answer. As for your last question: no, I don't own that site. I am looking for a specific type of information, and that is the first site I want to crawl. Mark On Mon, Nov 16, 2009 at 1:54 PM, Alex McLintock alex.mclint...@gmail.com wrote: 2009/11/16 Mark Kerzner

Re: Scalability for one site

2009-11-16 Thread Andrzej Bialecki
Mark Kerzner wrote: Hi, I want to politely crawl a site with 1-2 million pages. With the speed of about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and can I coordinate the crawlers so as not to cause a DOS attack? Your Hadoop cluster does not increase the

Re: Scalability for one site

2009-11-16 Thread Mark Kerzner
ROFL Thank you very much, Andrzej On Mon, Nov 16, 2009 at 2:07 PM, Andrzej Bialecki a...@getopt.org wrote: Mark Kerzner wrote: Hi, I want to politely crawl a site with 1-2 million pages. With the speed of about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and