Re: Scalability for one site

2009-11-16 Thread Alex McLintock
2009/11/16 Mark Kerzner markkerz...@gmail.com:
 Hi,

 I want to politely crawl a site with 1-2 million pages. At about 1-2
 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and
 can I coordinate the crawlers so as not to cause a DoS attack?

Nutch basically runs on Hadoop (possibly a somewhat older version of it),
so yes, it can run on a Hadoop-style cluster.
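
For reference, around Nutch 1.0 submitting the whole crawl as a Hadoop job
looked roughly like this; the job file name, seed directory and the
-depth/-topN values below are placeholders, not a recommendation:

  # submit the bundled crawl job to the Hadoop cluster
  bin/hadoop jar nutch-1.0.job org.apache.nutch.crawl.Crawl urls \
      -dir crawl -depth 10 -topN 100000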

I *think* the way the fetch list is split up will put all URLs from one
site on a single node, leaving you back at square one.

However, I would say that one fetch per second is already quite polite,
and anything faster is a bit rude, so I fail to see what you gain by
using multiple machines...




 I know that URLs from one domain are assigned to one fetch segment, and
 polite crawling is enforced. Should I use lower-level parts of Nutch?

Do you own the site being crawled?


Re: Scalability for one site

2009-11-16 Thread Mark Kerzner
Alex,

Thank you for the answer. As for your last question: no, I don't own that
site. I am looking for a specific type of information, and that is the first
site I want to crawl.

Mark


Re: Scalability for one site

2009-11-16 Thread Andrzej Bialecki

Mark Kerzner wrote:

Hi,

I want to politely crawl a site with 1-2 million pages. At about 1-2
seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and
can I coordinate the crawlers so as not to cause a DoS attack?


Your Hadoop cluster does not increase the scalability of the target
server, and that's the crux of the matter: whether you use Hadoop or
not, multiple threads or a single thread, if you want to be polite you
will be able to make just 1 req/sec, and that's it.
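
For reference, the politeness knobs live in conf/nutch-site.xml; a minimal
sketch (property names from the stock Nutch configuration, values only
illustrative):

  <!-- conf/nutch-site.xml: politeness settings -->
  <property>
    <name>fetcher.server.delay</name>
    <!-- seconds to wait between successive requests to the same host -->
    <value>1.0</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <!-- keep a single connection per host; raising this gets impolite fast -->
    <value>1</value>
  </property>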


You can prioritize certain pages for fetching so that you get the most 
interesting pages first (whatever interesting means).
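
One place to do that is the generate step, which builds the fetch list from
the highest-scoring URLs first; a sketch (paths and the -topN value are
placeholders):

  # build the next fetch list from only the 100,000 top-scoring URLs
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000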



I know that URLs from one domain are assigned to one fetch segment, and
polite crawling is enforced. Should I use lower-level parts of Nutch?


The built-in limits are there to avoid causing pain for inexperienced
search engine operators (and for the webmasters who are their victims).
The source code is there; if you choose, you can modify it to bypass
these restrictions. Just be aware of the consequences (and don't use
Nutch as your user agent ;) ).
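
(Whatever you do, set your own agent string in conf/nutch-site.xml; the
value below is just a placeholder:)

  <property>
    <name>http.agent.name</name>
    <!-- identify your crawler and yourself, not Nutch -->
    <value>MyResearchCrawler</value>
  </property>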


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Scalability for one site

2009-11-16 Thread Mark Kerzner
ROFL

Thank you very much, Andrzej
