Hi,

I won't try to estimate the size of the public internet but I may have some 
useful figures. A standard dual-core machine with 2 GB RAM can process about 15 
records per second in ideal conditions with a parsing fetcher and without 
storing content. This doesn't include indexing time, webgraph building or 
LinkRank calculation, so in practice we could achieve only about 10 records 
per second on average.

Another cluster, with 16 cores and 16 GB RAM per machine, gives much better 
results, so it's not just more hardware that helps but more powerful hardware 
as well. With it we could, in ideal conditions, fetch and parse about 500 
records per machine per second. When the other jobs are taken into account it 
drops to an average of about 300 records per second per machine.

Under normal conditions it is between 150 and 250 records per second per 
machine. With these figures you would have covered only a fraction of the 
internet after a year, without ever revisiting a page, even if you had a 
hundred powerful machines.

It's also impossible to do with a stock Nutch setup, as you will quickly run 
into a lot of trouble with useless pages and crawler traps. Another very 
significant problem is duplicate websites, such as www and non-www versions of 
the same pages, and these duplicates come in many more exotic varieties. You 
also have to manage extremely large blacklists (many millions of entries) of 
dead hosts and keep those from polluting your CrawlDB; the number of dead URLs 
can quickly grow very large.
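
Just to illustrate the kind of filtering involved (a minimal sketch in Python, 
not Nutch's own URLFilter/normalizer plugin API; the host list and trap 
patterns are made-up examples), something like this has to run on every 
candidate URL before it gets anywhere near the CrawlDB:

  import re
  from urllib.parse import urlsplit

  # Illustrative trap patterns: session ids and endlessly repeating path segments.
  TRAP_PATTERNS = [
      re.compile(r"[?&](?:jsessionid|phpsessid|sid)=", re.I),
      re.compile(r"(/[^/]+)\1{3,}"),   # same path segment repeated over and over
  ]
  # In practice this blacklist holds many millions of dead hosts.
  DEAD_HOSTS = {"dead-host.example"}

  def normalized_host(url):
      """Collapse www/non-www duplicates by stripping a leading 'www.'."""
      host = urlsplit(url).hostname or ""
      return host[4:] if host.startswith("www.") else host

  def accept(url):
      """Reject overly long URLs, dead hosts and trap-looking URLs."""
      if len(url) > 512:
          return False
      if normalized_host(url) in DEAD_HOSTS:
          return False
      return not any(p.search(url) for p in TRAP_PATTERNS)

In Nutch itself this sort of thing lives in the URL filter and normalizer 
plugins (e.g. regex-urlfilter.txt), but the point is the same: without 
aggressive filtering the CrawlDB fills up with junk.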

Crawling the internet means managing a lot of crap.

Good luck
Markus
 
-----Original message-----
> From:Ryan L. Sun <[email protected]>
> Sent: Mon 13-Aug-2012 20:58
> To: [email protected]
> Subject: WWW wide crawling using nutch
> 
> Hi all,
> 
> I'm looking for some estimate/stat regarding WWW wide crawling using
> nutch (or 10%/20% of WWW). What kind of hardware do you need and how
> long it takes to finish one round of search?
> 
> TIA.
> 
