Hi,

I won't try to estimate the size of the public internet, but I may have some useful figures. A standard dual-core machine with 2GB RAM can process about 15 records per second under ideal conditions, using a parsing fetcher and without storing content. But this doesn't include indexing time, webgraph building or linkrank calculation, so we could achieve only about 10 records per second on average.
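To put that per-machine figure in perspective, here is the back-of-envelope arithmetic; a minimal sketch that assumes nothing beyond the averaged 10 records/second rate quoted above:

// Back-of-envelope: what ~10 records/second per machine adds up to in a year.
// A rough sketch using only the averaged figure above; not a benchmark.
public class CrawlBudget {
    public static void main(String[] args) {
        double recordsPerSecond = 10.0;            // averaged rate per dual-core machine
        long secondsPerYear = 365L * 24 * 60 * 60; // ~31.5 million seconds
        double perMachinePerYear = recordsPerSecond * secondsPerYear;
        // ~315 million pages per machine per year, before any revisits
        System.out.printf("One machine, one year: ~%.0f million pages%n",
                perMachinePerYear / 1_000_000);
    }
}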
Another cluster, with 16 cores and 16GB RAM per machine, gives much better results, so it's not just more hardware that helps but more powerful hardware as well. With it we could, under ideal conditions, fetch and parse about 500 records per machine per second. Taking the other jobs into account, that drops to an average of 300 records per second per machine; under normal conditions it is between 150 and 250.

With these figures you would still have only a fraction of the internet after a year, without revisiting pages, even if you have a hundred powerful machines. It's also impossible to do with a standard Nutch setup, as you will quickly run into a lot of trouble with useless pages and crawler traps. Another very significant problem is duplicate websites, such as www and non-www pages, and these duplicates come in many more exotic varieties (a rough normalization sketch for the www case follows below the quoted message). You also have to manage extremely large blacklists (many millions of entries) of dead hosts and prevent them from polluting your CrawlDB; the set of dead URLs can quickly grow very large.

Crawling the internet means managing a lot of crap.

Good luck,
Markus

-----Original message-----
> From: Ryan L. Sun <[email protected]>
> Sent: Mon 13-Aug-2012 20:58
> To: [email protected]
> Subject: WWW wide crawling using nutch
>
> Hi all,
>
> I'm looking for some estimate/stat regarding WWW wide crawling using
> nutch (or 10%/20% of WWW). What kind of hardware do u need and how
> long it takes to finish one round of search?
>
> TIA.
>
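PS: the sketch mentioned above. This is only an illustration of collapsing www and non-www hosts onto one key before URLs reach your CrawlDB; the class and method names are my own, not Nutch's URLNormalizer plugin API, and it covers just one of the many exotic duplicate varieties.

import java.net.URI;
import java.net.URISyntaxException;

public class HostNormalizer {
    /** Lower-cases the host and strips a leading "www." so both variants map to one URL. */
    public static String normalize(String url) {
        try {
            URI uri = new URI(url);
            String host = uri.getHost();
            if (host == null) {
                return url; // leave schemeless or malformed URLs untouched
            }
            host = host.toLowerCase();
            if (host.startsWith("www.")) {
                host = host.substring(4);
            }
            URI normalized = new URI(uri.getScheme(), uri.getUserInfo(), host,
                    uri.getPort(), uri.getPath(), uri.getQuery(), uri.getFragment());
            return normalized.toString();
        } catch (URISyntaxException e) {
            return url; // keep the original if it cannot be parsed
        }
    }

    public static void main(String[] args) {
        // Both print http://example.com/page?id=1
        System.out.println(normalize("http://www.example.com/page?id=1"));
        System.out.println(normalize("http://example.com/page?id=1"));
    }
}

In practice you would run something like this (or the equivalent regex normalization rules) on every URL before injecting or updating, otherwise both variants of every site end up in your CrawlDB.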

