Re: Comparing Nutch and Common Crawl

Lewis John Mcgibbney Wed, 19 Dec 2012 05:37:44 -0800

Hi Julien,

I've been winding down my Nutch servers these last few weeks in prep for
moving away.
I would however be very interested in stepping up to also provide some
stats come the new year. I don't know how long you guys think this should
run for, but I am very keen to participate when I can, come early January.


Best

Lewis

On Mon, Dec 17, 2012 at 9:59 PM, Markus Jelsma
<[email protected]> wrote:

> Hi,
>
> Interesting indeed. Apart from our customers we operate a cluster of a few
> high octane machines for research purposes that crawls the entire internet
> as much as it physically can. We run a modified Nutch 1.x and some custom
> jobs that analyze the crawled data and allow us to crawl the internet more
> efficiently. The cluster is far too small to quickly read through all the
> data. We only have 80GB of RAM and 80 CPU cores, so reading the ~760GB
> crawldb, which contains about 5.7 billion records, takes about 40 minutes.
> Compiling a webgraph and calculating the page rank takes about 32 hours.
> Fetching and parsing are less intensive: at peak efficiency we can process
> over 700 pages per second, including the reduce phase time and job set up
> and clean up.
>
> We can provide all the information you would like to have.
>
> Cheers.
>
> -----Original message-----
> > From:Julien Nioche <[email protected]>
> > Sent: Mon 17-Dec-2012 22:00
> > To: [email protected]; [email protected]
> > Cc: Lisa Green <[email protected]>
> > Subject: Comparing Nutch and Common Crawl
> >
> > Hi,
> >
> > I was chatting with the people from the Common Crawl project (
> > www.commoncrawl.org <http://www.commoncrawl.org> ) this afternoon and we
> > thought it would be interesting to have some sort of comparison between
> > the space / memory / CPU requirements of their crawler and what it would
> > take to process a similar amount with Nutch 1.x and 2.x. The aim is not
> > so much to prove that one system is superior to the other (they both have
> > their pluses and minuses) but to get a better picture of the situation.
> >
> > One way to do this would be to gather stats from Nutch users operating
> > large crawls. Alternatively one could push the content of the CC dataset
> > into e.g. Nutch 2 on Hbase to see how much space it would take and how
> > the crawl would fare on that. I am pretty sure that would reveal all
> > sorts of interesting issues and would be a good thing to do to test the
> > Nutch + Gora stack.
> >
> > Would anyone be interested in sharing their stats? Anyone with spare
> > time and machine to populate a crawldb with the CC dataset and get some
> > stats?
> >
> > Thanks
> >
> > Julien
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/ <http://digitalpebble.blogspot.com/>
> > http://www.digitalpebble.com <http://www.digitalpebble.com>
> > http://twitter.com/digitalpebble <http://twitter.com/digitalpebble>
> >
> >
>
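
As a rough back-of-envelope reading of the figures Markus quotes above, the
sketch below derives the average crawldb record size, the crawldb read rate
and the daily page count at the peak fetch rate. It assumes decimal
gigabytes and that the 40 minutes covers one full pass over the crawldb; the
constant and function names are made up for illustration and are not part of
Nutch.

```python
# Figures taken from the quoted message; all are approximate.
CRAWLDB_BYTES = 760e9         # ~760GB crawldb (decimal gigabytes assumed)
CRAWLDB_RECORDS = 5.7e9       # about 5.7 billion records
READ_SECONDS = 40 * 60        # about 40 minutes for one full read (assumed)
PEAK_PAGES_PER_SECOND = 700   # peak fetch and parse rate
SECONDS_PER_DAY = 24 * 60 * 60


def crawldb_read_stats():
    """Derive average record size and read throughput from the totals."""
    return {
        "bytes_per_record": CRAWLDB_BYTES / CRAWLDB_RECORDS,
        "records_per_second": CRAWLDB_RECORDS / READ_SECONDS,
        "mb_per_second": CRAWLDB_BYTES / READ_SECONDS / 1e6,
    }


def peak_pages_per_day():
    """Pages per day if the peak rate could be sustained (hypothetical)."""
    return PEAK_PAGES_PER_SECOND * SECONDS_PER_DAY


if __name__ == "__main__":
    for name, value in crawldb_read_stats().items():
        print(f"{name}: {value:,.1f}")
    print(f"peak_pages_per_day: {peak_pages_per_day():,}")
```

Under those assumptions this works out to about 133 bytes per crawldb
record, roughly 2.4 million records read per second (around 317 MB/s across
the cluster), and about 60 million pages per day at the peak rate.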



-- 
*Lewis*
