Hi,

----- Original Message ----
> From: Ken Krugler <[email protected]>
> To: [email protected]
> Sent: Wed, March 9, 2011 12:26:59 PM
> Subject: Re: EC2 storage needs for 500M URL crawl?
> 
> 
> On Mar 9, 2011, at 8:45am, Otis Gospodnetic wrote:
> 
> > Hi,
> > 
> > Here's another Q about resource requirements for a wide, large-scale crawl
> > on EC2 - primarily storage and bandwidth needs.
> > Please correct any mistakes you see.
> > I'll use 500M pages as the crawl target.
> > I'll assume 10 KB/page on average.
> 
> It depends on what you're crawling, e.g. a number closer to 40 KB/page is what
> we've seen for text/HTML + images.
> 
> > 500M pages * 10 KB/page = 5000 GB, which is 5 TB
> 
> For our 550M page crawl, we pulled 21 TB.

OK, so 40 KB/page.  How things have changed.... :)
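
For reference, here is a rough sketch of the raw-size arithmetic in Python. The
helper names are made up for illustration, and decimal units are assumed
(1 TB = 1e9 KB), as in the figures quoted in this thread.

```python
# Back-of-the-envelope raw crawl size, decimal units (1 TB = 1e9 KB).
# raw_crawl_tb and implied_kb_per_page are hypothetical helper names.

def raw_crawl_tb(pages, kb_per_page):
    """Raw fetched data in TB for a given page count and average page size."""
    return pages * kb_per_page / 1e9

def implied_kb_per_page(total_tb, pages):
    """Average KB/page implied by an observed crawl size."""
    return total_tb * 1e9 / pages

print(raw_crawl_tb(500e6, 10))   # the original 10 KB/page assumption
print(raw_crawl_tb(500e6, 40))   # at roughly 40 KB/page
print(round(implied_kb_per_page(21, 550e6), 1))  # from the 21 TB / 550M page crawl
```

At 40 KB/page the 500M page target comes out at about 20 TB of raw data, in
line with the 21 TB reported for 550M pages.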

> > 5 TB is the size of just the raw fetched pages.
> >
> > Q:
> > - What about any overhead besides the obvious replication factor, such as sizes
> > of linkdb and crawldb, any temporary data, any non-raw data in HDFS, and such?
> > - If parsed data is stored in addition to raw data, can we assume the parsed
> > content will be up to 50% of the raw fetched data?
> > 
> > Here are some calculations:
> >
> > - 50 small EC2 instances at $0.085/hour give us 160 GB * 50 = 8 TB for $714/week
> > - 50 large EC2 instances at $0.34/hour give us 850 GB * 50 = 42 TB for $2856/week
> > (we can lower the cost by using Spot instances, but I'm just trying to keep this
> > simple for now)
> > 
> > Sounds like either one needs more smaller instances (which should make fetching
> > faster) or one needs to use large instances to be able to store 500M pages + any
> > overhead.
> 
> If you're planning on parsing the pages (sounds like it) then the m1.small
> instances are going to take a very long time - their disk I/O and CPU are
> pretty low-end.

Yeah, I can imagine! :)
But if your 550M page crawl pulled 21 TB of *raw*(?) data, then I have a feeling
that even 40 large EC2 instances won't have enough storage, right?
Would you recommend 75 of them (63 TB) or 100 of them (84 TB)?
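
A rough way to play with those instance counts, as a Python sketch. The
850 GB and $0.34/hour figures are the ones quoted in this thread. The
overhead_factor is purely an assumption: 3 here stands for HDFS's default
replication factor alone and ignores crawldb, linkdb, parsed and temporary
data, which is exactly the open question. The function names are hypothetical.

```python
import math

LARGE_DISK_GB = 850     # per-instance storage, as quoted in this thread
LARGE_PRICE_HR = 0.34   # on-demand $/hour, as quoted in this thread

def instances_needed(raw_tb, overhead_factor, disk_gb=LARGE_DISK_GB):
    """Instances required to hold raw_tb * overhead_factor (1 TB = 1000 GB)."""
    total_gb = raw_tb * 1000 * overhead_factor
    return math.ceil(total_gb / disk_gb)

def weekly_cost(n_instances, price_hr=LARGE_PRICE_HR):
    """On-demand cost in dollars for running n_instances for 7 full days."""
    return n_instances * price_hr * 24 * 7

# Assumption: 21 TB raw, replication factor 3, no other overhead.
n = instances_needed(21, 3)
print(n, round(weekly_cost(n)))
```

Under that assumption 21 TB of raw data needs 75 large instances, at roughly
$4284/week on demand. Any overhead beyond replication pushes the count higher.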

> > I'm assuming 42 TB is enough for that.... is it?
> >
> > Bandwidth is relatively cheap:
> > At $0.1 / GB for IN data, 5000 GB * $0.1 = $500
> 
> As per above, for us it was closer to $2100.
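
The same arithmetic for inbound bandwidth, sketched in Python. The $0.10/GB
rate is the one quoted above, the function name is hypothetical, and 1 TB is
taken as 1000 GB.

```python
def inbound_cost_usd(data_gb, usd_per_gb=0.10):
    """Inbound transfer cost in dollars at a flat per-GB rate."""
    return data_gb * usd_per_gb

print(round(inbound_cost_usd(5000)))    # the 5 TB estimate
print(round(inbound_cost_usd(21000)))   # the 21 TB actually pulled
```

That reproduces both figures: $500 for 5 TB and $2100 for 21 TB.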

Thanks!

Otis
