Hi,

----- Original Message ----
> From: Ken Krugler <[email protected]>
> To: [email protected]
> Sent: Wed, March 9, 2011 12:26:59 PM
> Subject: Re: EC2 storage needs for 500M URL crawl?
>
> On Mar 9, 2011, at 8:45am, Otis Gospodnetic wrote:
>
> > Hi,
> >
> > Here's another Q about a wide, large-scale crawl resource
> > requirements on EC2 - primarily storage and bandwidth needs.
> > Please correct any mistakes you see.
> > I'll use 500M pages as the crawl target.
> > I'll assume 10 KB/page on average.
>
> It depends on what you're crawling, e.g. a number closer to 40KB/page
> is what we've seen for text/HTML + images.
>
> > 500M pages * 10 KB/page = 5000 GB, which is 5 TB
>
> For our 550M page crawl, we pulled 21TB.

OK, so 40 KB/page. How things have changed.... :)

> > 5 TB is the size of just the raw fetched pages.
> >
> > Q:
> > - What about any overhead besides the obvious replication factor,
> >   such as sizes of linkdb and crawldb, any temporary data, any
> >   non-raw data in HDFS, and such?
> > - If parsed data is stored in addition to raw data, can we assume
> >   the parsed content will be up to 50% of the raw fetched data?
> >
> > Here are some calculations:
> >
> > - 50 small EC2 instances at 0.085/hour give us 160 GB * 50 = 8 TB
> >   for $714/week
> > - 50 large EC2 instances at 0.34/hour give us 850 GB * 50 = 42 TB
> >   for $2856/week
> > (we can lower the cost by using Spot instances, but I'm just trying
> > to keep this simple for now)
> >
> > Sounds like either one needs more smaller instances (which should
> > make fetching faster) or one needs to use large instances to be
> > able to store 500M pages + any overhead.
>
> If you're planning on parsing the pages (sounds like it) then the
> m1.small instances are going to take a very long time - their disk
> I/O and CPU are pretty low-end.

Yeah, I can imagine! :)
But if your 550M page crawl pulled 21 TB of *raw*(?) data, then I have a
feeling that even 40 large EC2 instances won't have enough storage,
right? Would you recommend 75 of them (63 TB) or 100 of them (84 TB)?

> > I'm assuming 42 TB is enough for that.... is it?
> >
> > Bandwidth is relatively cheap:
> > At $0.1 / GB for IN data, 5000 GB * $0.1 = $500
>
> As per above, for us it was closer to $2100.

Thanks!
Otis
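
P.S. For anyone who wants to redo the back-of-the-envelope math in this thread with their own numbers, here is a rough Python sketch. It only reproduces the arithmetic quoted above (page size, per-instance disk, hourly price, inbound transfer price). The function names are made up for illustration and are not part of Nutch or any AWS tool, decimal units are assumed (1 TB = 1000 GB), and overhead such as replication, crawldb/linkdb and parsed data is deliberately left out because the thread does not give numbers for it.

```python
# Back-of-the-envelope crawl sizing using the figures quoted in this thread.
# Assumption: decimal units (1 TB = 1000 GB = 10**9 KB), as used in the thread.
# All names below are hypothetical helpers, not a Nutch or AWS API.

HOURS_PER_WEEK = 24 * 7


def raw_storage_tb(pages, kb_per_page):
    """Size of the raw fetched pages in TB (no replication, no overhead)."""
    return pages * kb_per_page / 1e9


def cluster_storage_tb(instances, gb_per_instance):
    """Total local disk across the cluster in TB."""
    return instances * gb_per_instance / 1000


def cluster_cost_per_week(instances, usd_per_hour):
    """On-demand cost in USD of running the cluster for one week."""
    return instances * usd_per_hour * HOURS_PER_WEEK


def inbound_transfer_cost(raw_tb, usd_per_gb=0.10):
    """Cost in USD of pulling raw_tb of data in at the quoted per-GB price."""
    return raw_tb * 1000 * usd_per_gb


if __name__ == "__main__":
    pages = 500_000_000
    for kb in (10, 40):  # original assumption vs. the ~40 KB/page Ken reported
        tb = raw_storage_tb(pages, kb)
        cost = inbound_transfer_cost(tb)
        print(f"{kb} KB/page: {tb:.1f} TB raw, ${cost:,.0f} inbound transfer")

    # (label, instance count, GB of disk per instance, USD per hour)
    for label, n, gb, price in (("small", 50, 160, 0.085), ("large", 50, 850, 0.34)):
        disk = cluster_storage_tb(n, gb)
        weekly = cluster_cost_per_week(n, price)
        print(f"{n} {label}: {disk:.1f} TB disk, ${weekly:,.0f}/week")
```

With 10 KB/page this gives the 5 TB and $500 from the original mail; with 40 KB/page, 500M pages come to 20 TB of raw data, which is in line with the 21 TB Ken saw for 550M pages.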

