Hello,

On Wed, Mar 9, 2011 at 9:26 AM, Ken Krugler <[email protected]> wrote:

> On Mar 9, 2011, at 8:45am, Otis Gospodnetic wrote:
>
>> Hi,
>>
>> Here's another Q about a wide, large-scale crawl resource requirements
>> on EC2 - primarily storage and bandwidth needs.
>> Please correct any mistakes you see.
>> I'll use 500M pages as the crawl target.
>> I'll assume 10 KB/page on average.
>
> It depends on what you're crawling, e.g. a number closer to 40KB/page is
> what we've seen for text/HTML + images.

I would double or triple the page size estimate. I'm not sure what kind of
pages you are crawling, but there are some large pages out there. Take a
quick look at amazon.com or tripadvisor.com.

>> 500M pages * 10 KB/page = 5000 GB, which is 5 TB
>
> For our 550M page crawl, we pulled 21TB.
>
>> 5 TB is the size of just the raw fetched pages.
>>
>> Q:
>> - What about any overhead besides the obvious replication factor, such
>>   as sizes of linkdb and crawldb, any temporary data, any non-raw data
>>   in HDFS, and such?
>> - If parsed data is stored in addition to raw data, can we assume the
>>   parsed content will be up to 50% of the raw fetched data?
>>
>> Here are some calculations:
>>
>> - 50 small EC2 instances at 0.085/hour give us 160 GB * 50 = 8 TB for
>>   $714/week
>> - 50 large EC2 instances at 0.34/hour give us 850 GB * 50 = 42 TB for
>>   $2856/week
>> (we can lower the cost by using Spot instances, but I'm just trying to
>> keep this simple for now)

Money-wise, it was cheaper for us to use vps.net. We also didn't use vps.net
for storage. We have a rack elsewhere which stores all of our data, and we
used the cloud nature of vps.net just for crawling. It didn't make sense for
us to pay 4000 a month for something that costs $2000 as a one-time cost
plus a few hundred a month in colo.

>> Sounds like either one needs more smaller instances (which should make
>> fetching faster) or one needs to use large instances to be able to
>> store 500M pages + any overhead.
>
> If you're planning on parsing the pages (sounds like it) then the
> m1.small instances are going to take a very long time - their disk I/O
> and CPU are pretty low-end.
>
>> I'm assuming 42 TB is enough for that.... is it?
>>
>> Bandwidth is relatively cheap:
>> At $0.1 / GB for IN data, 5000 GB * $0.1 = $500
>
> As per above, for us it was closer to $2100.

vps.net bandwidth is included, so we saved a bundle there.

Paul

> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
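
P.S. For anyone who wants to rerun the numbers in this thread with their own inputs, here is a rough Python sketch of the arithmetic. The function names are made up for illustration, the units are decimal (1 TB = 1000 GB = 1e9 KB) as in the estimates above, and the `replication` parameter is my own assumption for the "obvious replication factor" (it defaults to 1, i.e. no replication). It ignores crawldb, linkdb, parsed data and temporary space, so treat the output as a lower bound rather than a sizing recommendation.

```python
def raw_crawl_tb(pages, kb_per_page):
    """Size of the raw fetched pages in TB (decimal units: 1 TB = 1e9 KB)."""
    return pages * kb_per_page / 1e9


def cluster_disk_tb(instances, gb_per_instance, replication=1):
    """Usable disk in TB across the cluster.

    `replication` is an assumption: the number of copies kept of each block.
    """
    return instances * gb_per_instance / 1000 / replication


def weekly_cluster_cost(instances, usd_per_hour):
    """On-demand cost in USD for one week (24 * 7 hours) of all instances."""
    return instances * usd_per_hour * 24 * 7


def inbound_bandwidth_cost(raw_tb, usd_per_gb=0.10):
    """Cost in USD of pulling `raw_tb` of data at a flat per-GB price."""
    return raw_tb * 1000 * usd_per_gb


# Page-size estimates mentioned in the thread: 10 KB (Otis) and 40 KB (Ken).
for kb in (10, 40):
    tb = raw_crawl_tb(500_000_000, kb)
    print(f"{kb} KB/page: {tb:.1f} TB raw, "
          f"${inbound_bandwidth_cost(tb):,.0f} inbound bandwidth")

# Instance options mentioned in the thread: (label, USD/hour, GB of disk).
for label, price, disk_gb in (("small", 0.085, 160), ("large", 0.34, 850)):
    print(f"50 {label}: {cluster_disk_tb(50, disk_gb):.1f} TB disk, "
          f"${weekly_cluster_cost(50, price):,.0f}/week")
```

Passing `replication=3` to `cluster_disk_tb` shows how quickly the usable space shrinks once each block is stored more than once, which is worth checking against the larger per-page estimate before settling on an instance count.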

