Hello,

On Wed, Mar 9, 2011 at 9:26 AM, Ken Krugler <[email protected]> wrote:

>
> On Mar 9, 2011, at 8:45am, Otis Gospodnetic wrote:
>
>> Hi,
>>
>> Here's another Q about the resource requirements of a wide, large-scale
>> crawl on EC2, primarily storage and bandwidth needs.
>> Please correct any mistakes you see.
>> I'll use 500M pages as the crawl target.
>> I'll assume 10 KB/page on average.
>>
>
> It depends on what you're crawling, e.g. a number closer to 40KB/page is
> what we've seen for text/HTML + images.


I would double or triple the page size estimate. I'm not sure what kind of
pages you are crawling, but there are some large pages out there. Take a
quick look at amazon.com or tripadvisor.com.
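
As a rough sketch of how the storage estimate moves with page size, the arithmetic can be written out as below. It uses the 500M page target and the 10 KB baseline from Otis's mail, the doubled and tripled figures suggested here, and Ken's 40 KB number. Decimal units are assumed (1 TB = 10^9 KB) to match the thread's own 5000 GB = 5 TB arithmetic, and the function name is made up for illustration.

```python
# Rough raw-storage estimate for a crawl, before replication or any
# crawldb/linkdb overhead. Decimal units are assumed (1 TB = 10**9 KB).

def raw_crawl_size_tb(pages, kb_per_page):
    """Return the raw fetched size in TB for `pages` pages."""
    return pages * kb_per_page / 10**9

PAGES = 500_000_000  # crawl target from the thread

for kb in (10, 20, 30, 40):  # baseline, doubled, tripled, Ken's figure
    print(f"{kb} KB/page -> {raw_crawl_size_tb(PAGES, kb):.0f} TB")
```

This only covers the raw fetched bytes; parsed data and replication would come on top of it.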

>
>
>> 500M pages * 10 KB/page = 5000 GB, which is 5 TB
>>
>
> For our 550M page crawl, we pulled 21TB.
>
>
>> 5 TB is the size of just the raw fetched pages.
>>
>> Q:
>> - What about any overhead besides the obvious replication factor, such as
>> the sizes of linkdb and crawldb, any temporary data, any non-raw data in
>> HDFS, and such?
>> - If parsed data is stored in addition to raw data, can we assume the
>> parsed content will be up to 50% of the raw fetched data?
>>
>> Here are some calculations:
>>
>> - 50 small EC2 instances at $0.085/hour give us 160 GB * 50 = 8 TB for
>> $714/week
>> - 50 large EC2 instances at $0.34/hour give us 850 GB * 50 = 42 TB for
>> $2856/week
>> (we can lower the cost by using Spot instances, but I'm just trying to
>> keep this simple for now)
>>
>
Money-wise, it was cheaper for us to use vps.net. We also didn't use vps.net
for storage; we had a rack elsewhere that stored all of our data, and we used
the cloud nature of vps.net just for crawling. It didn't make sense for us to
pay $4000 a month for something that costs $2000 as a one-time purchase plus
a few hundred a month in colo.
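
To make that comparison concrete, here is a minimal sketch of the cumulative cost of each option. The $4000/month and $2000 one-time figures are the ones above. The colo fee is only described as "a few hundred", so the 300 used here is a placeholder assumption, and the function names are illustrative.

```python
# Compare a hosted monthly fee with buying hardware once and paying colo.
# COLO_PER_MONTH is an assumed stand-in for "a few hundred a month".

HOSTED_PER_MONTH = 4000   # figure quoted above
HARDWARE_ONE_TIME = 2000  # figure quoted above
COLO_PER_MONTH = 300      # assumption, not a number from the thread

def hosted_cost(months):
    """Cumulative cost of paying the hosted fee for `months` months."""
    return HOSTED_PER_MONTH * months

def owned_cost(months):
    """Cumulative cost of owned hardware plus colo for `months` months."""
    return HARDWARE_ONE_TIME + COLO_PER_MONTH * months

for months in (1, 3, 6, 12):
    print(f"{months:>2} months: hosted ${hosted_cost(months)}, "
          f"owned ${owned_cost(months)}")
```

With these inputs the owned setup is already cheaper in the first month, and the gap widens every month after that.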

>
>> Sounds like either one needs more smaller instances (which should make
>> fetching faster) or one needs to use large instances to be able to store
>> 500M pages + any overhead.
>>
>
> If you're planning on parsing the pages (sounds like it) then the m1.small
> instances are going to take a very long time - their disk I/O and CPU are
> pretty low-end.
>
>
>> I'm assuming 42 TB is enough for that.... is it?
>>
>> Bandwidth is relatively cheap:
>> At $0.1 / GB for IN data, 5000 GB * $0.1 = $500
>>
>
> As per above, for us it was closer to $2100.
>
vps.net bandwidth is included, so we saved a bundle there.
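
For reference, the per-GB bandwidth arithmetic quoted above can be sketched as follows. The $0.10/GB inbound rate, the 5000 GB estimate, and Ken's 21 TB crawl are all from the thread; decimal units (1 TB = 1000 GB) and the function name are assumptions.

```python
# Inbound transfer cost at a flat per-GB rate (decimal units, 1 TB = 1000 GB).

def transfer_cost_usd(gigabytes, usd_per_gb=0.10):
    """Return the transfer cost in dollars, rounded to cents."""
    return round(gigabytes * usd_per_gb, 2)

print(transfer_cost_usd(5_000))   # Otis's 5 TB estimate
print(transfer_cost_usd(21_000))  # Ken's 21 TB crawl
```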

Paul

>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>