On Mar 9, 2011, at 11:52am, Paul Dhaliwal wrote:

Hello,

On Wed, Mar 9, 2011 at 9:26 AM, Ken Krugler <[email protected]> wrote:

On Mar 9, 2011, at 8:45am, Otis Gospodnetic wrote:

Hi,

Here's another Q about resource requirements for a wide, large-scale crawl on EC2 -
primarily storage and bandwidth needs.
Please correct any mistakes you see.
I'll use 500M pages as the crawl target.
I'll assume 10 KB/page on average.

It depends on what you're crawling, e.g. a number closer to 40KB/page is what we've seen for text/HTML + images.

I would double or triple the page size estimate. Not sure what kind of pages you are crawling, but there are some large pages out there. Take a quick look at amazon.com or tripadvisor.com.

There are large pages out there, but the _average_ size over 550M pages was about 40KB.

[snip]

I'm assuming 42 TB is enough for that... is it?
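
A quick back-of-envelope check on that number - a sketch in Python, using the ~40KB/page average mentioned above (the 3x replication factor is an assumption, not something stated in the thread):

    pages = 500_000_000   # crawl target from above

    # Compare the 10 KB/page assumption with the ~40 KB/page observed average.
    for kb_per_page in (10, 40):
        raw_tb = pages * kb_per_page * 1e3 / 1e12   # total KB -> bytes -> TB (decimal)
        print(f"{kb_per_page} KB/page: {raw_tb:.0f} TB raw, "
              f"{3 * raw_tb:.0f} TB at 3x replication")

    # 10 KB/page:  5 TB raw, 15 TB at 3x replication
    # 40 KB/page: 20 TB raw, 60 TB at 3x replication

So 42 TB comfortably holds the raw fetch even at 40KB/page, but not three replicated copies of everything.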

Bandwidth is relatively cheap:
At $0.10/GB for inbound data, 5,000 GB * $0.10 = $500

As per above, for us it was closer to $2100.
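
For comparison, the same arithmetic at the sizes above (a sketch; the exact per-page figure behind the $2100 isn't given here, so ~40KB/page over 550M pages is an assumption):

    rate = 0.10   # $/GB for inbound data, per the estimate above

    # Original estimate: 500M pages * 10 KB/page = 5,000 GB inbound
    print(f"${500e6 * 10e3 / 1e9 * rate:,.0f}")   # $500

    # At ~40 KB/page over 550M pages (assumed):
    print(f"${550e6 * 40e3 / 1e9 * rate:,.0f}")   # $2,200 - in the ballpark of $2100

which is why the real bill lands at roughly 4x the 10KB/page estimate.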
With vps.net, bandwidth is included, so we saved a bundle there.

Interesting, thanks for the ref.

So you immediately push data to your colo, which (I assume) also doesn't have much of a data-in cost?

When you get one VPS system (composed of, say, 12 nodes), how many virtual cores do you get? Is it also 12?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
