On Mar 9, 2011, at 11:52am, Paul Dhaliwal wrote:
Hello,
On Wed, Mar 9, 2011 at 9:26 AM, Ken Krugler <[email protected]> wrote:
On Mar 9, 2011, at 8:45am, Otis Gospodnetic wrote:
Hi,
Here's another Q about resource requirements for a wide, large-scale
crawl on EC2 - primarily storage and bandwidth needs.
Please correct any mistakes you see.
I'll use 500M pages as the crawl target.
I'll assume 10 KB/page on average.
It depends on what you're crawling, e.g. a number closer to 40KB/page
is what we've seen for text/HTML + images.
I would double/triple the page size estimate. Not sure what kind of
pages you are crawling, but there are some large pages out there.
Take a quick look at amazon.com or tripadvisor.com.
There are large pages out there, but the _average_ size over 550M
pages was about 40K.
[snip]
I'm assuming 42 TB is enough for that... is it?
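A quick back-of-the-envelope check, as a sketch in Python (assuming
the 500M-page target and the 10KB vs. 40KB per-page averages from
above, using decimal units; not a sizing guarantee):

    # Raw fetched-data volume at the two per-page size estimates.
    pages = 500_000_000
    for kb_per_page in (10, 40):
        tb = pages * kb_per_page / 1_000_000_000  # KB -> TB (decimal)
        print(f"{kb_per_page} KB/page -> {tb:.0f} TB of raw fetched data")
    # 10 KB/page -> 5 TB; 40 KB/page -> 20 TB. So 42 TB covers the raw
    # fetch either way, before any replication or derived data.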
Bandwidth is relatively cheap:
At $0.10/GB for inbound data, 5000 GB * $0.10 = $500
As per above, for us it was closer to $2100.
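Extending the same arithmetic (a sketch; the 550M pages at ~40KB/page
are the figures quoted above, priced at the $0.10/GB inbound rate):

    # Inbound bandwidth cost at $0.10/GB for the two scenarios discussed.
    price_per_gb_in = 0.10
    for pages, kb_per_page in ((500_000_000, 10), (550_000_000, 40)):
        gb_in = pages * kb_per_page / 1_000_000  # KB -> GB (decimal)
        cost = gb_in * price_per_gb_in
        print(f"{pages / 1e6:.0f}M pages at {kb_per_page} KB/page: "
              f"{gb_in:,.0f} GB in -> ${cost:,.0f}")
    # 500M * 10 KB -> 5,000 GB -> $500
    # 550M * 40 KB -> 22,000 GB -> $2,200, in line with the ~$2100 above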
vps.net bandwidth is included, so we saved a bundle there.
Interesting, thanks for the ref.
So you immediately push data to your colo, which (I assume) also
doesn't have much of a data-in cost?
When you get one VPS system (composed of say 12 nodes), how many
virtual cores do you get? Is it also 12?
Thanks,
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g