Nigel Allen <[email protected]> writes:

> I need to formulate a DRP for a customer and thought that I would ask the
> slug for its collective wisdom.

[...]

> First thought was to max out the memory on two of the servers, one for normal
> running and one as a hot or warm standby, and then virtualize all of the
> servers onto the two machines.

If you can, an active/active setup across the two machines is much better: you
spend less money on idle hardware, and you never hit the passive machine with
a *sudden* load.

The reason that last point matters is that the time you discover, say, a disk
is failing is when you start putting load on it, not while it sits idle.
Guess when you really don't want to find out you have disk issues on your
second machine?

> An external consultant has already suggested doing this with VMware,
> installing the ESXi hypervisor on the two main servers and installing a NAS
> shared between the two systems (hot and cold) so that if the hot server
> fails, we can simply switch over to the cold server using the images from
> the NAS.

This would let you load-balance as well, which is quite nice.

> Couple of things concern me about this approach. The first is using VMWare
> rather than a GPL solution.

*shrug*  You say you plan to run Win32 under this; you are going to need
binary PV drivers for the disk and network to get acceptable performance
anyway, so you are already looking down the barrel of non-GPL software.

> The second is where we would install the NAS. Physically, the office space
> is all under one roof but half the building has concrete floors and half has
> wooden. (The hot server is in the wooden "main" office, while the cold
> server was to go in the concrete floor area. There is also a firewall (a
> real one) in between the two areas).

In your server room, connected by at least one Gigabit link to the servers.

Your replicated NAS, of course, lives in your DR location, wherever that is,
since you don't want a DR solution that works as long as the server room never
catches fire[1].


> Questions:
>
> 1) Can anyone offer any gotchas, regardless of how obvious they may seem to
>    you?

ESXi hardware support is exciting; make sure your hardware is actually on the
supported list before you buy.

Pay for commercial support on whatever solution you end up with.  At the end
of year one, think about dropping it, but keep it until then.

Test.  If you don't test this stuff routinely it will never, ever work when
you need it to.

You need PV disk and network drivers to get the performance you expect.

You don't need a PV kernel under Linux, though it probably doesn't hurt:
almost all the cost comes from the disk and network, and almost everything has
PV drivers for those.


Make sure you understand what happens if you pull the network from the (or an)
active machine without otherwise shutting it down.

Make sure you don't spend millions on the best servers, the best NAS, then
connect them together through a single network cable that gets cut, bringing
the entire thing to a grinding halt.


> 2) Is there a GPL solution that fits this scenario? Even if it's not a bare
>    metal hypervisor and needs an O/S. Remember it has to virtualize both Server
>    2003 and CentOS

KVM can do what you want, but I don't believe there are PV disk drivers
available that are open source.  You need those.

> 3) What's the minimum connection we would need between the NAS and the two
>    servers sharing it?

A 9600bps GSM modem, provided your users have very low expectations. ;)

More seriously: assuming your disk I/O use is low enough you could get away
with 100Mbit, but you really want Gigabit — and, given that you want to live
through failure, you want several Gigabit links between the NAS and the
servers so that a single *cable* failure doesn't take down your entire system.
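To put rough numbers on that, here is a back-of-envelope sketch of the usable
disk throughput you can expect over each link speed.  The 80% efficiency
figure is an assumption (a common rule of thumb for protocol overhead), not a
measurement of any particular NAS:

```python
# Rough ceiling on disk throughput over a single network link.
# The 0.8 efficiency factor is an assumed allowance for protocol
# overhead, not a measured figure.

def usable_mb_per_s(link_mbit, efficiency=0.8):
    """Approximate usable payload in megabytes per second."""
    return link_mbit / 8 * efficiency

for name, mbit in [("100Mbit", 100), ("Gigabit", 1000)]:
    print(f"{name}: ~{usable_mb_per_s(mbit):.0f} MB/s")
# 100Mbit: ~10 MB/s -- slower than a single local disk
# Gigabit: ~100 MB/s -- roughly one fast local disk
```

At ~10 MB/s a 100Mbit link is slower than a single local spindle, which is why
you really want Gigabit (and several of them) between the NAS and the servers.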


> 4) What kind of speed/bandwidth should we be looking at for the off-site
>    replication.

That depends entirely on how much data you write during normal operation, and
how big your acceptable window for data loss is.  Talk to the vendor of your
NAS for details.
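As a starting point before that conversation, a crude sizing sketch: the link
must sustain your average write rate with headroom for bursts.  The daily
write volume and burst factor below are placeholder assumptions you would
replace with measured numbers from your own workload:

```python
# Back-of-envelope off-site replication link sizing.  The write volume
# and burst factor are assumed inputs; measure your real workload and
# talk to your NAS vendor before buying a link.

def min_link_mbit(write_gb_per_day, burst_factor=3.0):
    """Minimum sustained link speed (Mbit/s) to keep up with daily
    writes, with multiplicative headroom for bursts."""
    avg_mbit = write_gb_per_day * 8 * 1024 / 86400  # GB/day -> Mbit/s
    return avg_mbit * burst_factor

# e.g. 20 GB of changed data per day:
print(f"~{min_link_mbit(20):.1f} Mbit/s")
```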

Generally, though, you want something with very low *latency* more than you
want something with very high bandwidth: having a *safe* write means that both
the local and DR NAS have acked the write as "being on disk".

If your latency is 10ms you have at least 10ms delay for every "safe" write.
If your latency is 20ms you double that, and cut your write performance in
half...

> I'll happily take anything else anyone would like to throw at this -
> suggestions, reading matter etc - it's not an area of great expertise for us
> having only paddled around the edges with Virtualbox.

This is *hard*.  Harder than it sounds.  Imagine it being as hard as you
think, and then it will likely be harder than that.

        Daniel

No, seriously, still harder.  Don't forget to test it, and expect to find
things go pear shaped and die *anyway* during normal running.

Footnotes: 
[1]  ...or the rack is stolen, or water leaks, or whatever else destroys the
     entire server room (or NAS location) in one fell swoop.

-- 
✣ Daniel Pittman            ✉ [email protected]            ☎ +61 401 155 707
               ♽ made with 100 percent post-consumer electrons
--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
