Ralf,

> Well, and what I want to say: if you place the bitmap volume on the same
> disk, this situation even gets worse. The problem is the involvement of
> SVM. Building the soft partition again makes the handling even more
> complex and the case harder to handle for operators. It's the best way
> to make sure that the disk will be replaced, but not added to the zpool
> during the night - and replacing it during regular working hours isn't
> an option either, because syncing 500 GB over a 1 GBit/s interface during
> daytime just isn't possible without putting the guaranteed service times
> at risk. Having to take care of soft partitions just isn't idiot-proof
> enough. And *poof* there's a good chance the TCO of an X4500 is
> considered too high.

You are quite correct in that increasing the number of data path
technologies (ZFS + AVS + SVM) increases the TCO, as the skills
required by everyone involved must increase proportionately. For the
record, using ZFS zvols for bitmap volumes does not scale: the
overhead of flipping bitmap bits generates far too many I/Os on raidz
or raidz2 storage pools, and even on a mirrored storage pool the cost
is high, because the COW semantics of ZFS make every bitmap update
expensive.
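
If the bitmaps are kept on plain devices instead, the dsbitmap(1M)
utility that ships with AVS can report how large a bitmap a given data
volume needs, and a small SVM soft partition is usually enough to hold
it. A minimal sketch, with placeholder device and metadevice names and
options quoted from memory (check the man page before use):

   # report the required Remote Mirror (SNDR) bitmap size for a data volume
   dsbitmap -r /dev/rdsk/c0t1d0s0

   # carve a small soft partition for the bitmap on a non-ZFS slice
   metainit d110 -p c0t0d0s7 64m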

>
>>> a) A disk in the primary fails. What happens? A HSP jumps in and
>>> 500 GB will be rebuilt. These 500 GB are synced over a single
>>> 1 GBit/s crossover cable. This takes a bit of time and is 100%
>>> unnecessary
>>
>>
>> But it is necessary! As soon as the HSP disk kicks in, not only is
>> the disk being rebuilt by ZFS, but newly allocated ZFS data will
>> also be written to this HSP disk. So although it may appear that
>> there is wasted replication cost (and there is), the instant that
>> ZFS writes new data to this HSP disk, the old replicated disk is
>> inconsistent, and there is no means to fix it.
> It's necessary from your point of view, Jim. But not in the minds of
> the customers. Even worse, it could be considered a design flaw - not
> in AVS, but in ZFS.

I wouldn't go so far as to say it is a design flaw. The fact that
AVS works with ZFS, and vice-versa, without either having knowledge
of the other's presence, says a lot for the I/O architecture of
Solaris. If there is a compelling advantage to interoperate, the
OpenSolaris community as a whole is free to propose a project, gather
community interest, and go from there. The potential of OpenSolaris
is huge, especially when it is riding a technology wave, like the
one created by the X4500 and ZFS.


> Just have a look how the usual Linux dude works. He doesn't use AVS,
> he uses a kernel module called DRBD. It does basically the same, it
> replicates one raw device to another over a network interface, like
> AVS does. But the Linux dude has one advantage: he doesn't have ZFS.
> Yes, as impossible as it may sound, it is an advantage. Why? Because
> he never has to mirror 40 or 46 devices, because his lame file
> systems depend on a hardware RAID controller! Same goes with UFS, of
> course. There's only ONE replicated device, no matter how many discs
> are involved.
> And so, it's definitely NOT necessary to sync a disc when a HSP kicks
> in, because this disc failure will never be reported to the host;
> it's handled by the RAID controller. As a result, no replication will
> take place, because AVS simply isn't involved. We even tried to
> deploy ZFS upon SVM RAID 5 stripes to get rid of this problem, just
> to learn how much the RAID 5 performance of SVM sucks ... a cluster
> of six USB sticks was faster than the Thumpers.

Instead of using SVM for RAID 5, to keep the volume count low,
consider concatenating 8 devices (RAID 0) into each of 5 separate SVM
volumes, then configuring both a ZFS raidz storage pool and AVS on
these 5 volumes. This avoids SVM performing software RAID 5 (RAID 0
is a low-overhead pass-through for SVM), and prior to giving the
entire SVM volume to ZFS, one can also carve the AVS bitmap volumes
from the same SVM configuration.
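
A minimal sketch of that layout, untested and with placeholder
metadevice names, disk names and sizes (an X4500's actual
controller/target numbering will differ):

   # One RAID 0 concatenation of 8 drives; repeat for d102 .. d105
   # with the remaining data disks.
   metainit d101 8 1 c0t0d0s0 1 c0t1d0s0 1 c0t2d0s0 1 c0t3d0s0 \
                  1 c0t4d0s0 1 c0t5d0s0 1 c0t6d0s0 1 c0t7d0s0

   # Split each concatenation into a small AVS bitmap soft partition
   # and a large data soft partition (sizes are illustrative).
   metainit d111 -p d101 64m
   metainit d121 -p d101 3500g

   # raidz storage pool across the five data soft partitions.
   zpool create tank raidz /dev/md/dsk/d121 /dev/md/dsk/d122 \
       /dev/md/dsk/d123 /dev/md/dsk/d124 /dev/md/dsk/d125

   # One SNDR set per data volume, marked equal with -E to avoid an
   # initial full sync; repeat for the remaining four sets.
   sndradm -E thumper1 /dev/md/rdsk/d121 /dev/md/rdsk/d111 \
              thumper2 /dev/md/rdsk/d121 /dev/md/rdsk/d111 ip async

This assumes the secondary host carries an identical SVM layout, so
the same d1xx names exist on both sides.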

> I consider this a big design flaw of ZFS. I'm not very familiar with
> the code, but I still have hope that there'll be a parameter which
> allows one to get rid of the cache flushes. ZFS, and the X4500, are
> typical examples of different departments not really working
> together, e.g. they have a wonderful file system, but there is no
> storage which supports it. Or a great X4500, an 11-24 TB file server
> for $40,000, but no options to make it highly available like the
> $1,000 boxes. AVS is, in my opinion, clearly one of the components
> which suffers from it. The Sun marketing and Jonathan still have a
> long way to go. But, on the other hand, difficult customers like me
> and my company are always happy to point out some difficulties and
> to help resolve them :-)

Sun does recognize the potential of both the X4500 and ZFS, and also
the difficulties (and problems) of combining them. It would be great
if there were a pre-existing technology (hardware, software, or both)
that just made this high availability issue go away, without adding
any complexity.

>
>> For all that is good (or bad) about AVS, the fact that it works by
>> simply interposing itself on the Solaris I/O data path is great, as
>> it works with any Solaris block storage. Of course this also means
>> that it has no filesystem, database or hot-spare knowledge, which
>> means that at times AVS will be inefficient at what it does.
>>
> I don't think that there's a problem with AVS and its concepts. In my
> opinion, ZFS has to do the homework. At least it should be aware of
> the fact that AVS is involved. Or has been, when it comes to
> recovering data from a zpool - simply saying "the discs belong
> exclusively to the local ZFS, and no other mechanisms can write onto
> the discs, so let's panic and lose all the terabytes of important
> data" just isn't valid. It may be easy and comfortable for the ZFS
> development department, but it doesn't reflect the real world - and
> not even Sun's software portfolio. The AVS integration into Nevada
> makes this even worse and I hope there'll be something like fsck in
> the future, something which allows me to recover the files with
> correct checksums from a zpool, instead of simply hearing the sales
> droids repeat "There can't be any errors, NEVER!" over and over
> again :-)

I don't think there is any single technology that is to blame here,
unless of course that technology is, as you put it, "Sun's software
portfolio". The "ZFS development department" has done an excellent
job in meeting, and exceeding, what they set out to accomplish, and
then more. They even offer remote file replication via send / recv.

What was not taken into consideration, and it is unclear where this
falls, is that any Solaris filesystem can be replicated by either
host-based or controller-based data services, and the need to assure
data consistency of that replicated filesystem. Concerned as you are
about system panics, ZFS is doing the correct thing in validating
checksums, and panicking Solaris under circumstances ZFS considers to
be data corruption. Do the same types of operations with other
filesystems, and these undetected writes are essentially silent data
corruption.

The fact that ZFS validates data on reads is powerful.
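
A scrub exercises exactly that end-to-end validation on demand, and
the per-vdev error counters show what it caught ("tank" is just a
placeholder pool name here):

   zpool scrub tank
   zpool status -v tank   # READ / WRITE / CKSUM counters per vdev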


>
>>
>>> - and it will become much worse in the future, because the disk
>>> capacities rocket up into the sky, while the performance isn't
>>> improved as much.
>>
>> Larger disk capacities are no worse in this scenario than they are
>> with controller-based replication, ZFS send / receive, etc. Actually
>> it is quite efficient. If the disk that failed was only 5% full,
>> when the HSP disk is switched in and being rebuilt, only 5% of the
>> entire disk will have to be replicated. If, at the time ZFS and AVS
>> were deployed on this server, the HSP disks (containing
>> uninitialized data) were also configured as equal with
>> "sndradm -E ...", then there would be no initial replication cost,
>> and when swapped into use, only the cost of replicating the actual
>> in-use ZFS data.
> That's interesting. Because, together with your "data and bitmap
> volume on the same disk" scenario, the bitmap volume would be lost. A
> full sync of the disc would be necessary then, even if only 5% are in
> use. Am I correct?

My scenario used SVM mirrored bitmaps for AVS, while the ZFS data is
protected by its raidz or mirrored storage pool. When one loses a
disk, SVM continues to use the other side of the mirror for the AVS
bitmaps, and ZFS uses the redundancy of its storage pool. When the
failed disk is replaced, SVM needs to resilver and ZFS needs to
rebuild, either on demand or via zpool scrub. All is good.
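
The recovery steps after the physical disk swap look roughly like
this; the metadevice, slice and pool names are placeholders, and the
sketch assumes the bitmap submirror and a ZFS vdev shared the
replaced disk:

   # once the new disk is partitioned, re-enable the replaced
   # component in the SVM mirror holding the bitmaps
   metareplace -e d20 c0t1d0s0

   # tell ZFS to rebuild onto the replacement slice, then verify
   zpool replace tank c0t1d0s3
   zpool scrub tank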

>
>>
>>> During this time, your service misses redundancy.
>>
>> Absolutely not. If all of the ZFS in-use and ZFS HSP disks are
>> configured under AVS, there is never a time of lost redundancy.
>>
> I'm sure there is, as soon as a disc crashed in the secondary and the
> primary disc is in logging mode for several hours. I bet you'll lose
> your HA as soon as the primary crashes before the secondary is in sync
> again, because the global ZFS metadata weren't logged, but updated.

Redundancy, based on my understanding, is recovery from a single
failure. What you allude to above is two (or more) failures, something
not covered by simple redundancy. The need to be able to recover from
multiple failures is clearly a known concept, hence the creation of
raidz2, knowing that losing two disks in raidz is bad news.

Using AVS to replicate a ZFS storage pool offers something AVS has
never had: the ability for ZFS to validate that AVS's replication was
indeed perfect. Drop the replica into logging mode, zpool import,
zpool scrub, zpool export, resume replication.
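
Spelled out as commands, that verification cycle might look like the
following; "tank" is a placeholder pool name, and the exact sndradm
invocations depend on whether the sets are addressed individually, by
I/O group (-g), or via a config file:

   # on the secondary host: drop the SNDR set(s) into logging mode
   sndradm -l -n

   # import the replica, let ZFS verify every block, then release it
   zpool import -f tank
   zpool scrub tank
   zpool status tank
   zpool export tank

   # back on the primary: resume replication with an update sync,
   # which copies only the blocks flagged in the bitmaps while
   # logging was active
   sndradm -u -n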

> I think to avoid this, the primary would have to send the entire
> replication group into logging mode - but then it would get even
> worse, because you'll lose your redundancy for days until the
> secondary is 100% in sync again and the regular replicating state
> becomes active (a full sync of an X4500 takes at least 5 days, and
> that's only when you don't have Sun Cluster with exclusive
> interconnect interfaces up and running).
>
> Linux/DRBD: Some data will be missing and you'll have fun fsck'ing
> for two hours.
> ZFS: The secondary is not consistent, the zpool is FAULTED, all data
> is lost, you have a downtime while recovering from backup tapes, plus
> a week with reduced redundancy because of the time needed for
> resyncing the restored data. You want three cluster nodes in most
> deployment scenarios, not just two, believe me ;-) It doesn't matter
> much if you only serve several easy-to-restore videos. But I'm
> talking about file servers which host several billion inodes, like
> the file servers which host the mail headers, bodies and attachments
> for a million Yahoo users, a terabyte of moving data each day which
> cannot be backed up to tape.
>
>>> And we're not talking about some minutes during this time. Well,
>>> and now try to imagine what will happen if another disk fails
>>> during this rebuild, this time in the secondary ...
>>
>> If I was truly counting on AVS, I would be glad this happened!
>> Getting replication configured right, be it AVS or some other
>> option, means that when disks, systems, networks, etc., fail, there
>> is always a period of degraded system performance, but it is better
>> than no system performance.
>>
> That's correct. But don't forget that it's always a very small step
> from availability scenarios in data centers, because in such
> scenarios you'll always have to rely on other people with less
> know-how and motivation. It's easy to accept a degraded state as long
> as you're in your office. But with an X4500, your degraded state may
> potentially last longer than a weekend, and when you're directly
> responsible for the mail of millions of users and you know that any
> non-availability will place your name on Slashdot (or the name of
> your CEO, which equals placing your head on a scaffold), I'm sure
> you'll think twice about using ZFS with AVS or letting the Linux
> dudes continue to play with their inefficient boxes :-)

All very valid points, and having reassurance that choices made today
will prove themselves valuable if and when degraded or faulted states
arise is key. I am a strong proponent of disaster recovery testing,
long before your company or CxOs sign off on a solution put into
production. You are right to question, and to arrive at your own
informed conclusions about the technologies you choose before
deployment.

>
>> But if a disaster happened on the primary node, and a decision was
>> made to import the ZFS storage pool on the secondary, ZFS will
>> detect the inconsistency, mark the drive as failed, and swap in the
>> secondary HSP disk. Later, when the primary site comes back, and a
>> reverse synchronization is done to restore writes that happened on
>> the secondary, the primary ZFS file system will become aware that a
>> HSP swap occurred, and continue on right where the secondary node
>> left off.
> I'll try that as soon as I have a chance again (which means: as soon
> as Sun gets the Sun Cluster working on an X4500).
>
>>> c) You *must* force every single `zpool import <zpool>` on the
>>> secondary host. Always.
>>
>> Correct, but this is the case even without AVS! If one configured
>> ZFS on SAN based storage and your primary node crashed, one would
>> need to force every single `zpool import <zpool>`. This is not an
>> AVS issue, but a ZFS protection.
> Right. Too bad ZFS reacts this way.
>
> I have to admit that you made me nervous once, when you wrote that
> forcing zpool imports would be a bad idea ...

I think there was some context to my prior statement, as in checking
the current state of replication before doing so. ;-)
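
For reference, the replication state can be checked from either node
before any forced import, using the AVS status tools (set selection
and output elided here):

   sndradm -P       # brief per-set status: logging, syncing, or replicating
   dsstat -m sndr   # live I/O and sync-progress statistics for SNDR sets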

>
> [X] Zfsck now! Let's organize a petition. :-)
>
>> Correct, but this is the case even without AVS! Take the same SAN
>> based storage scenario above, go to a secondary system on your SAN,
>> and force every single `zpool import <zpool>`.
>>
> Yes, but on a SAN, I don't have to worry about zpool inconsistency,
> because the zpool always resides on the same devices.

Point well taken.

>
>> In the case of a SAN, where the same physical disk would be written
>> to by both hosts, you would likely get complete data loss, but with
>> AVS, where ZFS is actually on two physical disks, and AVS is
>> tracking writes, even if they are inconsistent writes, AVS can and
>> will recover if an update sync is done.
> My problem is that there's no ZFS mechanism which allows me to verify
> the zpool consistency before I actually try to import it. Like I said
> before: AVS does it right, just ZFS doesn't (and otherwise it
> wouldn't make sense to discuss it on this mailing list anyway :-) ).
>
> It could really help me with AVS if there was something like "zpool
> check <zpool>", something for checking a zpool before an import. I
> could do a cronjob which puts the secondary host into logging mode,
> runs a "zpool check" and continues with the replication a few hours
> afterwards. Would let me sleep better and I wouldn't have to pray to
> the IT gods before an import. You know, I saw literally *hundreds* of
> kernel panics during my tests, that made me nervous. I have scripts
> which do the job now, but I saw the risks and the things which can go
> wrong if someone else without my experience does it (like the
> infamous "forgetting to manually place the secondary in logging mode
> before trying to import a zpool").

The issue is not the need for a "zpool check", or improvements to
"zpool import", or to ZFS itself. Each of these could validate the
storage pool as being 100% perfect, provided that at the moment they
run, ZFS on the primary node is not writing data which may be actively
replicating to the secondary node.

The problem (or lack of a feature) is that ZFS does not support
shared access to a single storage pool. ZFS on one node, seeing ZFS
writes issued by another node (whether via a dual-ported disk, SAN
disk, AVS replication, or controller-based replication), views these
writes and their checksum data as a form of data corruption, and
rightfully ZFS panics Solaris.

I know that the shared QFS filesystem supports careful, ordered
writes, which allows a shared QFS reader client to read (only) from an
active replica, whether AVS or controller-based replication. As with
QFS, given time, ZFS will evolve.

>
>> You are quite correct in that although ZFS is intuitively easy to
>> use, AVS is painfully complex. Of course the mindsets of AVS and ZFS
>> are as far apart as they are in the alphabet. :-O
>>
> AVS was easy to learn and isn't very difficult to work with. All you
> need is 1 or 2 months of testing experience. Very easy with UFS.
>
>> With AVS in Nevada, there is now an opportunity for leveraging the
>> ease of use of ZFS, with AVS. Being also the iSCSI Target project
>> lead, I see a lot of value in the ZFS option "set shareiscsi=on" to
>> get end users started with iSCSI.
>>
> Too bad the X4500 has too few PCI slots to consider buying iSCSI
> cards.

HBA manufacturers have in the past created multi-port and
multi-function HBAs. I would expect there to be something out there,
or out there soon, which will address the issue of limited PCI slots.


> The two existing slots are already needed for the Sun Cluster
> interconnect. I think iSCSI won't be a real option unless the servers
> are shipped with it onboard, like it has been done in the past with
> the SCSI or ethernet interfaces.
>
>> I would like to see "set replication=AVS:<secondary host>",
>> configuring a locally named ZFS storage pool to the same named pair
>> on some remote host. Starting down this path would afford things
>> like ZFS replication monitoring, similar to what ZFS does with each
>> of its own vdevs.
> Yes! Jim, I think we'll become friends :-) Who do I have to send the
> bribe money to?

Sun Microsystems, Inc., as in buying Sun Servers, Software, Storage
and Services.
Non-monetary offerings, in the form of being an active OpenSolaris
community member, are also highly valued.


>
> -- 
>
> Ralf Ramge
> Senior Solaris Administrator, SCNA, SCSA
>
> Tel. +49-721-91374-3963
> [EMAIL PROTECTED] - http://web.de/
>
> 1&1 Internet AG
> Brauerstraße 48
> 76135 Karlsruhe
>
> Amtsgericht Montabaur HRB 6484
>
> Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich,  
> Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang,  
> Achim Weiss
> Aufsichtsratsvorsitzender: Michael Scheeren

Jim Dunham
Solaris, Storage Software Group

Sun Microsystems, Inc.
1617 Southwood Drive
Nashua, NH 03063
Email: [EMAIL PROTECTED]
http://blogs.sun.com/avs



_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
