[jumping ahead and quoting myself]
AVS is not a mirroring technology, it is a remote replication technology.
So, yes, I agree 100% that people should not expect AVS to be a mirror.


Ralf Ramge wrote:
> [EMAIL PROTECTED] wrote:
>
>   
>>       War wounds?  Could you please expand on the why a bit more?
>>     
>
>
>
> - ZFS is not aware of AVS. On the secondary node, you'll always have to 
> force the `zpool import`, because the replicated metadata still marks 
> the pool as in use by the primary. No mechanism to prevent data loss 
> exists; e.g., zpools can be imported while the replicator is *not* in 
> logging mode.
>   

ZFS isn't special in this regard; AFAIK, all file systems, databases, and
other data stores suffer from the same issue with remote replication.
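
For illustration, the usual failover sequence on the secondary looks
something like this (untested sketch; "tank" and the "all sets" form of
sndradm are placeholders -- check sndradm(1M) for your configuration):

   # drop the SNDR set(s) into logging mode before touching the pool
   sndradm -n -l
   # the replicated labels still say "pool in use", so force the import
   zpool import -f tank

Note that nothing stops you from running the zpool import *before* the
sndradm -l, which is exactly the data-loss window Ralf describes.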

> - AVS is not ZFS-aware. For instance, if ZFS resilvers a mirrored disk, 
> e.g. after replacing a drive, the complete disk is sent over the network 
> to the secondary node, even though the replicated data on the secondary 
> is intact.
> That's a lot of fun with today's disk sizes of 750 GB and 1 TB drives, 
> typically resulting in 10+ hours without real redundancy (customers who 
> use Thumpers to store important data usually don't have the budget to 
> connect their data centers at 10 Gbit/s, so expect 10+ hours *per disk*).
>   

ZFS only resilvers allocated data.  Other LVMs, like SVM, will resilver
the entire disk, though.
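
If you want to see the effect Ralf describes, something like the
following sketch (the device name is hypothetical) shows the resilver
on the primary and the SNDR traffic it generates:

   # replace a failed disk; ZFS resilvers only the allocated blocks...
   zpool replace tank c1t5d0
   # ...but every block the resilver writes gets replicated, so watch
   # the remote mirror traffic at 5-second intervals
   dsstat -m sndr 5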

> - ZFS & AVS & X4500 leads to bad error handling. The zpool cannot be 
> imported on the secondary node while replication is running. The X4500 
> does not have a RAID controller which signals (and handles) drive faults. 
> Drive failures on the secondary node may go unnoticed until the 
> primary node goes down and you want to import the zpool on the 
> secondary node with the broken drive. Since ZFS doesn't offer a recovery 
> mechanism like fsck, data loss of up to 20 TB may occur.
> If you use AVS with ZFS, make sure you have storage which handles 
> drive failures without OS interaction.
>   

If this is the case, then array-based replication would be similarly
affected by this architectural problem.  In other words, if you say that
a software RAID system cannot be replicated by a software replicator,
then TrueCopy, SRDF, and other RAID array-based (also software)
replicators do not work either.  I think there is enough empirical
evidence that they do work.  I can see where there might be a best
practice here, but I see no fundamental issue.

fsck does not recover data; it only repairs metadata.
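
As for unnoticed drive failures: on the secondary you can still watch
the drives themselves, even with the pool not imported.  A minimal
sketch, using stock Solaris tools:

   # components FMA has diagnosed as faulted
   fmadm faulty
   # per-device soft/hard/transport error counters
   iostat -En

Running something like this from cron would at least surface a dead
drive before a failover forces you to discover it the hard way.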

> - 5 hours for scrubbing a 1 TB drive. If you're lucky. Up to 48 drives 
> in total.
>   

ZFS only scrubs allocated data, but it is not unusual for scrubbing a lot
of data to take a long time.  ZFS only performs read scrubs, so no
replication is required during a ZFS scrub unless data is repaired.
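
For reference, a scrub is driven entirely from the local pool ("tank"
is a placeholder):

   zpool scrub tank
   zpool status tank    # scrub progress, plus any blocks repaired

Only the repaired blocks, if any, turn into writes that AVS would need
to replicate.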

> - An X4500 has no battery-backed write cache. ZFS uses the server's 
> RAM as a cache, 15 GB+. I don't want to find out how much time a 
> resilver over the network after a power outage may take (a full reverse 
> replication would take up to 2 weeks and is not a valid option in a 
> serious production environment). But the underlying question I asked 
> myself is why I should want to replicate data in such an expensive way 
> when I consider the 48 TB of data itself not important enough to be 
> protected by a battery.
>   

ZFS will not be storing 15 GBytes of unflushed data on any system I can
imagine today.  While we can all agree that 48 TBytes will be painful to
replicate, that is not caused by ZFS -- though it is enabled by ZFS, because
some other file systems (UFS) cannot be as large as 48 TBytes.
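
The amount of dirty data is bounded by the transaction group interval,
which is on the order of seconds (the exact tunable varies by build),
not minutes.  You can watch the periodic txg commits yourself ("tank"
is a placeholder):

   # write bursts every few seconds are the txg commits flushing
   # dirty data out of RAM
   zpool iostat tank 1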

> - I gave AVS a set of 6 drives just for the bitmaps (using SVM soft 
> partitions). That wasn't enough; the replication was still very slow, 
> probably because of an insane amount of head movement, and it scales 
> badly. Putting the bitmap of a drive on the drive itself (if I remember 
> correctly, this is recommended in one of the most-referenced howto blog 
> articles) is a bad idea. Always use ZFS on whole disks if performance 
> and caching matter to you.
>   

I think there are opportunities for performance improvement, but I don't
know who is currently actively working on this.

Actually, the cases where giving ZFS whole disks is a big win are few.
And, of course, you can enable disk write caches by hand.
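
For reference, on disks where format(1M) exposes the cache menu, the
by-hand method looks like this (expert mode):

   format -e
   # select the disk, then navigate:
   #   cache > write_cache > enable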

> - AVS seems to require additional shared storage when building 
> failover clusters with 48 TB of internal storage. That may be hard to 
> explain to the customer. But I'm not 100% sure about this: I just 
> didn't find a way, and I didn't ask on a mailing list for help.
>
>
> If you want a fail-over solution for important data, use external 
> JBODs. Use AVS only to mirror complete clusters; don't use it to 
> replicate single boxes with local drives. And, in case OpenSolaris is 
> not an option for you due to your company policies or support contracts, 
> building a real cluster is also A LOT cheaper.
>   

AVS is not a mirroring technology, it is a remote replication technology.
So, yes, I agree 100% that people should not expect AVS to be a mirror.

An earlier discussion on this forum dealt with the details of when
write ordering must be preserved for ongoing operation.  But when a
full resync is required, write ordering is not preserved.  The theory
is that this might affect ZFS more than other file systems, or perhaps
ZFS might notice it more than other file systems.  But again, this
affects other remote replication technologies, too.
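
For ongoing (non-resync) operation, the AVS answer to write ordering
across a multi-volume pool is to put all of the pool's volumes into one
I/O consistency group at enable time.  A sketch, with hypothetical
hosts and devices -- see sndradm(1M) for the exact syntax:

   # args: primary-host data-vol bitmap-vol, secondary-host data-vol
   # bitmap-vol, protocol, mode, and the consistency group name
   sndradm -e primhost /dev/rdsk/c1t0d0s0 /dev/rdsk/c2t0d0s0 \
           sechost /dev/rdsk/c1t0d0s0 /dev/rdsk/c2t0d0s0 \
           ip async g tankgroup

Repeat for each volume in the pool with the same "g tankgroup", and
SNDR should preserve write ordering across the group -- but not, as
noted above, during a full resync.
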
 -- richard
