Re: [zfs-discuss] RAIDZ one of the disk showing unavail

2008-09-29 Thread Ralf Ramge
Miles Nordin wrote:

 Ralf, aren't you missing this obstinence-error:
 
 sc the following errors must be manually repaired:
 sc /dev/dsk/c0t2d0s0 is part of active ZFS pool export_content.
 
 and he used the -f flag.

No, I saw it. My understanding was that the drive was unavailable 
right after the *creation* of the zpool, and replacing a broken drive 
with itself doesn't make sense. After replacing the drive with a 
working one, ZFS should recognize this automatically.
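
For the archives, a minimal sketch of the usual replacement workflow 
once a working drive is in place - the pool and device names are taken 
from the quoted error message (export_content, c0t2d0) and may not 
match your setup:

---
# check which device is UNAVAIL
zpool status export_content

# after physically swapping in a working drive in the same slot,
# tell ZFS to replace the device with itself:
zpool replace export_content c0t2d0

# if the old label still claims the disk is part of an active pool,
# the replace has to be forced:
zpool replace -f export_content c0t2d0

# watch the resilver
zpool status -v export_content
---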

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ one of the disk showing unavail

2008-09-26 Thread Ralf Ramge
Srinivas Chadalavada wrote:

  I see the first disk as unavailable. How do I make it online?

By replacing it with a non-broken one.

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 vs AVS ?

2008-09-22 Thread Ralf Ramge
[...] This is a cheap workaround, but honestly: you can use something 
like this for your own datacenter, but I bet nobody wants to sell it to 
a customer as a supported solution ;-)


-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Inconsistent df and du output ?

2008-09-22 Thread Ralf Ramge
Juris Krumins wrote:

  The lun.0 file, which is at least 20 GB big, resides on /export/storage.
  Why does df show only 4.9 GB?

---
-bash-3.2# ls -la
total 4194871
drwxr-xr-x   2 root sys3 Sep 18 17:28 .
drwxr-xr-x   5 root root   8 Sep 18 17:44 ..
-rw---   1 root sys  2147483648 Sep 18 17:42 lun.0
---

Let's make the total file size a bit more human-readable: 2,147,483,648 
bytes.

That's 2 GB, not 20. Try `ls -alh`.
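
If in doubt, comparing the apparent size with the space actually 
allocated makes the difference obvious. A quick sketch, using the path 
from your listing:

---
# apparent file size, human readable
ls -lh /export/storage/lun.0

# blocks actually allocated on disk (sparse files can differ a lot;
# use -k instead of -h if your du doesn't know -h)
du -h /export/storage/lun.0
---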

Concerning df:

---
-bash-3.2# df -h
Filesystem size   used  avail capacity  Mounted on
[...]
rpool/export65G   4.8G54G 9%/export
[...]
---

Looks good to me.

Or did I miss something and misunderstand you?

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 vs AVS ?

2008-09-17 Thread Ralf Ramge
Jorgen Lundman wrote:

 If we were interested in finding a method to replicate data to a 2nd 
 x4500, what other options are there for us? 

If you already have an X4500, I think the best option for you is a cron 
job with incremental 'zfs send'. Or rsync.
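
A minimal sketch of what such a cron job could look like - it assumes a 
pool called tank on both Thumpers, ssh access to the second X4500 
(called thumper2 here), and that a common baseline snapshot has already 
been sent once as a full stream; all names are only illustrative, and 
the bookkeeping and error handling are left out:

---
#!/bin/sh
# incremental replication of tank to the second X4500 ("thumper2")
PREV=baseline                # snapshot that already exists on both sides
NOW=`date +%Y%m%d%H%M`

zfs snapshot tank@$NOW

# send only the delta between the last common snapshot and the new one
zfs send -i tank@$PREV tank@$NOW | ssh thumper2 zfs receive -F tank

# left out: remembering tank@$NOW as the next baseline and destroying
# old snapshots on both sides
---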

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] A few questions

2008-09-17 Thread Ralf Ramge
gm_sjo wrote:

 Are you not infact losing performance by reducing the
 amount of spindles used for a given pool?

It depends. Usually, RAIDZ1/2 isn't a good performer when it comes to 
random-access read I/O, for instance. If I wanted to scale performance 
by adding spindles, I would use mirrors (RAID 10). If you want to scale 
filesystem sizes, RAIDZ is your friend.

I once needed high random I/O performance and at least an 11 TB 
filesystem on an X4500. Mirroring was out of the question (not enough 
disk space left), and RAIDZ gave me only about 25% of the performance 
of the existing Linux ext2 boxes I had to compete with. In the end, 
striping 13 RAIDZ sets of 3 drives each plus 1 hot spare delivered 
acceptable results in both categories - but it took a lot of 
benchmarking to get there.
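
To make the two approaches concrete, a rough sketch of both layouts on 
a handful of disks - the controller/target names are placeholders, not 
an actual X4500 layout, and the two pools are alternatives:

---
# scaling random I/O: a pool of striped two-way mirrors ("RAID 10")
zpool create fastpool \
  mirror c1t0d0 c2t0d0 \
  mirror c1t1d0 c2t1d0 \
  mirror c1t2d0 c2t2d0

# scaling capacity: striped small RAIDZ sets plus a hot spare
zpool create bigpool \
  raidz c3t0d0 c4t0d0 c5t0d0 \
  raidz c3t1d0 c4t1d0 c5t1d0 \
  spare c5t2d0
---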


-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 vs AVS ?

2008-09-10 Thread Ralf Ramge
Matt Beebe wrote:

 But what happens to the secondary server?  Specifically to its bit-for-bit 
 copy of Drive #2... presumably it is still good, but ZFS will offline that 
 disk on the primary server, replicate the metadata, and when/if I promote 
 the seconday server, it will also be running in a degraded state (ie: 3 out 
 of 4 drives).  correct?



Correct.

 In this scenario, my replication hasn't really bought me any increased 
 availablity... or am I missing something?  



No. You gain availability when the entire primary node goes down, but 
you're not particularly safer when it comes to degraded zpools.


 Also, if I do chose to fail over to the secondary, can I just to a scrub the 
 broken drive (which isn't really broken, but the zpool would be 
 inconsistent at some level with the other online drives) and get back to 
 full speed quickly? or will I always have to wait until one of the servers 
 resilvers itself (from scratch?), and re-replicates itself??


I have not tested this scenario, so I can't say anything about this.

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 vs AVS ?

2008-09-09 Thread Ralf Ramge
Richard Elling wrote:

 Yes, you're right. But sadly, in the mentioned scenario of having 
 replaced an entire drive, the entire disk is rewritten by ZFS.
 
 No, this is not true.  ZFS only resilvers data.

Okay, I see we have a communication problem here. Probably my fault - I
should have written "the entire data and metadata".
I made the assumption that a 1 TB drive in an X4500 may hold up to 1 TB
of data, simply because nobody buys the 1 TB X4500 just to use 10% of
the disk space; they would have bought the 250 GB, 500 GB or 750 GB
model instead.
In any case and for any disk size, that's traffic you don't want on
your network if there's a chance to avoid it.

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss,
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 vs AVS ?

2008-09-08 Thread Ralf Ramge
[...] of trying to start a flame war. From now on, I leave the rest to 
you, because I earn my living with products of Sun Microsystems, too, 
and I don't want to damage either Sun or this mailing list.


-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 vs AVS ?

2008-09-05 Thread Ralf Ramge
[EMAIL PROTECTED] wrote:

   War wounds?  Could you please expand on the why a bit more?



- ZFS is not aware of AVS. On the secondary node, you'll always have to 
force the `zpool import` due to the unnoticed changes of metadata ("zpool 
in use"). No mechanism to prevent data loss exists, e.g. zpools can be 
imported while the replicator is *not* in logging mode (see the sketch 
at the end of this mail).

- AVS is not ZFS aware. For instance, if ZFS resilvers a mirrored disk, 
e.g. after replacing a drive, the complete disk is sent over the network 
to the secondary node, even though the replicated data on the secondary 
is intact.
That's a lot of fun with today's disk sizes of 750 GB and 1 TB drives, 
usually resulting in 10+ hours without real redundancy (customers who 
use Thumpers to store important data usually don't have the budget to
connect their data centers with 10 Gbit/s, so expect 10+ hours *per disk*).

- ZFS + AVS + X4500 leads to bad error handling. The zpool must not be 
imported on the secondary node during the replication. The X4500 does 
not have a RAID controller which signals (and handles) drive faults. 
Drive failures on the secondary node may go unnoticed until the 
primary node goes down and you want to import the zpool on the 
secondary node with the broken drive. Since ZFS doesn't offer a recovery 
mechanism like fsck, data loss of up to 20 TB may occur.
If you use AVS with ZFS, make sure that you have storage which handles 
drive failures without OS interaction.

- 5 hours for scrubbing a 1 TB drive. If you're lucky. Up to 48 drives 
in total.

- An X4500 has no battery-buffered write cache. ZFS uses the server's 
RAM as a cache, 15 GB+. I don't want to find out how long a 
resilver over the network after a power outage may take (a full reverse 
replication would take up to 2 weeks and is no valid option in a serious 
production environment). But the underlying question I asked myself is 
why I should want to replicate data in such an expensive way when I 
consider the 48 TB of data itself not important enough to be protected 
by a battery.


- I gave AVS a set of 6 drives just for the bitmaps (using SVM soft 
partitions). That wasn't enough: the replication was still very slow, 
probably because of an insane amount of head movement, and it scales
badly. Putting the bitmap of a drive on the drive itself (if I remember 
correctly, this is recommended in one of the most-referenced howto blog 
articles) is a bad idea. Always use ZFS on whole disks if performance 
and caching matter to you.

- AVS seems to require additional shared storage when building 
failover clusters with 48 TB of internal storage. That may be hard to 
explain to the customer. But I'm not 100% sure about this: I just 
didn't find a way, and I didn't ask on a mailing list for help.


If you want a fail-over solution for important data, use external 
JBODs. Use AVS only to mirror complete clusters, don't use it to 
replicate single boxes with local drives. And, in case OpenSolaris is 
not an option for you due to your company policies or support contracts, 
building a real cluster is also A LOT cheaper.
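
For the record, a bare-bones sketch of the manual failover sequence 
mentioned in the first point above - set and pool names are 
placeholders, and this is the bare minimum, not a supported script:

---
# on the secondary node, after the primary has died:

# 1. put the replication set(s) into logging mode first - never import
#    while the set is still replicating (add your set or -g group name)
sndradm -n -l

# 2. import the pool; the force is always needed because the pool was
#    never exported by the dead primary
zpool import -f mypool
---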


-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 vs AVS ?

2008-09-04 Thread Ralf Ramge
Jorgen Lundman wrote:

 We did ask our vendor, but we were just told that AVS does not support 
 x4500.


The officially supported AVS has worked on the X4500 since the X4500 
came out. But although Jim Dunham and others will tell you otherwise, I 
absolutely can *not* recommend using it on this hardware with ZFS, 
especially with the larger disk sizes - at least not for important or 
even business-critical data. In such a case, using X41x0 servers with
J4500 JBODs and a HAStoragePlus cluster instead of AVS may be a much 
better and more reliable option, for basically the same price.




-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 vs AVS ?

2008-09-04 Thread Ralf Ramge
Brent Jones wrote:

 I did some Googling, but I saw some limitations sharing your ZFS pool
 via NFS while using HAStorage Cluster product as well.
[...]
   If you are using the zettabyte file system (ZFS) as the exported file
 system, you must set the sharenfs property to off.

That's not a limitation, it just looks like one. The cluster's 
SUNW.nfs resource type decides whether a file system is shared or not, 
and it does this with the usual share and unshare commands in a separate 
dfstab file. The ZFS sharenfs flag is set to off to avoid conflicts.
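
A rough sketch of how the two pieces fit together - the dataset, 
resource and path names are purely illustrative, and the exact dfstab 
location depends on the Pathprefix of your NFS resource group:

---
# let the cluster, not ZFS, manage the NFS shares
zfs set sharenfs=off tank/export

# the share command itself is maintained by the SUNW.nfs resource,
# e.g. in <Pathprefix>/SUNW.nfs/dfstab.<resource>, containing a line like:
#   share -F nfs -o rw /global/tank/export
---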

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS hangs/freezes after disk failure,

2008-08-25 Thread Ralf Ramge
[...] the percentage of pain during a disaster by spending more 
money, e.g. by making the SATA controllers redundant and creating a 
mirror (then controller 1 will hang, but controller 2 will continue 
working), but you must not forget that your PCI bridges, fans, power 
supplies, etc. remain single points of failure which can take the entire 
service down, just like your pulling of the non-hotpluggable drive did.

c) If you want both, you should buy a second server and create an NFS 
cluster.

Hope I could help you a bit,

  Ralf

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim 
Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS hangs/freezes after disk failure,

2008-08-25 Thread Ralf Ramge
Ralf Ramge wrote:
[...]

Oh, and please excuse the grammar mistakes and typos. I'm in a hurry, 
not a retard ;-) At least I think so.

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim 
Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance with Sun StorageTek 2540

2008-02-18 Thread Ralf Ramge
Mertol Ozyoney wrote:

 2540 controler can achieve maximum 250 MB/sec on writes on the first 
 12 drives. So you are pretty close to maximum throughput already.

 Raid 5 can be a little bit slower.


I'm a bit irritated now. I have ZFS running for some Sybase ASE 12.5 
databases on X4600 servers (8x dual core, 64 GB RAM, Solaris 10 
11/06) and 4 GBit/s lowest-cost Infortrend Fibre Channel JBODs with a 
total of 4x 16 FC drives imported into a single mirrored zpool. I 
benchmarked them with tiobench, using a file size of 64 GB and 32 
parallel threads. With an untweaked ZFS, the average throughput I got 
was: sequential and random read > 1 GB/s, sequential write 296 MB/s, 
random write 353 MB/s, leading to a total of approx. 650,000 IOPS with a 
maximum latency of < 350 ms after the databases went into production, 
and the bottleneck is basically the FC HBAs. These are averages; the 
peaks flatline upon reaching the 4 GBit/s Fibre Channel maximum capacity 
pretty soon.
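
For reference, the benchmark was run roughly like this - the option 
names are from memory and may differ slightly between tiobench 
versions, so treat it as a sketch:

---
# 64 GB working set, 32 parallel threads, on the zpool's mount point
./tiobench.pl --dir /pool/bench --size 65536 --threads 32
---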

I'm a bit disturbed because I'm thinking about switching to 2530/2540 
shelves, but a maximum of 250 MB/sec would disqualify them instantly, 
even with individual RAID controllers for each shelf. So my question 
is: can I do the same thing I did with the IFT shelves - can I buy only 
2501 JBODs and attach them directly to the server, thus *not* using the 
2540 RAID controller and still having access to the single drives?

I'm quite nervous about this, because I'm not just talking about a 
single database - I'd need a total of 42 shelves, and I'm pretty 
sure Sun doesn't offer Try & Buy deals at such a scale.

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Norbert Lang, 
Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Avoiding performance decrease when pool usage is over 80%

2008-02-12 Thread Ralf Ramge
Thomas Liesner wrote:
 Nobody out there who ever had problems with low diskspace?

   
Okay, I found your original mail :-)

Quotas are applied to file systems, not pools, and as such are pretty 
independent of the pool size. I found it best to give every user 
his/her own filesystem and to apply individual quotas afterwards.
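
A minimal sketch of that approach, with made-up pool, user and quota 
names/sizes:

---
# one file system per user, each with its own quota
zfs create tank/home
zfs create tank/home/alice
zfs set quota=6g tank/home/alice
zfs create tank/home/bob
zfs set quota=6g tank/home/bob

# check the result
zfs list -o name,quota,used,avail -r tank/home
---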

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Norbert Lang, 
Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Avoiding performance decrease when pool usage is over 80%

2008-02-12 Thread Ralf Ramge
Thomas Liesner wrote:
 Does this mean, that if i have a pool of 7TB with one filesystem for all 
 users with a quota of 6TB i'd be alright?
   
Yep. Although I *really* recommend creating individual file systems: 
e.g. if you have 1,000 users on your server, I'd create 1,000 file 
systems with a quota of 6 GB each. They're easier to handle, more 
flexible to use and easier to back up, they allow better use of 
snapshots, and it's easier to migrate single users to other servers.

 The usage of that fs would never be over 80%, right?

   
Nope.

Don't mix up pools and file systems. Your pool of 7 TB will only be 
filled to a maximum of 6 TB, but the file system will be 100% full - 
which shouldn't impact your overall performance.

 Like in the following example for the pool shares with a poolsize of 228G 
 an one fs with a quota of 100G:

 shares 228G28K   220G 1%/shares
 shares/production   100G   8,4G92G 9%/shares/production

 This would suite me perfectly, as this would be exactly what i wanted to do ;)

   
Yep, you got it.

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Norbert Lang, 
Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4500 ILOM thinks disk 20 is faulted, ZFS thinks not.

2007-12-04 Thread Ralf Ramge
Jason J. W. Williams wrote:
 Have any of y'all seen a condition where the ILOM considers a disk
 faulted (status is 3 instead of 1), but ZFS keeps writing to the disk
 and doesn't report any errors? I'm going to do a scrub tomorrow and
 see what comes back. I'm curious what caused the ILOM to fault the
 disk. Any advice is greatly appreciated.
   
What does `iostat -E` tell you?

I've experienced several times that ZFS is very fault tolerant - a bit 
too tolerant for my taste - when it comes to faulting a disk. I saw 
external FC drives with hundreds or even thousands of errors, even 
entire hanging loops or drives with hardware trouble, and neither ZFS 
nor /var/adm/messages reported a problem. So I prefer examining the 
iostat output over `zpool status` - but with the unattractive side 
effect that it's not possible to reset the error count which iostat 
reports without a reboot, so this method is not suitable for monitoring 
purposes.
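
What I look at, sketched below - the error counters are cumulative 
since boot, so any monitoring would have to track the deltas itself:

---
# per-device error counters (soft/hard/transport errors, vendor info)
iostat -En

# the counters can only be reset by a reboot, so for monitoring you'd
# have to diff successive outputs externally
---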

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Ralf Ramge
Gino wrote:
[...]

 Just a few examples:
 -We lost several zpool with S10U3 because of spacemap bug,
 and -nothing- was recoverable.  No fsck here :(

   
Yes, I criticized the lack of zpool recovery mechanisms, too, during my 
AVS testing.  But I don't have the know-how to judge if it has technical 
reasons.

 -We had tons of kernel panics because of ZFS.
 Here a reboot must be planned with a couple of weeks in advance
 and done only at saturday night ..
   
Well, I'm sorry, but if your datacenter runs into problems when a single 
server isn't available, you probably have much worse problems. ZFS is a 
file system. It's not a cure for hardware trouble or a misplanned 
infrastructure. What would you do if you had the fsck you mentioned 
earlier? Or with another file system like UFS, ext3, whatever? Boot a 
system into single user mode and fsck several terabytes, after planning 
it a couple of weeks in advance?

 -Our 9900V and HP EVAs works really BAD with ZFS because of large cache.
 (echo zfs_nocacheflush/W 1 | mdb -kw) did not solve the problem. Only helped 
 a bit.

   
Use JBODs. Or tell the cache controllers to ignore the flushing 
requests. Should be possible, even the $10k low-cost StorageTek arrays 
support this.

 -ZFS performs badly with a lot of small files.
 (about 20 times slower that UFS with our millions file rsync procedures)

   
I have large Sybase database servers and file servers with billions of 
inodes running on ZFSv3. They are attached to X4600 boxes running 
Solaris 10 U3 with 2x 4 GBit/s dual Fibre Channel, using dumb and cheap 
Infortrend FC JBODs (2 GBit/s) as storage shelves. All my 
benchmarks (both on the command line and within applications) show that 
the Fibre Channel is the bottleneck, even with random reads. ZFS doesn't 
do this out of the box, but a bit of tuning helped a lot.

 -ZFS+FC JBOD:  failed hard disk need a reboot :(
 (frankly unbelievable in 2007!)
   
No. Read the thread carefully. It was mentioned that you don't have to 
reboot the server; all you need to do is pull the hard disk. Shouldn't 
be a problem, except if you don't want to replace the faulty one anyway. 
No other manual operations will be necessary, except for the final 
`zpool replace`. You could also try cfgadm to get rid of ZFS pool 
problems; perhaps it works - I'm not sure about this, because I had the 
idea *after* I solved that problem, but I'll give it a try someday.

 Anyway we happily use ZFS on our new backup systems (snapshotting with ZFS is 
 amazing), but to tell you the true we are keeping 2 large zpool in sync on 
 each system because we fear an other zpool corruption.

   
May I ask how you accomplish that?

And why are you doing this? You should replicate your zpool to another 
host, instead of mirroring locally. Where's your redundancy in that?

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O freeze after a disk failure

2007-09-12 Thread Ralf Ramge
Gino wrote:
 The real problem is that ZFS should stop to force kernel panics.

   
I found these panics very annoying, too - and even more so that the 
zpool was faulted afterwards. But my problem is that when someone asks 
me what ZFS should do instead, I have no idea.

 I have large Sybase database servers and file servers
 with billions of 
 inodes running using ZFSv3. They are attached to
 X4600 boxes running 
 Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using
 dumb and cheap 
 Infortrend FC JBODs (2 GBit/s) as storage shelves.
 

 Are you using FATA drives?

   
Seagate FibreChannel drives, Cheetah 15k, ST3146855FC for the databases.

For the NFS filers we use Infortrend FC shelves with SATA inside.

 During all my 
 benchmarks (both on the command line and within
 applications) show that 
 the FibreChannel is the bottleneck, even with random
 read. ZFS doesn't 
 do this out of the box, but a bit of tuning helped a
 lot.
 

 You found and other good point.
 I think that with ZFS and JBOD, FC links will be soon the bottleneck.
 What tuning have you done?

   
That depends on the individual requirements of each service. Basically, 
we change the recordsize according to the transaction size of the 
databases; on the filers, the performance results were best when the 
recordsize was a bit lower than the average file size (the average file 
size is 12K, so I set a recordsize of 8K). I also set a vdev cache size 
of 8K, and our databases worked best with a vq_max_pending of 32. ZFSv3 
was used - that's the version which is shipped with Solaris 10 11/06.
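
In practice the recordsize part boils down to something like the 
following (dataset names invented); the vdev cache and vq_max_pending 
settings are kernel tunables whose names vary between releases, so I'm 
not showing them as commands here:

---
# match the record size to the database transaction size *before*
# loading the data - existing files keep their old record size
zfs set recordsize=8k tank/db

# filers: slightly below the 12K average file size
zfs set recordsize=8k tank/files

zfs get recordsize tank/db tank/files
---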

 It is a problem if your apps hangs waiting for you to power down/pull out the 
 drive!
 Almost in a time=money environment :)

   
Yes, but why doesn't your application fail over to a standby? I'm also 
working in a "time is money" and "failure is not an option" environment, 
and I doubt I would sleep better if I were responsible for an application 
under such a service level agreement without full high availability. If 
a system reboot can be a single point of failure, what about the network 
infrastructure? Hardware errors? Or power outages?
I'm definitely NOT some kind of know-it-all, don't misunderstand me. 
Your statement just set my alarm bells ringing, and that's why I'm asking.

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mirrored zpool across network

2007-08-22 Thread Ralf Ramge
[...] a degraded state may potentially last longer than 
a weekend, and when you're directly responsible for the mail of millions 
of users and you know that any non-availability will place your name on 
Slashdot (or the name of your CEO, which equals placing your head on a 
scaffold), I'm sure you'll think twice about using ZFS with AVS or 
letting the Linux dudes continue to play with their inefficient boxes :-)

 But if a disaster happened on the primary node, and a decision was 
 made to ZFS import the storage pool on the secondary, ZFS will detect 
 the inconsistency, mark the drive as failed, swap in the secondary HSP 
 disk. Later, when the primary site comes back, and a reverse 
 synchronization is done to restore writes that happened on the 
 secondary, the primary ZFS file system will become aware that a HSP 
 swap occurred, and continue on right where the secondary node left off.
I'll try that as soon as I have a chance again (which means: as soon as 
Sun gets the Sun Cluster working on a X4500).

 c) You *must* force every single `zpool import <pool>` on the secondary
 host. Always.

 Correct, but this is the case even without AVS! If one configured ZFS 
 on SAN-based storage and your primary node crashed, one would need to 
 force every single `zpool import <pool>`. This is not an AVS issue, but 
 a ZFS protection.
Right. Too bad ZFS reacts this way.

I have to admit that you made me nervous once, when you wrote that 
forcing zpool imports would be a bad idea ...

[X] Zfsck now! Let's organize a petition. :-)

 Correct, but this is the case even without AVS! Take the same SAN-based 
 storage scenario above, go to a secondary system on your SAN, 
 and force every single `zpool import <pool>`.

Yes, but on a SAN, I don't have to worry about zpool inconsistency, 
because the zpool always resides on the same devices.

 In the case of a SAN, where the same physical disk would be written to 
 by both hosts, you would likely get complete data loss, but with AVS, 
 where ZFS is actually on two physical disk, and AVS is tracking 
 writes, even if they are inconsistent writes, AVS can and will recover 
 if an update sync is done.
My problem is that there's no ZFS mechanism which allows me to verify 
the zpool consistency before I actually try to import it. Like I said 
before: AVS does it right, just ZFS doesn't (and otherwise it wouldn't 
make sense to discuss it on this mailing list anyway :-) ).

It could really help me with AVS if there was something like `zpool 
check <pool>`, something for checking a zpool before an import. I could 
run a cron job which puts the secondary host into logging mode, runs a 
zpool check and continues with the replication a few hours afterwards. 
That would let me sleep better, and I wouldn't have to pray to the IT 
gods before an import. You know, I saw literally *hundreds* of kernel 
panics during my tests, and that made me nervous. I have scripts which 
do the job now, but I saw the risks and the things which can go wrong if 
someone else without my experience does it (like the infamous forgetting 
to manually place the secondary in logging mode before trying to import 
a zpool).

 Your are quite correct in that although ZFS is intuitively easy to 
 use, AVS is painfully complex. Of course the mindset of AVS and ZFS 
 are as distant apart as they are in the alphabet. :-O

AVS was easy to learn and isn't very difficult to work with. All you 
need is 1 or 2 months of testing experience. Very easy with UFS.

 With AVS in Nevada, there is now an opportunity for leveraging the 
 ease of use of ZFS, with AVS. Being also the iSCSI Target project 
 lead, I see a lot of value in the ZFS option set shareiscsi=on, to 
 get end users in using iSCSI.

Too bad the X4500 has too few PCI slots to consider buying iSCSI cards. 
The two existing slots are already needed for the Sun Cluster 
interconnect. I think iSCSI won't be a real option unless the servers 
are shipped with it onboard, as has been done in the past with the SCSI 
or Ethernet interfaces.

 I would like to see set replication=AVS:secondary host, 
 configuring a locally named ZFS storage pool to the same named pair on 
 some remote host. Starting down this path would afford things like ZFS 
 replication monitoring, similar to what ZFS does with each of its own 
 vdevs.
Yes! Jim, I think we'll become friends :-) Who do I have to send the 
bribe money to?

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is ZFS efficient for large collections of small files?

2007-08-21 Thread Ralf Ramge
Brandorr wrote:
 Is ZFS efficient at handling huge populations of tiny-to-small files -
 for example, 20 million TIFF images in a collection, each between 5
 and 500k in size?

 I am asking because I could have sworn that I read somewhere that it
 isn't, but I can't find the reference.
   
If you're worried about I/O throughput, you should avoid RAIDZ1/2 
configurations. Random read performance will be disastrous if you use 
them; I've seen random read rates of less than 1 MB/s on an X4500 with 
40 dedicated disks for data storage. If you don't have to worry about 
disk space, use mirrors; I got my best results during my extensive X4500 
benchmarking sessions when I mirrored single slices instead of complete 
disks (resulting in 40 two-way mirrors on 40 physical disks, mirroring 
c0t0d0s0-c0t1d0s1, c0t1d0s0-c0t0d0s1, and so on). If you're worried 
about disk space, you should consider striping several RAIDZ1 arrays, 
each one consisting of three disks or slices. Sequential access will go 
down the cliff, but random reads will be boosted.
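
To illustrate the slice-mirroring layout, a sketch of the first two 
mirror pairs only (device names as in the example above, but treat it 
as a sketch, not a recipe):

---
# every mirror pairs a slice from one disk with a slice from another
# disk, so losing either physical disk keeps all mirrors intact
zpool create tank \
  mirror c0t0d0s0 c0t1d0s1 \
  mirror c0t1d0s0 c0t0d0s1
# ... and so on for the remaining disk pairs
---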

You should also adjust the recordsize. Try to measure the average I/O 
transaction size. There's a good chance that your I/O performance will 
be best if you set your recordsize to a smaller value. For instance, if 
your average file size is 12 KB, try using 8K or even 4K recordsize, 
stay away from 16K or higher.

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mirrored zpool across network

2007-08-21 Thread Ralf Ramge
Torrey McMahon wrote:
 AVS?
   
Jim Dunham will probably shoot me, or worse, but I recommend thinking 
twice about using AVS for ZFS replication. Basically, you only have a 
few options:

 1) Using a battery-buffered hardware RAID controller, which leads to 
bad ZFS performance in many cases,
 2) Building three-way mirrors to avoid complete data loss in several 
disaster scenarios due to missing ZFS recovery mechanisms like `fsck`, 
which makes AVS/ZFS based solutions quite expensive,
 3) Additionally using another form of backup, e.g. tapes.

For instance, one scenario which made me think: imagine you have an 
X4500 with 48 internal disks, 500 GB each. This would lead to a ZFS pool 
on 40 disks (you need 1 for the system, plus 3x RAID 10 for the bitmap 
volumes, otherwise your performance will be very bad, plus 2x HSP). 
Using 40 disks leads to a total of 40 separate replications. Now imagine 
the following two scenarios:

a) A disk in the primary fails. What happens? An HSP jumps in and 500 GB 
will be rebuilt. These 500 GB are synced over a single 1 GBit/s 
crossover cable. This takes a bit of time and is 100% unnecessary - and 
it will become much worse in the future, because disk capacities 
rocket up into the sky while the performance isn't improved as much. 
During this time, your service lacks redundancy, and we're not talking 
about a few minutes. Well, now try to imagine what will happen if 
another disk fails during this rebuild, this time in the 
secondary ...

b) A disk in the secondary fails. What happens now? No HSP will jump in 
on the secondary, because the zpool isn't imported and ZFS doesn't know 
about the failure. Instead, you'll end up with 39 active replications 
instead of 40; the one which replicates to the failed drive will become 
inactive. But ... oh damn, the zpool isn't mounted on the secondary 
host, so ZFS doesn't report the drive failure to our server monitoring. 
That can be funny. The only way to become aware of the problem I found 
after a minute of thinking was asking sndradm about the health status - 
which would lead to a false alarm on Host A, because the failed disc is 
in Host B, and operators are usually not bright enough to change the 
disc in Host B after they get notified about a problem on Host A. But 
even if everything works, what happens if the primary fails before an 
administrator has fixed the problem, the missing replication is running 
again and the replacement disc has been completely synced? Hello, 
kernel panic, and goodbye, 12 TB of data.

c) You *must* force every single `zpool import <pool>` on the secondary 
host. Always. Because you usually need your secondary host after your 
primary crashed, you won't have the chance to export your zpool on the 
primary first - and if you do, you don't need AVS at all. Bring some 
Kleenex to get rid of the sweat on your forehead when you have to switch 
to your secondary host, because a single mistake (like forgetting to put 
the secondary host into logging mode manually before you try to import 
the zpool) will lead to a complete data loss. I bet you won't even trust 
your own failover scripts.

You can use AVS and ZFS together - I use it myself. But I made sure 
that I know what I'm doing. Most people probably don't.

Btw: I have to admit that I haven't tried the newest Nevada builds 
during the tests. It's possible that AVS and ZFS work better together 
than they did under Solaris 10 11/06 and AVS 4.0. But there's a reason I 
haven't tried: Sun Cluster 3.2 instantly crashes on Thumpers with 
SATA-related kernel panics, and the OpenHA Cluster isn't available yet.

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ? ZFS dynamic striping over RAID-Z

2007-08-02 Thread Ralf Ramge
Tim Thomas wrote:
 if I create a storage pool with multiple RAID-Z stripes in it does ZFS 
 dynamically stripe data across all the RAID-Z stripes in the pool 
 automagically ?

 If I relate this back to my storage array experience, this would be 
 Plaiding which is/was creating a RAID-0 logical volume across multiple 
 h/ware RAID-5 stripes.
   
I did this one week ago, while trying to get at least a bit of random 
read performance out of an X4500. Normal RAIDZ(2) performance was between
0.8 and 5 MB/s, which was way too slow for our needs, so I used striped
RAIDZ to get a boost.

My test configuration:

c5t0/t4: system (mirrored)
c5t1/t5 - c5t3/t7: AVS bitmap volumes (mirrored)

This left 40 disks. I created 13 RAIDZ sets with 3 disks each - that's 39
disks in total, plus one hot spare. The script I used is appended below.

And yes, it results in striped RAIDZ arrays (I call it RAIDZ0). And my
data throughput was 13 times higher, as expected.

Hope this will help you a bit.

---
#!/bin/sh

# 13x 3-disk raidz vdevs striped into one pool, plus one hot spare
/usr/sbin/zpool create -f big raidz c0t0d0s0 c1t0d0s0 c4t0d0s0 spare c7t7d0s0
/usr/sbin/zpool add -f big raidz c6t0d0s0 c7t0d0s0 c0t1d0s0
/usr/sbin/zpool add -f big raidz c1t1d0s0 c4t1d0s0 c6t1d0s0
/usr/sbin/zpool add -f big raidz c7t1d0s0 c0t2d0s0 c1t2d0s0
/usr/sbin/zpool add -f big raidz c4t2d0s0 c6t2d0s0 c7t2d0s0
/usr/sbin/zpool add -f big raidz c0t3d0s0 c1t3d0s0 c4t3d0s0
/usr/sbin/zpool add -f big raidz c6t3d0s0 c7t3d0s0 c0t4d0s0
/usr/sbin/zpool add -f big raidz c1t4d0s0 c4t4d0s0 c6t4d0s0
/usr/sbin/zpool add -f big raidz c7t4d0s0 c0t5d0s0 c1t5d0s0
/usr/sbin/zpool add -f big raidz c4t5d0s0 c6t5d0s0 c7t5d0s0
/usr/sbin/zpool add -f big raidz c0t6d0s0 c1t6d0s0 c4t6d0s0
/usr/sbin/zpool add -f big raidz c6t6d0s0 c7t6d0s0 c0t7d0s0
/usr/sbin/zpool add -f big raidz c1t7d0s0 c4t7d0s0 c6t7d0s0
/usr/sbin/zpool status
---


-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas 
Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [AVS] Question concerning reverse synchronization of a zpool

2007-07-12 Thread Ralf Ramge
Ralf Ramge wrote:
 Questions:

 a) I don't understand why the kernel panics at the moment. the zpool 
 isn't mounted on both systems, the zpool itself seems to be fine after a 
 reboot ... and switching the primary and secondary hosts just for 
 resyncing seems to force a full sync, which isn't an option.

 b) I'll try a sndradm -m -r the next time ... but I'm not sure if I 
 like that thought. I would accept this if I replaced the primary host 
 with another server, but having to do a 24 TB full sync just because the 
 replication itself had been disabled for a few minutes would be hard to 
 swallow. Or did I do something wrong?

   
I've answered these questions myself in the meantime (with a nice 
employee of Sun Hamburg giving me the hint). For Google: during a 
reverse sync, neither side of the replication is allowed to have the 
zpool imported, because after the reverse sync finishes, SNDR enters 
replication mode. This renders reverse syncs useless for HA scenarios; 
switch primary and secondary roles instead.

 c) What performance can I expect from a X4500, 40 disks zpool, when 
 using slices, compared to LUNs? Any experiences?

   
Any input to the question will still be appreciated :-)

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] [AVS] Question concerning reverse synchronization of a zpool

2007-07-11 Thread Ralf Ramge
Hi,

I've been struggling to get a stable ZFS replication using Solaris 10 
11/06 (with current patches) and AVS 4.0 for several weeks now. We tried 
it on VMware first and ended up with kernel panics en masse (yes, we read 
Jim Dunham's blog articles :-). Now we're trying it on the real thing, two 
X4500 servers. Well, I have no trouble reproducing our kernel panics 
there either ... but I think I learned some important things, too. One 
problem is still remaining.

I have a zpool on host A. Replication to host B works fine.

* `zpool export tank` on the primary - works.
* `sndradm -d` on both servers - works (paranoia mode).
* `zpool import <id>` on the secondary - works.

So far, so good. I change the contents of the file system, add some 
files, delete some others ... no problems. The secondary is in 
production use now, everything is fine.

Okay, let's imagine I switched to the secondary host because I had a 
problem with the primary. Now it's repaired, and I want my redundancy back.

* `sndradm -E -f` on both hosts - works.
* `sndradm -u -r` on the primary for refreshing the primary - works. 
`nicstat` shows me a bit of traffic.

Good, let's switch back to the primary. Current status: the zpool is 
imported on the secondary and NOT imported on the primary.

* `zpool export tank` on the secondary - *kernel panic*

Sadly, the machine dies so fast that I don't see the kernel panic with 
`dmesg`. Disabling the replication again later and mounting the zpool on 
the primary again shows me that the update sync didn't take place: the 
file system changes I made on the secondary weren't replicated. Exporting 
the zpool on the secondary works *after* the system has rebooted.

I use slices for the zpool, not LUNs, because I think many of my 
problems were caused by exclusive locking, but it doesn't help with this 
one.

Questions:

a) I don't understand why the kernel panics at the moment: the zpool 
isn't mounted on both systems, the zpool itself seems to be fine after a 
reboot ... and switching the primary and secondary hosts just for 
resyncing seems to force a full sync, which isn't an option.

b) I'll try a `sndradm -m -r` the next time ... but I'm not sure if I 
like that thought. I would accept this if I replaced the primary host 
with another server, but having to do a 24 TB full sync just because the 
replication itself had been disabled for a few minutes would be hard to 
swallow. Or did I do something wrong?

c) What performance can I expect from an X4500 with a 40-disk zpool 
when using slices, compared to LUNs? Any experiences?

And another thing: I did some experiments with zvols, because I wanted 
to make disasters and the AVS configuration itself easier to handle - 
there won't be a full sync after replacing a disk, because AVS doesn't 
see that a hot spare is being used, and hot spares won't be replicated 
to the secondary host either, although the original drive on the 
secondary never failed. I used the zvol with UFS, and this kind of 
hardware RAID controller emulation by ZFS works pretty well, except that 
the performance went down the cliff. SunSolve told me that this is a 
flushing problem and that there's a workaround in Nevada build 53 and 
higher. Has somebody done a comparison? Can you share some experiences? 
I only have a few days left and I don't want to waste time installing 
Nevada for nothing ...

Thanks,

  Ralf

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss