Re: [zfs-discuss] non-ECC Systems and ZFS for home users

2010-09-24 Thread Frank Middleton

On 09/23/10 19:08, Peter Jeremy wrote:


The downsides are generally that it'll be slower and less power-
efficient than a current-generation server, and the I/O interfaces will
also be last-generation (so you are more likely to be stuck with
parallel SCSI and PCI or PCI-X rather than SAS/SATA and PCIe).  And
when something fails (fan, PSU, ...), it's more likely to be customised
in some way that makes it more difficult/expensive to repair/replace.


Sometimes the bargains on eBay are such that you can afford to get
a second or even a third machine for spares, and a PCI-X SAS card has more
than adequate performance for SOHO use. But, I agree, repair is
probably impossible unless you can simply swap in a spare part from
another box. However, server-class machines are pretty tough. My used
Sun hardware hasn't missed a beat, and it's been running 24x7
for years - well, I cycle the spares since they were never needed for
parts, so it's a little less than that. But they are noisy...

Surely the issue about repairs extends to current generation hardware.
It gets obsolete so quickly that finding certain parts (especially mobos)
may be next to impossible. So what's the difference other than lots of $$$?

Cheers -- Frank



Re: [zfs-discuss] non-ECC Systems and ZFS for home users

2010-09-23 Thread Frank Middleton

On 09/23/10 03:01, Ian Collins wrote:


So, I wonder - what's the recommendation, or rather, experience as far
as home users are concerned? Is it safe enough now do use ZFS on
non-ECC-RAM systems (if backups are around)?


It's as safe as running any other OS.

The big difference is ZFS will tell you when there's a corruption. Most
users of other systems are blissfully unaware of data corruption!


The catch is the possibility of perfectly good files becoming inaccessible
because bad checksums were written to all the mirrors. As Richard Elling
wrote some time ago in [zfs-discuss] You really do need ECC RAM, see
http://www.cs.toronto.edu/%7Ebianca/papers/sigmetrics09.pdf. There
were a couple of zfs-discuss threads quite recently about memory problems
causing serious issues. Personally, I wouldn't trust any valuable data to any
system without ECC, regardless of OS and file systems. For home use, used
Suns are available at ridiculously low prices and they seem to be much better
engineered than your typical PC. Memory failures are much more likely than
winning the pick 6 lotto...

FWIW Richard helped me diagnose a problem with checksum failures on
mirrored drives a while back and it turned out to be the CPU itself getting
the actual checksum wrong /only on one particular file/, and even then only
when the ambient temperature was high. So ZFS is good at ferreting out
obscure hardware problems :-).

Cheers -- Frank


Re: [zfs-discuss] resilver of older root pool disk

2010-09-23 Thread Frank Middleton

Bumping this because no one responded. Could this be because
it's such a stupid question no one wants to stoop to answering it,
or because no one knows the answer? Trying to picture, say, what
could happen in /var (say /var/adm/messages), let alone a swap
zvol, is giving me a headache...

On 07/09/10 17:00, Frank Middleton wrote:

This is a hypothetical question that could actually happen:

Suppose a root pool is a mirror of c0t0d0s0 and c0t1d0s0
and for some reason c0t0d0s0 goes off line, but comes back
on line after a shutdown. The primary boot disk would then
be c0t0d0s0 which would have much older data than c0t1d0s0.

Under normal circumstances ZFS would know that c0t0d0s0
needs to be resilvered. But in this case c0t0d0s0 is the boot
disk. Would ZFS still be able to resilver the correct
disk under these circumstances? I suppose it might depend
on which files, if any, had actually changed...

Thanks -- Frank



Re: [zfs-discuss] carrying on [was: Legality and the future of zfs...]

2010-07-19 Thread Frank Middleton

On 07/19/10 07:26, Andrej Podzimek wrote:


I run ArchLinux with Btrfs and OpenSolaris with ZFS. I haven't had a
serious issue with any of them so far.


Moblin/Meego ships with btrfs by default. COW file system on a
cell phone :-). Unsurprisingly for a read-mostly file system it
seems pretty stable. There's an interesting discussion about btrfs
on Meego at http://lwn.net/Articles/387196/


Undoubtedly, ZFS is currently much more mature and usable than Btrfs.


Agreed, but it's not just ZFS. It's the packaging system, beadm,
stmf, the whole works. A simple yum update can be a terrifying experience
and almost impossible to undo. And updating to a major new Linux release?
Almost as bad as updating MSWindows. OpenSolaris as an administrable
system is simply years ahead of anything else.


However, Btrfs can evolve very quickly, considering the huge community
around Linux. For example, EXT4 was first released in late 2006 and I
first deployed it (with a stable on-disk format) in early 2009.


But the infrastructure to make use of a ZFS-like manager simply isn't
there. As a Linux and Solaris developer and user of both, I'd take Solaris
any day and so would everyone I know. But going back to the original
topic, the tea leaves seem to be saying that Oracle is interested primarily
in Solaris as a robust server OS and probably not so much for the desktop
where there realistically isn't going to be much revenue. But it would be
a bad gamble if they lose a lot of mind-share. Legal issues over ZFS make
it even worse. I get calls for help converting MSWindows applications and
servers to Linux. ZFS and all the other goodies make a compelling case
for Solaris (and Sun/Oracle hardware) instead, but the uncertainties make
it a hard sell. Oracle, are you listening?


Re: [zfs-discuss] Move Fedora or Windows disk image to ZFS (iScsi Boot)

2010-07-19 Thread Frank Middleton

On 07/18/10 17:39, Packet Boy wrote:


What I cannot find is how to take an existing Fedora image and copy
its contents into a ZFS volume so that I can migrate this image
from my existing Fedora iScsi target to a Solaris iScsi target (and
of course get the advantages of having that disk image hosted on
ZFS).

Do I just zfs create -V and then somehow dd the Fedora .img file on
top of the newly created volume?


Well, you could simply mount the iscsi devices and choose any suitable
method to copy the existing volume. For example, Fedora will
create /dev/sd* for each iscsi device it knows about, so you see an
empty drive at that point and the problem reduces to whatever
you would do if you wanted to use a new physical drive. ntfsclone
works for MSWindows; I suppose dd might work for Linux, although
the disk geometries would have to be identical and you'd have to
copy the entire disk. It might be safer to create new file systems
on the new disk and use cpio or even tar to copy everything. Shame
it's so hard to do mirroring with Fedora, so the ZFS mirror trick
might be too difficult.
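
If you do go the zfs create -V route, a minimal sketch might look like this
(the pool, volume, and image names are only placeholders, and the volume
obviously has to be at least as big as the image):

# zfs create -V 20g tank/fedora-vol
# dd if=/export/images/fedora.img of=/dev/zvol/rdsk/tank/fedora-vol bs=1024k
# zfs set shareiscsi=on tank/fedora-vol   (legacy iscsitgt; with COMSTAR you'd use sbdadm/stmfadm instead)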


I've spent hours and have not been able to find any example on how to
do this.


Making the new drive bootable is the real problem, since it will probably
not have the same identifier. For sure you'd have to edit grub on the
new drive and perhaps run grub interactively to install a boot loader.

Hope this helps -- Frank


[zfs-discuss] resilver of older root pool disk

2010-07-09 Thread Frank Middleton

This is a hypothetical question that could actually happen:

Suppose a root pool is a mirror of c0t0d0s0 and c0t1d0s0
and for some reason c0t0d0s0 goes off line, but comes back
on line after a shutdown. The primary boot disk would then
be c0t0d0s0 which would have much older data than c0t1d0s0.

Under normal circumstances ZFS would know that c0t0d0s0
needs to be resilvered. But in this case c0t0d0s0 is the boot
disk. Would ZFS still be able to resilver the correct
disk under these circumstances? I suppose it might depend
on which files, if any, had actually changed...

Thanks -- Frank




Re: [zfs-discuss] zfs/lofi/share panic

2010-05-30 Thread Frank Middleton

On 05/27/10 05:16 PM, Dennis Clarke wrote:


I just tried this with a UFS based filesystem just for a lark.


It never failed on UFS, regardless of the contents of /etc/dfs/dfstab.


Guess I must now try this with a ZFS fs under that iso file.


Just tried it again with b134  *with* share /mnt in /etc/dfs/dfstab.

# mount -O -F hsfs /export/iso_images/moblin-2.1-PR-Final-ivi-201002090924.img /mnt
# ls /mnt
isolinux  LiveOS
# unshare /mnt
/mnt: path doesn't exist
# share /mnt
# unshare /mnt
# share /mnt

Panic ensues (the following observed on the serial console); note that
the dataset is not UFS!

# May 30 13:35:44 host5 ufs: NOTICE: mount: not a UFS magic number (0x0)

panic[cpu1]/thread=30001f5f560: BAD TRAP: type=31 rp=2a1014769a0 addr=218 mmu_fsr=0 
occurred in module nfssrv due to a NULL pointer dereference

Tried again after it rebooted

Edited /etc/dfs/dfstab  to remove the share /mnt
# unshare /mnt
# mount -O -F hsfs /backups/icon/moblin-2.1-PR-Final-ivi-201002090924.img /mnt
# ls /mnt
isolinux  LiveOS
# unshare /mnt
/mnt: bad path
# share /mnt
# unshare  /mnt
# share /mnt

No panic. So the problem all along appears to be what happens if you
mount -O to an already shared mountpoint. Deliberately sharing before
mounting (but with nothing in /etc/dfs/dfstab) resulted in a slightly
different panic (more like the ones documented in the CR):

panic[cpu1]/thread=30002345e0: BAD TRAP: type=34 rp=2a100f84460 
addr=ff6f6c2f5267 mmu_fsr=0

unshare: alignment error:

So CR6798273 should be amended to show the following:

To reproduce, share (say) /mnt
mount -O some-image-file /mnt
share /mnt
unshare /mnt
share /mnt
unshare /mnt
Highly reproducible panic ensues.

Workaround - make sure mountpoints are not shared before
mounting iso images stored on a ZFS dataset.

So the problem, now seen to be relatively trivial, isn't fixed, at least
in b134. For all of you who responded both off and on the list and
motivated this experiment, many thanks. Perhaps someone with
access to a more recent build could try this, and if it still happens,
update and reopen CR6798273, although it doesn't seem very
important now.

Regards -- Frank


[zfs-discuss] zfs/lofi/share panic

2010-05-24 Thread Frank Middleton

Many, many moons ago, I submitted a CR about a
highly reproducible panic that occurs if you try to re-share
a lofi-mounted image. That CR has AFAIK long since
disappeared - I even forget what it was called.

This server is used for doing network installs. Let's say
you have a 64 bit iso lofi-mounted and shared. You do the
install, and then wish to switch to a 32 bit iso. You unshare,
umount, delete the loopback, and then lofiadm the new iso,
mount it and then share it. Panic, every time.

Is this such a rare use-case that no one is interested? I have
the backtrace and cores if anyone wants them, although
they were submitted with the original CR. This is pretty
frustrating, since you start to run out of ideas for mountpoint
names after a while unless you forget and get the panic.

FWIW (even on a freshly booted system after a panic)
# lofiadm zyzzy.iso /dev/lofi/1
# mount -F hsfs /dev/lofi/1 /mnt
mount: /dev/lofi/1 is already mounted or /mnt is busy
# mount -O -F hsfs /dev/lofi/1 /mnt
# share /mnt
#

If you unshare /mnt and then do this again, it will panic.
This has been a bug since before Open Solaris came out.

It doesn't happen if the iso is originally on UFS, but
UFS really isn't an option any more. FWIW the dataset
containing the isos has the sharenfs attribute set,
although it doesn't have to be actually mounted by
any remote NFS client for this panic to occur.

Suggestions for a workaround most welcome!

Thanks



Re: [zfs-discuss] Sharing with zfs

2010-05-04 Thread Frank Middleton

On 05/ 4/10 05:37 PM, Vadim Comanescu wrote:

I'm wondering, is there a way to actually delete a zvol, ignoring the fact
that it has an attached LU?


You didn't say what version of which OS you are running. As of b134
or so it seems to be impossible to delete a zfs iscsi target. You might
look at the thread [zfs-discuss] How to destroy iscsi dataset?,
although it never really came to a satisfying conclusion.

AFAIK the only way to delete a zfs iscsi target is to boot b132 or
earlier in single-user mode. IIRC there are iscsitgt and COMSTAR
changes coming in later releases, so it might be worth trying again
when we eventually get to go past b134.

HTH -- Frank





Re: [zfs-discuss] SSD best practices

2010-04-21 Thread Frank Middleton

On 04/20/10 11:06 AM, Don wrote:


Who else, besides STEC, is making write optimized drives and what
kind of IOP performance can be expected?


Just got a distributor email about Texas Memory Systems' RamSan-630,
one of a range of huge non-volatile SAN products they make. Other
than that it has a capacity of 4-10TB, looks like a 4U, and consumes
an amazing 450W, I don't know anything about them. The IOPS are
pretty impressive, but power-wise, at 45W/TB, even mirrored disks
use quite a bit less power. But 500K random IOPS and 8GB/s might
be worth it if the specs are to be believed...







Re: [zfs-discuss] Making an rpool smaller?

2010-04-16 Thread Frank Middleton

On 04/16/10 07:41 PM, Brandon High wrote:


1. Attach the new drives.
2. Reboot from LiveCD.
3. zpool create new_rpool on the ssd


Is step 2 actually necessary? Couldn't you create a new BE

# beadm create old_rpool
# beadm activate old_rpool
# reboot
# beadm delete rpool

It's the same number of steps, but saves the bother of making
a zpool-version-compatible live CD. Also, how attached are you
to the pool name rpool? I have systems with root pools called spool,
tpool, etc., even one rpool-1 (because the text installer detected
an earlier rpool on an iscsi volume I was overwriting), and they
all seem to work fine.

Actually, my preferred method (if you really want the new pool
to be called rpool) would be to do the 4-step rename on the ssd
after all the other steps are done and you've successfully booted it.
Then you always have the untouched old disk in case you mess up.

Also, (gurus please correct here), you might need to change
step 3 to something like

# zpool create -f -o failmode=continue -R /mnt -m legacy rpool ssd
in which case you can recv to it without rebooting at all, and
# zpool set bootfs=...

You might also consider where you want swap to be and make sure
that vfstab is correct on the old disk now that the root pool has
a different name. There was detailed documentation on how to zfs
send/recv root pools on the Sun ZFS documentation site, but right
now it doesn't seem to be Googleable. I'm not sure your original
set of steps will work without at least doing the above two.

You might need to check to be sure the ssd has an SMI label.

AFAIK the official syntax for installing the MBR is
# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/ssd

Finally, you should check or delete /etc/zfs/zpool.cache because
it will likely be incorrect on the ssd after recv'ing the snapshot.

HTH -- Frank





Re: [zfs-discuss] Making an rpool smaller?

2010-04-16 Thread Frank Middleton

On 04/16/10 08:57 PM, Frank Middleton wrote:


AFAIK the official syntax for installing the MBR is
# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk
/dev/rdsk/ssd


Sorry, that's for SPARC. You had the installgrub down correctly...



Re: [zfs-discuss] Making an rpool smaller?

2010-04-16 Thread Frank Middleton

On 04/16/10 09:53 PM, Brandon High wrote:


Right now, my boot environments are named after the build it's
running. I'm guessing that by 'rpool' you mean the current BE above.


No, I didn't :-(. Please ignore that part - too much caffeine :-).


I figure that by booting to a live cd / live usb, the pool will not be
in use, so there shouldn't be any special steps involved.


Might be the easiest way. But I've never found having a different
name for the root pool to be a problem. The lack, until recently, of
a bootable CD for SPARC may have something to do with living
with different names. Makes it easier to recv snapshots from
different hosts and architectures, too.


I'll try out a few variations on a VM and see how it goes.


You'll need to do the zfs create with legacy mount option, and
set the bootfs property. Otherwise it looks like you are on the
right path.

Cheers -- Frank


Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-07 Thread Frank Middleton

On 04/ 7/10 03:09 PM, Jason S wrote:
 

I was actually already planning to get another 4 gigs of RAM for the
box right away anyway, but thank you for mentioning it! As there
appear to be a couple of ways to skin the cat here, I think I am going
to try both a 14-spindle RaidZ2 and a 2 x 7 RaidZ2 configuration and
see what the performance is like. I have a few days of grace before
I need to have this server ready for duty.


Just curious, what are you planning to boot from? AFAIK you can't
boot ZFS from anything much more complicated than a mirror.

Cheers -- Frank



Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-04 Thread Frank Middleton

On 04/ 4/10 10:00 AM, Willard Korfhage wrote:


What should I make of this? All the disks are bad? That seems
unlikely. I found another thread

http://opensolaris.org/jive/thread.jspa?messageID=399988

where it finally came down to bad memory, so I'll test that. Any
other suggestions?


It could be the cpu. I had a very bizarre case where the cpu would
sometimes miscalculate the checksums of certain files, mostly
when the cpu was also busy doing other things. Probably the cache.

Days of running memtest and SUNWvts didn't result in any errors
because this was a weirdly pattern-sensitive problem. However, I
too am of the opinion that you shouldn't even think of running zfs
without ECC memory (lots of threads about that!) and that this
is far, far more likely to be your problem, but I wouldn't count on
diagnostics finding it, either. Of course it could be the controller too.

For laughs, the cpu calculating bad checksums was discussed in
http://opensolaris.org/jive/message.jspa?messageID=469108
(see last message in the thread).

If you are seriously contemplating using a system with
non-ECC RAM, check out the Google research mentioned in
http://opensolaris.org/jive/thread.jspa?messageID=423770
http://www.cs.toronto.edu/%7Ebianca/papers/sigmetrics09.pdf

Cheers -- Frank



Re: [zfs-discuss] zpool split problem?

2010-03-31 Thread Frank Middleton

On 03/31/10 12:21 PM, lori.alt wrote:


The problem with splitting a root pool goes beyond the issue of the
zpool.cache file. If you look at the comments for 6939334
http://monaco.sfbay.sun.com/detail.jsf?cr=6939334, you will see other
files whose content is not correct when a root pool is renamed or split.


6939334 seems to be inaccessible outside of Sun. Could you
list the comments here?

Thanks
 


[zfs-discuss] How to destroy iscsi dataset?

2010-03-30 Thread Frank Middleton

Our backup system has a couple of datasets used for iscsi
that have somehow lost the baseline snapshots they shared with the
live system. In fact zfs list -t snapshot doesn't show
any snapshots at all for them. We rotate backup and live
every now and then, so these datasets have been shared
at some time.

Therefore an incremental zfs send/recv will fail for
these datasets. The send script automatically uses
a non-incremental send if the target dataset is missing,
so all I need to do is somehow destroy them.

# svcs -a | grep iscsi
disabled   18:50:21 svc:/network/iscsi_initiator:default
disabled   18:50:34 svc:/network/iscsi/target:default
disabled   18:50:38 svc:/system/iscsitgt:default
disabled   18:50:39 svc:/network/iscsi/initiator:default
# zfs list space/os-vdisks/osolx86
NAME                      USED  AVAIL  REFER  MOUNTPOINT
space/os-vdisks/osolx86    20G   657G  14.9G  -
# zfs get shareiscsi space/os-vdisks/osolx86
NAME                     PROPERTY    VALUE   SOURCE
space/os-vdisks/osolx86  shareiscsi  off     local
# zfs destroy -f space/os-vdisks/osolx86
cannot destroy 'space/os-vdisks/osolx86': dataset is busy

AFAIK they aren't shared in any way now.
How can I delete these datasets, or find out why they are busy?

Thanks


Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-28 Thread Frank Middleton

Thanks to everyone who made suggestions! This machine has run
memtest for a week and VTS for several days with no errors. It
does seem that the problem is probably in the CPU cache.

On 03/24/10 10:07 AM, Damon Atkins wrote:

You could try copying the file to /tmp (ie swap/ram) and do a
continues loop of checksums


On a variation of your suggestion, I implemented a bash script
that applies sha1sum 10,000 times with a pause of 0.1s between
attempts and tests each result against what seemed to be the
correct value (a sketch of the loop follows the results).

sha1sum on /lib/libdlpi.so.1 resulted in 11% incorrect results
sha1sum on /tmp/libdlpi.so.1 resulted in 5 failures out of 10,000
sha1sum on /lib/libpam.so.1 resulted in zero errors in 10,000
sha1sum on /tmp/libpam.so.1: ditto.
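
A minimal version of the loop, for reference (the file name, count, and
delay are just the values mentioned above; it assumes the first result is
the correct one, and the fractional sleep needs a sleep(1) that accepts it,
e.g. the GNU one):

#!/usr/bin/bash
# Checksum a file repeatedly and count results that differ from a reference value.
FILE=${1:-/lib/libdlpi.so.1}
GOOD=$(sha1sum "$FILE" | awk '{print $1}')   # reference sum, assumed correct
BAD=0 i=0
while [ $i -lt 10000 ]; do
    SUM=$(sha1sum "$FILE" | awk '{print $1}')
    [ "$SUM" != "$GOOD" ] && BAD=$((BAD + 1))
    sleep 0.1                                # 0.1s pause between attempts
    i=$((i + 1))
done
echo "$BAD mismatches out of $i runs on $FILE"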

So what we have is a pattern sensitive failure that is also sensitive
to how busy the cpu is (and doesn't fail running VTS). md5sum and
sha256sum produced similar results, and presumably so would
fletcher2. To get really meaningful results, the machine should be
otherwise idle (but then, maybe it wouldn't fail).

Is anyone willing to speculate (or have any suggestions for further
experiments) about what failure mode could cause a checksum
calculation to be pattern sensitive and also thousands of times
more likely to fail if read from disk vs. tmpfs? FWIW the failures
are pretty consistent, mostly but not always producing the
same bad checksum.

So at boot, the cpu is busy, increasing the probability of this
pattern sensitive failure,  and this one time it failed on every
read of /lib/libdlpi.so.1. With copies=1 this was twice as likely
to happen, and when it did ZFS returned an error on any
attempt to read the file. With copies=2 in this case it doesn't
return an error when attempting to read. Also there were no
set-bit errors this time, but then I have no idea what a set-bit
error is.

On 03/24/10 12:32 PM, Richard Elling wrote:


Clearly, fletcher2 identified the problem.


Ironically, on this hardware it seems it created the problem :-).
However you have been vindicated - it was a pattern sensitive
problem as you have long suggested it might be.

So: that the file is still readable is a mystery, but how it came
to be flagged as bad in ZFS isn't, any more.

Cheers -- Frank




[zfs-discuss] zpool split problem?

2010-03-27 Thread Frank Middleton

Zpool split is a wonderful feature and it seems to work well,
and the choice of which disk got which name was perfect!
But there seems to be an odd anomaly (at least with b132).

Started with c0t1d0s0 running b132  (root pool is called rpool)
Attached c0t0d0s0 and waited for it to resilver
Rebooted from c0t0d0s0
zpool split rpool spool
Rebooted from c0t0d0s0, both rpool and spool were mounted
Rebooted from c0t1d0s0, only rpool was mounted

It seems to me for consistency rpool should not have been
mounted when booting from c0t0d0s0; however that's pretty
harmless. But:

Rebooted from c0t0d0s0  - a couple of verbose errors on the console...
# zpool status rpool
  pool: rpool
 state: UNAVAIL
status: One or more devices could not be used because the label is missing
or invalid.  There are insufficient replicas for the pool to continue
functioning.
action: Destroy and re-create the pool from
a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
rpool UNAVAIL  0 0 0  insufficient replicas
  mirror-0UNAVAIL  0 0 0  insufficient replicas
c0t1d0s0  FAULTED  0 0 0  corrupted data
c0t0d0s0  FAULTED  0 0 0  corrupted data
# zpool status spool
  pool: spool
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
spool   ONLINE   0 0 0
  c0t0d0s0  ONLINE   0 0 0

It seems that ZFS thinks c0t0d0s0 is still part of rpool as well as being
a separate pool (spool).

# zpool export rpool
cannot open 'rpool': I/O error

This worked since zpool list doesn't show rpool any more.

Reboot c0t1d0s0 - no problem (no spool)
Reboot c0t0d0s0 - no problem (no rpool)

The workaround seems to be to export rpool the first time you
boot c0t0d0s0. No big deal but it's a bit scary when it happens.
Has this been fixed in a later release?

Thanks -- Frank


Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-23 Thread Frank Middleton

On 03/22/10 11:50 PM, Richard Elling wrote:
 

Look again, the checksums are different.


Whoops, you are correct, as usual. Just 6 bits out of 256 different...
Last year
expected 4a027c11b3ba4cec bf274565d5615b7b 3ef5fe61b2ed672e ec8692f7fd33094a
actual  4a027c11b3ba4cec bf274567d5615b7b 3ef5fe61b2ed672e ec86a5b3fd33094a
Last Month (obviously a different file)
expected 4b454eec8aebddb5 3b74c5235e1963ee c4489bdb2b475e76 fda3474dd1b6b63f
actual  4b454eec8aebddb5 3b74c5255e1963ee c4489bdb2b475e76 fda354c1d1b6b63f

Look at which digits are different - digits 24 and 53-56 in both cases. But comparing
the bits, there's no discernible pattern. Is this an artifact of the algorithm,
caused by one erring bit always being at the same offset?


don't forget the -V flag :-)


I didn't. As mentioned, there are subsequent set-bit errors (14 minutes
later), but none for this particular incident. I'll send you the results
separately since they are so puzzling. These 16 checksum failures
on libdlpi.so.1 were the only fmdump -eV entries for the entire boot
sequence, except that it started out with one ereport.fs.zfs.data,
whatever that is, for a total of exactly 17 records: 9 in 1 µs, then
8 more 40 ms later, also in 1 µs. Then nothing for 4 minutes, one
more checksum failure (bad_range_sets =), then 10 minutes later,
two with the set-bits error, one for each disk. That's it.


o Why is the file flagged by ZFS as fatally corrupted still accessible?


This is the part I was hoping to get answers for since AFAIK this
should be impossible. Since none of this is having any operational
impact, all of these issues are of interest only, but this is a bit scary!


Broken CPU, HBA, bus, memory, or power supply.


No argument there. Doesn't leave much, does it :-). Since the file itself
appears to be uncorrupted, and the metadata is consistent for all 16
entries, it would seem that the checksum calculation itself is failing,
because everything else appears to be OK in this case. Is there
a way to apply the fletcher2 algorithm interactively as in sum(1)
or cksum(1) (i.e., outside the scope of ZFS) to see if it is in some way
pattern-sensitive with this CPU? Since only a small subset of files is
affected, this should be easy to verify. Start a scrub to heat things
up and then in parallel do checksums in a tight loop...


Transient failures are some of the most difficult to track down. Not all
transient failures are random.


Indeed, although this doesn't seem to be random. The hits to libdlpi.so.1
seem to be quite reproducible, as you've seen from the fmdump log,
although I doubt this particular scenario will happen again. Can you
think of any tools to investigate this? I suppose I could extract the
checksum code from ZFS itself to build one, but that would take quite
a lot of time. Is there any documentation that explains the output of
fmdump -eV? What are set-bits, for example?

I guess not... from man fmdump(1m):

   The error log file contains /Private/ telemetry information
   used by Sun's automated diagnosis software.
..

   Each problem recorded in the fault log is identified by:

     o  The time of its diagnosis

So did ZFS really read 8 copies of libdlpi.so.1 within 1 µs, wait
40 ms and then read another 8 copies in 1 µs again? I doubt it :-).
I bet it took more than 1 µs just to (mis)calculate the checksum (1.6GHz
16 bit cpu).

Thanks -- Frank


Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-22 Thread Frank Middleton

On 03/21/10 03:24 PM, Richard Elling wrote:
 

I feel confident we are not seeing a b0rken drive here.  But something is
clearly amiss and we cannot rule out the processor, memory, or controller.


Absolutely no question of that, otherwise this list would be flooded :-).
However, the purpose of the post wasn't really to diagnose the hardware
but to ask about the behavior of ZFS under certain error conditions.


Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so I'll go 
out
on a limb and speculate that there is something in the bit pattern for that
file that intermittently triggers a bit flip on this system. I'll also 
speculate that
this error will not be reproducible on another system.


Hopefully not, but you never know :-). However, this instance is different.
The example you quote shows both expected and actual checksums to be
the same. This time the expected and actual checksums are different and
fmdump isn't flagging any bad_ranges or set-bits (the behavior you observed
is still happening, but orthogonal to this instance at different times and not
always on this file).

Since the file itself is OK, and the expected checksums are always the same,
neither the file nor the metadata appear to be corrupted, so it appears
that both are making it into memory without error.

It would seem, therefore, that it is the actual checksum calculation that is
failing. But it only fails at boot time, and the calculated (bad) checksums
differ (out of 16, 10, 3, and 3 are the same [1]), so it's not consistent. At
this point it would seem to be cpu or memory, but why only at boot? IMO it's
an old and feeble power supply under strain pushing cpu or memory to a
margin not seen during normal operation, which could be why diagnostics
never see anything amiss (hence the importance of a good power supply).

FWIW the machine passed everything vts could throw at it for a couple
of days. Anyone got any suggestions for more targeted diagnostics?

There were several questions embedded in the original post, and I'm not
sure any of them have really been answered:

o Why is the file flagged by ZFS as fatally corrupted still accessible?
   [Is this new behavior from b111b vs b125?]

o What possible mechanism could there be for the /calculated/ checksums
   of /four/ copies of just one specific file to be bad and no others?

o Why did this only happen at boot to just this one file which also is
   peculiarly subject to the bitflips you observed, also mostly at boot
  (sometimes at scrub)? I like the feeble power supply answer, but why
  just this one file? Bizarre...

# zpool get  failmode rpool
NAME   PROPERTY  VALUE SOURCE
rpool  failmode  wait  default

This machine is extremely memory limited, so I suspect that libdlpi.so.1 is
not in a cache. Certainly, a brand new copy wouldn't be, and there's no
problem writing and (much later) reading the new copy (or the old one,
for that matter). It remains to be seen if the brand new copy gets clobbered
at boot (the machine, for all its faults, remains busily up and operational
for months at a time). Maybe I should schedule a reboot out of curiosity :-).


This sort of specific error analysis is possible after b125. See CR6867188
for more details.


Wasn't this in b125? IIRC we upgraded to b125 for this very reason. There
certainly seems to be an overwhelming amount of data in the various logs!

Cheers -- Frank

[1] This could be (3+1) * 4, where in one instance all 3+1 happen to be the
same. Does ZFS really read all 4 copies 4 times (by fmdump timestamp, 8
within 1 µs, 40 ms later another 8, again within 1 µs)? Not sure what the
fmdump timestamps mean, so it's hard to find any pattern.



Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-21 Thread Frank Middleton

On 03/15/10 01:01 PM, David Dyer-Bennet wrote:


This sounds really bizarre.


Yes, it is. But CR 6880994 is bizarre too.
 

One detail suggestion on checking what's going on (since I don't have a
clue towards a real root-cause determination): Get an md5sum on a clean
copy of the file, say from a new install or something, and check the
allegedly-corrupted copy against that.  This can fairly easily give you a
pretty reliable indication if the file is truly corrupted or not.


With many thanks to Danek Duvall, I got a new copy of libdlpi.so.1

# md5sum /lib/libdlpi.so.1
2468392ff87b5810571572eb572d0a41  /lib/libdlpi.so.1
# md5sum /lib/libdlpi.so.1.orig
2468392ff87b5810571572eb572d0a41  /lib/libdlpi.so.1.orig
# zpool status -v

errors: Permanent errors have been detected in the following files:

//lib/libdlpi.so.1.orig

So here we seem to have an example of a ZFS false positive, the first
I've seen or heard of. The good news is that it is still possible to read the
file, so this augurs well for the ability to boot under this circumstance.
FWIW fmdump does seem to show actual checksum errors on
all four copies in 16 attempts to read them. There were 3 groups of
different bad checksums; within each group the checksum was the
same but differed from the expected value.

Perhaps someone with access could add this to CR 6880994 in the hope
that it might lead to a better understanding.

For the casual reader, CR 6880994 is about a pathological PC that
gets checksum errors on the same set of files at boot, even though the
root pool is mirrored. With copies=2, ZFS can usually repair them. But
after a recent power cycle, all 4 copies reported bad checksums, yet in
reality the file seems to be uncorrupted. The machine has no ECC
and flaky bus parity, so there are plenty of ways for the data to get
messed up. It's a mystery why this only happens at boot, though.






[zfs-discuss] CR 6880994 and pkg fix

2010-03-14 Thread Frank Middleton

Can anyone say what the status of CR 6880994 (kernel/zfs Checksum failures on
mirrored drives) might be?

Setting copies=2 has mitigated the problem, which manifests itself
consistently at boot by flagging libdlpi.so.1, but two recent power cycles
in a row with no normal shutdown have resulted in a permanent error, even
with copies=2 on all of the root pool (and specifically having duplicated
/lib to make sure there are 2 copies).

How can it even be remotely possible to get a checksum failure on mirrored
drives with copies=2? That means all four copies were corrupted? Admittedly
this is on a grotty PC with no ECC and flaky bus parity, but how come the
same file always gets flagged as being clobbered (even though apparently
it isn't)?

The oddest part is that libdlpi.so.1 doesn't actually seem to be corrupted.
nm lists it with no problem and you can copy it to /tmp, rename it, and then
copy it back. objdump and readelf can all process this library with no
problem. But pkg fix flags an error in its own inscrutable way. CCing
pkg-discuss in case a pkg guru can shed any light on what the output of
pkg fix (below) means. Presumably libc is OK, or it wouldn't boot :-).

This with b125 on X86.

# zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
rpool   ONLINE   0 0 0
  mirror-0  ONLINE   0 0 2
c3d1s0  ONLINE   0 0 2
c3d0s0  ONLINE   0 0 2

errors: Permanent errors have been detected in the following files:

//lib/libdlpi.so.1

# pkg fix  SUNWcsl
Verifying: pkg://opensolarisdev/SUNWcsl ERROR
file: lib/libc.so.1
Elfhash: cbb55a2ea24db9e03d9cd08c25b20406896c2fef should be 
0e73a56d6ea0753f3721988ccbd716e370e57c4e
Created ZFS snapshot: 2010-03-13-23:39:17 . ||
Repairing: pkg://opensolarisdev/SUNWcsl
pkg: Requested fix operation would affect files that cannot be modified in 
live image.
Please retry this operation on an alternate boot environment

# nm /lib/libdlpi.so.1
00015562 b Bbss.bss
00015562 b Bbss.bss
00015240 d Ddata.data
00015240 d Ddata.data
000152f8 d Dpicdata.picdata
3ca8 r Drodata.rodata
3ca0 r Drodata.rodata
 A SUNW_1.1
 A SUNWprivate
000150ac D _DYNAMIC
00015562 b _END_
00015000 D _GLOBAL_OFFSET_TABLE_
16c0 T _PROCEDURE_LINKAGE_TABLE_
 r _START_
 U ___errno
 U __ctype
 U __div64
00015562 D _edata
00015562 B _end
43d7 R _etext
3c84 t _fini
 U _fxstat
3c68 t _init
3ca0 r _lib_version
 U _lxstat
 U _xmknod
 U _xstat
 U abs
 U calloc
 U close
 U closedir
 U dgettext
 U dladm_close
 U dladm_dev2linkid
 U dladm_open
 U dladm_parselink
 U dladm_phys_info
 U dladm_walk
2d5c T dlpi_arptype
222c T dlpi_bind
1d6c T dlpi_close
24d0 T dlpi_disabmulti
2c78 T dlpi_disabnotify
24b0 T dlpi_enabmulti
2af4 T dlpi_enabnotify
00015288 d dlpi_errlist
2ce4 T dlpi_fd
25b8 T dlpi_get_physaddr
2e50 T dlpi_iftype
1dc8 T dlpi_info
2d2c T dlpi_linkname
39a8 T dlpi_mactype
000152f8 d dlpi_mactypes
21a0 T dlpi_makelink
1b00 T dlpi_open
2158 T dlpi_parselink
3ca8 r dlpi_primsizes
2598 T dlpi_promiscoff
2578 T dlpi_promiscon
28fc T dlpi_recv
27a4 T dlpi_send
26d4 T dlpi_set_physaddr
2d04 T dlpi_set_timeout
3908 T dlpi_strerror
2d48 T dlpi_style
2384 T dlpi_unbind
1a20 T dlpi_walk
 U free
1998 t fstat
 U getenv
 U gethrtime
 U getmsg
32fc t i_dlpi_attach
3a28 t i_dlpi_buildsap
32ac t i_dlpi_checkstyle
3bfc t i_dlpi_deletenotifyid
39e8 t i_dlpi_getprimsize
3868 t i_dlpi_msg_common
23f4 t i_dlpi_multi
3bd4 t i_dlpi_notifyidexists
3ac8 t i_dlpi_notifyind_process
2f28 t i_dlpi_open
3384 t i_dlpi_passive
24e8 t i_dlpi_promisc
3460 t i_dlpi_strgetmsg
33e4 t i_dlpi_strputmsg
316c t i_dlpi_style1_open
31f0 t i_dlpi_style2_open
19f0 t i_dlpi_walk_link
3a9c t i_dlpi_writesap
 U ifparse_ifspec
 U ioctl
00015240 d libdlpi_errlist
196c t lstat
 U memcpy
 U memset
19c4 t mknod
 U open
 U opendir
 U poll
 U putmsg
 U readdir
 U snprintf
1940 t stat
 U strchr
 U strerror
 U strlcpy
 U strlen

Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Frank Middleton

On 02/17/10 02:38 PM, Miles Nordin wrote:


copies=2 has proven to be mostly useless in practice.


Not true. Take an ancient PC with a mirrored root pool, no
bus error checking and non-ECC memory, that flawlessly
passes every known diagnostic (SMC included).

Reboot with copies=1 and the same files in /usr/lib will get
trashed every time and you'll have to reboot from some other
media to repair it.

Set copies=2 (copy all of /usr/lib, of course) and it will reboot
every time with no problem, albeit with a varying number of
repaired checksum errors, almost always on the same set of
files.
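
For anyone who wants to try the same thing, a rough sketch (the dataset and
library names are only examples, and note that copies only applies to blocks
written after the property is set, so existing files have to be rewritten):

# zfs set copies=2 rpool/ROOT/opensolaris
# cp /usr/lib/libfoo.so.1 /tmp/ && cp /tmp/libfoo.so.1 /usr/lib/libfoo.so.1
(repeat the rewrite for each file you care about; libfoo.so.1 is just a placeholder)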

Without copies=2 this hardware would be useless (well, it ran
Linux just fine), but with it, it has a new lease of life. There is
an ancient CR about this, but AFAIK no one has any idea what
the problem is or how to fix it.

IMO it proves that copies=2 can help avoid data loss in the
face of flaky buses and perhaps memory. I don't think you
should be able to lose data on mirrored drives unless both
drives fail simultaneously, but with ZFS you can. Certainly, on
any machine without ECC memory, or buses without ECC (is
parity good enough?) my suggestion would be to set copies=2,
and I have it set for critical datasets even on machines with
ECC on both. Just waiting for the bus that those SAS controllers
are on to burp at the wrong moment...

Is one counter-example enough?

Cheers -- Frank


Re: [zfs-discuss] most of my space is gone

2010-02-06 Thread Frank Middleton

On 02/ 6/10 11:21 AM, Thorsten Hirsch wrote:


I wonder where ~10G have gone. All the subdirs in / use ~4.5G only
(that might be the size of REFER in opensolaris-7), and my $HOME uses
38.5M, that's correct. But since rpool has a size of  15G there must
be more than 10G somewhere.


Do you have any old Boot Environments (BEs) around? In order to
*really* empty /var/pkg/downloads, you have to delete every old BE,
because /var/pkg/downloads is protected by BE snapshots. Each new
BE seems to take 5GB or so in /var/pkg/downloads, so it adds up fast!

AFAIK there is no way to get around this. You can set a flag so that pkg
tries to empty /var/pkg/downloads, but even though it looks empty, it
won't actually become empty until you delete the snapshots, and IIRC
you still have to manually delete the contents. I understand that you
can try creating a separate dataset and mounting it on /var/pkg, but I
haven't tried it yet, and I have no idea if doing so gets around the
BE snapshot problem. Sadly this renders the whole concept of BEs
rather useless if you boot from smallish SSDs or HDs - my workaround
is to keep the old BEs on a backup disk, just like the old UFS days :-)
(snapshots work, too).
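
For anyone who wants to experiment with the separate-dataset idea anyway
(untested, as noted; the dataset name is a placeholder and rpool's default
/rpool mountpoint is assumed):

# zfs create rpool/varpkg
# cd /var/pkg && find . | cpio -pdum /rpool/varpkg
# zfs set mountpoint=/var/pkg rpool/varpkg
The last step will only work if the new dataset can be mounted over /var/pkg,
so it may have to be done from another BE or after clearing out the original
directory.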

HTH -- Frank


Re: [zfs-discuss] most of my space is gone

2010-02-06 Thread Frank Middleton

On 02/ 6/10 11:50 AM, Thorsten Hirsch wrote:

Uhmm... well, no, but there might be something left over.

When I was doing an image-update last time, my / ran out of space. I
couldn't even beadm destroy any old boot environment, because beadm
told me that there was no space left. So what I did was zfs destroy
/rpool/ROOT/opensolaris-6. After that, opensolaris-6 didn't show up
anymore in beadm list.


When something similar happened to me when updating to snv111b, I
successfully snapshotted the current BE and zfs send/recv'd it to a different
disk, and that freed up around 5GB. No one commented on this (a long time
ago now), but it would be interesting to hear from the experts about the
possible aftermath of running out of space. Presumably zfs list -t snapshot
doesn't show any snapshots at all? If it does, it might be worthwhile deleting
them to see if there are still any unneeded files in /var/pkg.
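
A quick way to check is something like this (the snapshot name is just an
example):

# zfs list -t snapshot -r rpool
# zfs destroy rpool/ROOT/opensolaris-6@2010-01-01-00:00:00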

On 02/ 6/10 12:33 PM, Bill Sommerfeld wrote:


You can set the environment variable PKG_CACHEDIR to place the cache in
an alternate filesystem.


Cool! Would you know when this feature became available?

Thanks.


 


Re: [zfs-discuss] Home ZFS NAS - 2 drives or 3?

2010-01-30 Thread Frank Middleton

On 01/30/10 05:33 PM, Ross Walker wrote:

On Jan 30, 2010, at 2:53 PM, Mark white...@gmail.com wrote:


I have a 1U server that supports 2 SATA drives in the chassis. I have
2 750 GB SATA drives. When I install opensolaris, I assume it will
want to use all or part of one of those drives for the install. That
leaves me with the remaining part of disk 1, and all of disk 2.

Question is, how do I best install OS to maximize my ability to use
ZFS snapshots and recover if one drive fails?


Where were you planning to send the snapshots? There's been a lot
of discussion about this on this list, but my solution is to mirror the
entire system and zfs send/recv to it periodically to keep a live backup.
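
In case it helps, the periodic send/recv boils down to something like this
(the pool and snapshot names are only placeholders):

# zfs snapshot -r tank@backup-1
# zfs send -R tank@backup-1 | zfs recv -Fd backup
and later, incrementally:
# zfs snapshot -r tank@backup-2
# zfs send -R -i @backup-1 tank@backup-2 | zfs recv -Fd backup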


Alternatively, I guess I could add a small USB drive to use solely for
the OS and then have all of the 2 750 drives for ZFS. Is that a bad
idea since the OS drive will be standalone?


Just install the OS on the first drive and attach the second drive to
form an rpool mirror; there are wikis and blogs describing how, and a
rough sketch follows.
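
Something like this, with the device names as placeholders for whatever your
disks really are (on SPARC you'd use installboot rather than installgrub):

# zpool attach rpool c7t0d0s0 c7t1d0s0
# zpool status rpool                 (wait for the resilver to finish)
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c7t1d0s0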


After more than a year or so of experience with ZFS on drive-constrained
systems, I am convinced that it is a really good idea to keep the root pool
and the data pools separate. AFAIK you could set up two slices on each disk
and mirror the results. But actually I'm not sure why you shouldn't use
your idea of a USB drive for the root pool. If it breaks you simply reinstall
(or restore it from a snapshot on your data pool after booting from a CD).
I suppose you could mirror the USB drive, too, but if you can stand the
downtime after a failure, that probably isn't necessary. Of course, SSDs are
getting pretty cheap in bootable sizes and will probably last forever if you
don't swap to them, and that would be an even better solution. USB SSD
thumb drives seem to be quite cheap these days.

Then you'd have a full-disk mirrored data pool and a fast bootable OS pool;
if you go the SSD route I'd go for at least 32GB. Of course you could get
a 1TB USB drive to boot from, and use it to keep a backup of the data pool,
but if it failed, you'd be SOL until you replaced it. IMO that would be the
best 3-disk solution. Should be interesting to hear from the gurus about this...

Cheers -- Frank




Re: [zfs-discuss] Panic running a scrub

2010-01-20 Thread Frank Middleton

On 01/20/10 04:27 PM, Cindy Swearingen wrote:

Hi Frank,

I couldn't reproduce this problem on SXCE build 130 by failing a disk in
mirrored pool and then immediately running a scrub on the pool. It works
as expected.


The disk has to fail whilst the scrub is running. It has happened twice now,
once with the bottom half of the mirror, and again with the top half.
 

Any other symptoms (like a power failure?) before the disk went offline?
It is possible that both disks went offline?


Neither. The system is on a pretty beefy UPS, and one half of the mirror
was definitely online (zpool status just before panic showed one disk
offline and the pool as degraded).


We would like to review the crash dump if you still have it, just let me
know when its uploaded.


Do you need the unix.0, vmcore.0 or both? I'll add either or both as
attachments to newly created Bug 14012, Panic running a scrub,
when you let me know which one(s) you want.

Thanks -- Frank




Re: [zfs-discuss] Panic running a scrub

2010-01-20 Thread Frank Middleton

On 01/20/10 05:55 PM, Cindy Swearingen wrote:

Hi Frank,

We need both files.


The vmcore is 1.4GB. An http upload is never going to complete.
Is there an ftp-able place to send it, or can you download it if I
post it somewhere?

Cheers -- Frank


Re: [zfs-discuss] Panic running a scrub

2010-01-20 Thread Frank Middleton

On 01/20/10 04:27 PM, Cindy Swearingen wrote:

Hi Frank,

I couldn't reproduce this problem on SXCE build 130 by failing a disk in
mirrored pool and then immediately running a scrub on the pool. It works
as expected.


As noted, the disk mustn't go offline until well after the scrub has started.

There's another wrinkle. There are some COMSTAR iscsi targets on this
pool. If there are no initiators accessing any of them, the scrub completes
with no errors after 6 hours. If one specific target is active, the panic
ensues reproducibly at about 5h30m or so.

The precise configuration has 2 disks on one LSI controller as a
mirrored pool (whole disks - no slices). Around 750GB of 1.3TB was
in use when the most recent iscsi target was created. The pool
is read-mostly, so it probably isn't fragmented. The zvol has
copies=1; compression off (no dedupe with snv124). The initiator
is VirtualBox running on Fedora C10 on AMD64 and the target disk
has 32 bit Fedora C12 installed as whole disk, which I believe is EFI.

To reproduce this might require setting up a COMSTAR iscsi
target on a mirrored pool, formatting it with an EFI label, and
then running a scrub. Another, similar, target has OpenSolaris
installed on it, and it doesn't seem to cause a panic on a scrub
if it is running; AFAIK it doesn't use EFI, but I have not run
a scrub with it active since converting to COMSTAR either.
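
For anyone trying to reproduce it, the COMSTAR side is roughly as follows
(the zvol name and size are placeholders, and the exact commands may vary
between builds):

# svcadm enable -r svc:/network/iscsi/target:default
# zfs create -V 20g tank/vboxdisk
# sbdadm create-lu /dev/zvol/rdsk/tank/vboxdisk
# stmfadm add-view <GUID printed by sbdadm>
# itadm create-target
then point the VirtualBox initiator at the target, install the guest, and
start a scrub.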

This wouldn't explain why one or the other disk randomly goes
offline and it may be a red herring. But the scrub now runs to
completion just as it always has. Since I can't get FC12 to boot
from the EFI disk in VirtualBox, I may reinstall FC12 without
EFI and see if that makes a difference, but it is an extremely
slow process since it takes almost 6 hours for the panic to occur
each time and there's no practical way to relocate the zvol
to the start of the pool.

HTH -- Frank






[zfs-discuss] Panic running a scrub

2010-01-19 Thread Frank Middleton

This is probably unreproducible, but I just got a panic whilst
scrubbing a simple mirrored pool on SXCE snv124. Evidently
one of the disks went offline for some reason and shortly
thereafter the panic happened. I have the dump and the
/var/adm/messages containing the trace.

Is there any point in submitting a bug report?

The panic starts with:

Jan 19 13:27:13 host6 ^Mpanic[cpu1]/thread=2a1009f5c80:
Jan 19 13:27:13 host6 unix: [ID 403854 kern.notice] assertion failed: 0 ==
zap_update(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_SCRUB_BOOKMARK,
sizeof (uint64_t), 4, &dp->dp_scrub_bookmark, tx), file:
../../common/fs/zfs/dsl_scrub.c, line: 853

FWIW when the system came back up, it resilvered with no
problem and now I'm rerunning the scrub.



Re: [zfs-discuss] Fwd: The 100, 000th beginner question about a zfs server

2009-11-23 Thread Frank Middleton

On 11/23/09 10:10 AM, David Dyer-Bennet wrote:


Is there enough information available from system configuration utilities
to make an automatic HCL (or unofficial HCL competitor) feasible?  Someone
could write an application people could run which would report their
opinion on how well it works, plus the self-reported identity of all key
components?  (It could report uptime, too, as one very small objective
rating of stability.)


IIRC, the HCL doesn't really talk about applications. We have some really
flaky PCs that run OpenSolaris beautifully, and their uptime is measured
in months (basically only new releases or long power cuts bring them
down). Would I recommend them for a ZFS-based server? Not a
chance! But they make super-reliable X terminals...

As Richard Elling has pointed out so eloquently, a reliable storage
system has to be engineered to minimize or eliminate SPOFs, and I
doubt you'll ever find that on an HCL, which really serves a different
purpose, IMO.

Cheers -- Frank




Re: [zfs-discuss] dedupe question

2009-11-12 Thread Frank Middleton

Got some out-of-curiosity questions for the gurus if they
have time to answer:

Isn't dedupe in some ways the antithesis of setting copies > 1?
We go to a lot of trouble to create redundancy (n-way mirroring,
raidz-n, copies=n, etc) to make things as robust as possible and
then we reduce redundancy with dedupe and compression :-).

What would be the difference in MTTDL between a scenario where
dedupe ratio is exactly two and you've set copies=2 vs. no dedupe
and copies=1?  Intuitively MTTDL would be better because of the
copies=2, but you'd lose twice the data when DL eventually happens.

Similarly, if hypothetically dedupe ratio = 1.5 and you have a
two-way mirror, vs. no dedupe and a 3 disk raidz1,  which would
be more reliable? Again intuition says the mirror because there's
one less device to fail, but device failure isn't the only consideration.

In both cases it sounds like you might gain a bit in performance,
especially if the dedupe ratio is high because you don't have to
write the actual duplicated blocks on a write and on a read you
are more likely to have the data blocks in cache. Does this make
sense?

Maybe there are too many variables, but it would be so interesting
to hear of possible decision making algorithms.  A similar discussion
applies to compression, although that seems to defeat redundancy
more directly.  This analysis requires good statistical maths skills!

Thanks -- Frank


Re: [zfs-discuss] zfs code and fishworks fork

2009-10-28 Thread Frank Middleton

On 10/28/09 10:18 AM, Tim Cook wrote:


If Nexenta was too expensive, there's nothing Sun will ever offer that
will fit your price profile. Home electronics is not their business
model and never will be.


True, but this was discussed on a different thread some time
ago. Sun's prices on X86s are actually quite competitive, if you
can even find a comparable machine (i.e., with ECC on buses and
memory). Given the Google report on memory failures that Richard
Elling dug up a while ago, surely no one in their right mind would
want to run anything the least bit important on a machine without
such ECC, and I doubt you could configure a decent file server /new/
for less than $2K. If you can, I'm sure we'd all like to hear about it!

However, you are certainly correct that Sun's business model isn't
aimed at retail, although one wonders about the size of the market
for robust SOHO/Home file/media servers that no one seems to be
addressing right now (well, Apple, maybe, although they are not
explicit about it and they don't offer ZFS...).

Cheers -- Frank



Re: [zfs-discuss] iscsi/comstar performance

2009-10-18 Thread Frank Middleton

On 10/13/09 18:35, Albert Chin wrote:


Maybe this will help:
   
http://mail.opensolaris.org/pipermail/storage-discuss/2009-September/007118.html


Well, it does seem to explain the scrub problem. I think it might
also explain the slow boot and startup problem - the VM only has
564M available, and it is paging a bit. Doing synchronous i/o for
swap makes no sense. Is there an official way to disable this
behavior?

Does anyone know if the old iscsi system is going to stay around,
or will COMSTAR replace it at some point? The 64K metadata
block at the start of each volume is a bit awkward, too - it seems
to throw VBox into a tizzy when (failing to) boot MSWXP.

The options seem to be

a) stay with the old method and hope it remains supported

b) figure out a way around the COMSTAR limitations

c) give up and use NFS

Using ZFS as an iscsi backing store for VirtualBox images seemed
like a great idea, so simple to maintain and robust, but COMSTAR
seems to have sand-bagged it a bit. The performance was quite
acceptable before but it is pretty much unusable this way.
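For what it's worth, option a) is just the one-liner setup of the old
target daemon. Roughly like this (the zvol name is only an example, and
I haven't re-checked the syntax on b124):

# zfs create -V 20g space/vboxdisk
# zfs set shareiscsi=on space/vboxdisk
# iscsitadm list target -v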

Any ideas would be much appreciated

Thanks -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mount ZFS on Dual Boot Machine (open)solaris

2009-10-16 Thread Frank Middleton

On 10/15/09 23:31, Cameron Jones wrote:


by cross-mounting do you mean mounting the drives on 2 running OS's?
that wasn't really what i was looking for but nice to know the option
is there, even tho not recommended!


No, since you really can't run two OSs at the same time unless you use
zones. Maybe someone more expert than I could comment on the idea
of running OpenSolaris on a Solaris 10 or sxce host - e.g., in the case
of sxce, if they were both, say snv124?
 

my only real aim was to have the 3 disks accessible when booting into
either OS so i could share archived data between them.


That's what you should do (and I do it all the time). Put your user
data a separate pool and import only that on both OS instances. So
in your case, install OpenSolaris in a 32GB or more slice 0 partition
of the mirror and /export on (say) slice 1. My data pool is called space
and it has a number of file systems most of which are mounted on
/export (e.g., /export/home/userz for user userz). You could do this
by zfs snap of the OpenSolaris rpool from Solaris, and then zfs recv
after running format (follow the guide for restoring a zfs rpool at
http://docs.sun.com/app/docs/doc/819-5461/ghzur?a=view).
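In outline it is just this (pool, slice and user names are only examples,
and exporting the pool before rebooting into the other OS avoids needing
a forced import):

# zpool create space c1t0d0s1
# zfs create -o mountpoint=/export/home/userz space/userz
# zpool export space
  ... reboot into the other OS ...
# zpool import space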


it sounds like i shouldn't have any problem cold-cross-mounting :)
although does bug 11358 only apply to opensolaris or would it also be
possible to apply to solaris 10 too?


Not sure. sxce and Open Solaris both do the dreaded archive update,
so AFAIK Solaris 10 would do it too, possibly with bad consequences.
A workaround would be to make sure the other rpool is not mounted
when you reboot, but one whoops and you might be toast. Better to
keep data and OS  separate. Then you can do zfs snaps for rpool
backups and something different if you like for user data backups.


also i thought i read in the doco that ZFS assigns an id to each
drive which is unique to the OS  - if i try to mount it into another
OS would this id keep changing each time i switch?


AFAIK it doesn't. I have sxce and OpenSolaris running alternately
on one host and they mount the data pool with no problems at all. I
no longer even try to cross mount the rpools because my OpenSolaris
installs kept getting trashed by 11358, but at that time sxce was
on UFS. I believe the ids are assigned when the pool is created,
so if you zfs recv an rpool from another host with an otherwise
identical configuration, it will try (and correctly fail) to mount a
zombie data pool when you boot it. I assume the id is ignored
on the root pool at boot time or it wouldn't be able to boot at all.
Undoubtedly a guru will chip in here if this is incorrect :-)

HTH -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mount ZFS on Dual Boot Machine (open)solaris

2009-10-16 Thread Frank Middleton

On 10/16/09 09:29, I wrote:


I assume the id is ignored on the root pool at boot time or it
wouldn't be able to boot at all. Undoubtedly a guru will chip in here
if this is incorrect :-)


Of course this was hogwash. You create  the pool before receiving
the snapshot, so the ID is local. One of the many nice things about
ZFS is that it is so logically consistent. I'd never want to go back!

Cheers -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] primarycache and secondarycache properties on Solaris 10 u8

2009-10-15 Thread Frank Middleton

IIRC the trigger for this thread was the suggestion that
primarycache=none be set on datasets used for swap.  


Presumably swap only gets used when memory is low or
exhausted, so it would it be correct to say that it wouldn't
make any sense for swap to be in /any/ cache? If this isn't
what primarycache=none means, shouldn't there be a
disable-cache-entirely flag for datasets used for swap?
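In other words the suggestion amounts to something like this (rpool/swap
is just the usual zvol name - adjust to suit):

# zfs set primarycache=none rpool/swap
# zfs set secondarycache=none rpool/swap
# zfs get primarycache,secondarycache rpool/swap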

I guess reads from swap must be buffered somewhere, so
it would be an optimization to have such reads buffered
in a read cache. But wouldn't the read cache be real small
at this point? It's enough to make your head spin :-)

-- Frank


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mount ZFS on Dual Boot Machine (open)solaris

2009-10-15 Thread Frank Middleton

On 10/15/09 20:36, Cameron Jones wrote:


My question is tho, since I can boot into either OpenSolaris or
Solaris (but not both at the same time obviousvly :) i'd like to be
able to mount the other disks into whatever host OS i boot into.



Is this possible  recommended?


Definitely possible. Where do you keep your user data? It isn't
clear that there is much utility in cross mounting rpools from
Solaris/sxce to Open Solaris; better to keep your user data in
one or more separate data pools and to just mount them. That
simplifies backups, too.


Is there any scope for inconsistency if say i upgrade OpenSolaris
with new ZFS versions but continue mounting a mirror in Solaris with
old versions?


You have to watch out for the gratuitous update-archive problem
http://defect.opensolaris.org/bz/show_bug.cgi?id=11358 at
reboot. Otherwise AFAIK you just have to be careful. So far ZFS
seems to have kept backwards compatibility. Just don't accidentally
do a zpool upgrade :-). Because of 11358, I would not recommend
cross mounting the rpools. But it isn't clear that that is what you
really want to achieve...


Many thanks,
cam

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] iscsi/comstar performance

2009-10-13 Thread Frank Middleton

After a recent upgrade to b124, decided to switch to COMSTAR
for iscsi targets for VirtualBox hosted on AMD64 Fedora C10. Both
target and initiator are running zfs under b124. This combination
seems unbelievably slow compared to  the old iscsi subsystem.

A scrub of a local 20GB disk on the target took 16 minutes. A scrub
of a 20GB iscsi disk took 106 minutes! It seems to take much longer
to boot from iscsi, so it seems to be reading more slowly too.

There are a lot of variables - switching to Comstar, snv124, VBox
3.08, etc., but such a dramatic loss of performance probably has a
single cause. Is anyone willing to speculate?

Thanks -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] MPT questions

2009-10-08 Thread Frank Middleton

In an attempt to recycle some old PATA disks, we bought some
really cheap PATA/SATA adapters, some of which actually work
to the point where it is possible to boot from a ZFS installation
(e.g., c1t2d0s0). Not all PATA disks work, just Seagates, it would
seem, but not Maxtors. I wonder why? probe-scsi-all sees
Seagate but not Maxtor disks plugged into the same adapter.

Such disks have proven invaluable as a substitute for rescue
CDs until such CDs become possible.

The odd thing is that booting from another disk, ZFS can't see
the adapted disk even though it is bootable. Could the reason
be that there's no /dev/rdsk/c1t2d0, but there are c1t0d0, etc.?
Format sees the disk but zpool import doesn't (this is on SPARC
sun4u). This isn't at all important, just curious as to why this
might be and why zpool import can't see it at all, but zpool
create can.

Gotta say how happy we are with the MPT driver and the LSI
SAS controller - fast and reliable - petabytes of i/o and not a
single zfs checksum error!

This has little to do with ZFS, but should it be possible to
see a PATA CD or DVD connected to an MPT (LSI) SAS controller
via one of these adapters? Thought I'd ask before forking out
for a SATA DVD drive - just hate to put perfectly good drives
out for recycling. Maybe someone can recommend a writable
BlueRay SAS drive  that is known to work with the MPT driver
instead...

Thanks -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best way to convert checksums

2009-10-01 Thread Frank Middleton

On 10/01/09 05:08 AM, Darren J Moffat wrote:


In the future there will be a distinction between the local and the
received values see the recently (yesterday) approved case PSARC/2009/510:

http://arc.opensolaris.org/caselog/PSARC/2009/510/20090924_tom.erickson


Currently non-recursive incremental streams send properties and full
streams don't. Will the p flag reverse its meaning for incremental
streams? For my purposes the current behavior is the exact opposite
of what I need and it isn't obvious that the case addresses this
peculiar inconsistency without going through a lot of hoops. I suppose
the new properties can be sent initially so that subsequent incremental
streams won't override the possibly changed local properties, but that
seems so complicated :-). If I understand the case correctly, we can
now set a flag that says ignore properties sent by any future incremental
non-recursive stream. This instead of having a flag for incremental
streams that says don't send properties. What happens if sometimes
we do and sometimes we don't? Sounds like a static property when a
dynamic flag is really what is wanted and this is a complicated way of
working around a design inconsistency. But maybe I missed something :-)

So what would the semantics of the new p flag be for non-recursive
incremental streams?

Thanks -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Would ZFS work for a high-bandwidth video SAN?

2009-09-30 Thread Frank Middleton

On 09/29/09 10:23 PM, Marc Bevand wrote:


If I were you I would format every 1.5TB drive like this:
* 6GB slice for the root fs


As noted in another thread, 6GB is way too small. Based on
actual experience, an upgradable rpool must be more than
20GB. I would suggest at least 32GB; out of 1.5TB that's
still negligible. Recent release notes for image-update say
that at least 8GB free is required for an update. snv111b
as upgraded from a CD installed image takes about 11GB without
any user applications like Firefox. Note also that a nominal
1.5TB drive really only has 1.36TB of actual space as reported
by zfs.

Can't speak to the 12-way mirror idea, but if you go this
route you might keep some slices for rpool backups. I have
found having a disk with such a backup invaluable...

How do you plan to do backups in general?

Cheers -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Would ZFS work for a high-bandwidth video SAN?

2009-09-30 Thread Frank Middleton

On 09/30/09 12:59 PM, Marc Bevand wrote:


It depends on how minimal your install is.


Absolutely minimalist install from live CD subsequently updated
via pkg to snv111b. This machine is an old 32 bit PC used now
as an X-terminal, so doesn't need any additional software. It
now has a bigger slice of a larger pair of disks :-). snv122
also takes around 11GB after emptying /var/pkg/download.

# uname -a
SunOS host8 5.11 snv_111b i86pc i386 i86pc Solaris
# df -h
Filesystem                Size  Used Avail Use% Mounted on
rpool/ROOT/opensolaris-2   34G   13G   22G  37% /


There's around 765MB in /var/pkg/download that could be deleted,
and 1GB's worth of snapshots left by previous image-updates,
bringing it down to around 11GB, consistent with a minimalist
SPARC snv122 install with /var/pkg/download emptied and all but
the current BE and all snapshots deleted.


The OpenSolaris install instructions recommend 8GB minimum, I have


It actually says 8GB free space required. This is on top of the
space used by the base installation. This 8GB makes perfect sense
when you consider that the baseline has to be snapshotted, and
new code has to be downloaded and installed in a way that can be
rolled back. I can't explain why the snv111b baseline is 11GB vs.
the 6GB of the initial install, but this was a default install
followed by default image-updates.


one OpenSolaris 2009.06 server using about 4GB, so I thought 6GB
would be sufficient. That said I have never upgraded the rpool of
this server, but based on your commends I would recommend an rpool
of 15GB to the original poster.


The absolute minimum for an upgradable rpool is 20GB, for both
SPARC and X86. This assumes you religiously purge all unnecessary
files (such as /var/pkg/download) and keep swap, /var/dump,
/var/crash and /opt on another disk. You *really* don't want to
run out of space doing an image-update. The result is likely
to require a restore from backup of the rpool, or at best, loss
of some space that seems to vanish down a black hole.

Technically, the rpool was recovered from a baseline snapshot
several times onto a 20GB disk until I figured out empirically
that 8GB of free space was required for the image-update. I
really doubt your mileage will vary. Prudence says that 32GB
is much safer...

Cheers -- Frank


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OS install question

2009-09-28 Thread Frank Middleton

On 09/28/09 12:40 AM, Ron Watkins wrote:


Thus, im at a loss as to how to get the root pool setup as a 20Gb
slice


20GB is too small. You'll be fighting for space every time
you use pkg. From my considerable experience installing to a
20GB mirrored rpool, I would go for 32GB if you can.

Assuming this is X86, couldn't you simply use fdisk to
create whatever partitions you want and then install to
one of them? Than you should be able to create the data
pool using another partition. You might need to use a
weird partition type temporarily. On SPARC there doesn't
seem to be a problem using slices for different zpools,
in fact it insists on using a slice for the root pool.

Cheers -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing Wikipedia tmpfs article (was Re: Which directories must be part of rpool?)

2009-09-28 Thread Frank Middleton

Trying to move this to a new thread, although I don't think it
has anything to do with ZFS :-)

On 09/28/09 08:54 AM, Chris Gerhard wrote:


TMPFS was not in the first release of 4.0. It was introduced to boost
the performance of diskless clients which no longer had the old
network disk for their root file systems and hence /tmp was now over
NFS.

Whether there was a patch that brought it back into 4.0 I don't
recall but I don't think so. 4.0.1 would have been the first release
that actually had it.
--chris


On 09/28/09 03:00 AM, Joerg Schilling wrote:


I am not sure whether my changes will be kept as wikipedia prefers to
keep badly quoted wrong information before correct information supplied by
people who have first hand information.


They actually disallow first hand information. Everything on Wikipedia
is supposed to be confirmed by secondary or tertiary sources. That's why I
asked if there was any supporting documentation - papers, manuals,
proceedings, whatever, that describe the introduction of tmpfs before
1990. If you were to write a personal page (in Wikipedia if you like)
that describes the history of tmpfs, then you could refer to it in
the tmpfs page as a secondary source. Actually, I suppose if it was
in the source code itself, that would be pretty irrefutable!

http://en.wikipedia.org/wiki/Wikipedia:Reliable_sources

Wikipedia also has a lofi page (http://en.wikipedia.org/wiki/Lofi) that
redirects to loop mount. It has no historical section at all... There
is no fbk (file system) page.

Cheers -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OS install question

2009-09-28 Thread Frank Middleton

On 09/28/09 01:22 PM, David Dyer-Bennet wrote:


That seems truly bizarre.  Virtualbox recommends 16GB, and after doing an
install there's about 12GB free.


There's no way Solaris will install in 4GB if I understand what
you are saying. Maybe fresh off a CD when it doesn't have to
download a copy first, but the reality is that 16GB is not workable
unless you never want to do an image update. What version
are you running? Have you ever tried pkg image-update?

# uname -a
SunOS host8 5.11 snv_111b i86pc i386 i86pc Solaris
# df -h
Filesystem   Size  Used Avail Use% Mounted on
rpool/ROOT/opensolaris-2  34G   13G   22G  37% /


# du -sh /var/pkg/download/
762M    /var/pkg/download/

this after deleting all old BEs and all snapshots but not emptying
/var/pkg/download; swap/boot are on different slices.

SPARC is similar; snv122 takes 11GB after deleting old BEs, all
snapshots, *and* /var/pkg/download; *without* /opt, swap,
/var/crash, /var/dump, /var/tmp, /var/run and /export...

AFAIK it is absolutely impossible to do a pkg image-update (say)
from snv111b to snv122 without at least 9GB free (it says 8GB
in the documentation). If the baseline is 11GB, you need 20GB
for an install, and that leaves you zip to spare.

Obvious reasons include before and after snaps, download before
install, and total rollback capability. This is all going to cost
some space. I believe there is a CR about this, but IMO when
you can get 2TB of disk for $200 it's hard to complain. 32GB
of SSD is not unreasonable and 16GB simply won't hack it.
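Concretely, the sort of pre-update housekeeping I mean looks roughly like
this (the BE name is only an example - check beadm list first and don't
destroy anything you might still want to boot):

# beadm list
# beadm destroy opensolaris-1
# zfs list -t snapshot -r rpool
# rm -rf /var/pkg/download/*
# zfs list -o space rpool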

All the above is based on actual and sometimes painful experience.
You *really* don't want to run out of space during an update. You'll
almost certainly end up restoring your boot disk if you do and if
you don't, you'll never get back all the space. Been there, done
that...

Cheers -- Frank



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-27 Thread Frank Middleton

On 09/27/09 03:05 AM, Joerg Schilling wrote:


BTW: Solaris has tmpfs since late 1987.


Could you fix the Wikipedia article? http://en.wikipedia.org/wiki/TMPFS

it first appeared in SunOS 4.1, released in March 1990
 

It is a de-facto standard since then as it e.g. helps to reduce compile times.


You bet! Provided the compiler doesn't use /var/tmp as IIRC early
versions of gcc once did...

-- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Fixing Wikipedia tmpfs article (was Re: Which directories must be part of rpool?)

2009-09-27 Thread Frank Middleton

On 09/27/09 11:25 AM, Joerg Schilling wrote:

Frank Middleton wrote:



Could you fix the Wikipedia article? http://en.wikipedia.org/wiki/TMPFS

it first appeared in SunOS 4.1, released in March 1990


It appeared with SunOS-4.0. The official release was probably February 1987,
but there have been betas before IIRC.


Do you have any references one could quote so that the Wikipedia
article can be corrected? The section on Solaris is rather skimpy
and could do with some work...

AFAIK this has nothing to do with ZFS. I wonder if we should
move it to another discussion. Apologies to the OP for hijacking
your thread, although I think the original question has been
answered only too thoroughly :-)

Cheers -- Frank



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-26 Thread Frank Middleton

On 09/25/09 09:58 PM, David Magda wrote:


The contents of /var/tmp can be expected to survive between boots (e.g.,
/var/tmp/vi.recover); /tmp is nuked on power cycles (because it's just
memory/swap):


Yes, but does mapping it to /tmp have any issues regarding booting
or image-update in the context of this thread? IMO nuking is a good
thing - /tmp and /var/tmp get really cluttered up after a few months,
the downside of robust hardware and software :-). Not sure I really
care about recovering vi edits in the case of UPS failure...


If a program is creating and deleting large numbers of files, and those
files aren't needed between reboots, then it really should be using /tmp.


Quite. But some lazy programmer of 3rd party software decided to use
the default tmpnam() function and I don't have access to the code :-(.

 tmpnam()
     The tmpnam() function always generates a file name using the
     path prefix defined as P_tmpdir in the <stdio.h> header. On
     Solaris systems, the default value for P_tmpdir is /var/tmp.


Similar definition for [/tmp] Linux FWIW:


Yes, but unless they fixed it recently (>= RHFC11), Linux doesn't actually
nuke /tmp, which seems to be mapped to disk. One side effect is that (like
MSWindows) AFAIK there isn't a native tmpfs, so programs that create and
destroy large numbers of files run orders of magnitude slower there than
on Solaris - assuming the application doesn't use /var/tmp for them :-).
Compilers and code generators are typical of applications that do this,
though they don't usually do synchronous i/o as said programmer appears
to have done.

I suppose /var/tmp on zfs would never actually write these files unless
they were written synchronously. In the context of this thread, for
those of us with space constrained boot disks/ssds, is it OK to map
/var/tmp to /tmp, and /var/crash, /var/dump, and swap to a separate
data pool in the context of being able to reboot and install new images?
I've been doing so for a long time now with no problems that I know of.
Just wondering what the gurus think...
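For the record, the mapping is just a vfstab entry. One way to do it
(shown from memory, so double-check the fields on your build before
rebooting) is to put /var/tmp on tmpfs alongside /tmp:

#device    device    mount     FS     fsck  mount    mount
#to mount  to fsck   point     type   pass  at boot  options
swap       -         /var/tmp  tmpfs  -     yes      -

An lofs mount of /tmp onto /var/tmp would have much the same effect.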

Haven't seen any definitive response regarding /opt, which IMO should
be a good candidate since the installer makes it a separate fs anyway.
/usr/local can definitely be kept on a separate pool. I wouldn't move
/root. I keep a separate /export/home/root and have root cd to it via
a script in /root that also sets HOME, although I noticed on snv123
that logging on as root succeeded even though it couldn't find bash
(defaulted to using sh). This may be a snv123 bug, but it is a huge
improvement on past behavior. I daresay logging on as root might
also work if root's home directory was awol. Haven't tried it...

Cheers -- Frank





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-26 Thread Frank Middleton

On 09/26/09 12:11 PM, Toby Thain wrote:


Yes, but unless they fixed it recently (>= RHFC11), Linux doesn't
actually nuke /tmp, which seems to be mapped to disk. One side
effect is that (like MSWindows) AFAIK there isn't a native tmpfs,
...


Are you sure about that? My Linux systems do.

http://lxr.linux.no/linux+v2.6.31/Documentation/filesystems/tmpfs.txt


OK, so you can mount /dev/shm on /tmp and /var/tmp, but that's
not the default, at least as of RHFC10. I have files in /tmp
going back to Feb 2008 :-). Evidently, quoting Wikipedia,
tmpfs is supported by the Linux kernel from version 2.4 and up.
http://en.wikipedia.org/wiki/TMPFS, FC1 6 years ago. Solaris /tmp
has been a tmpfs since 1990...

Now back to the thread...



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-26 Thread Frank Middleton

On 09/26/09 05:25 PM, Ian Collins wrote:


Most of /opt can be relocated


There isn't much in there on a vanilla install (X86 snv111b)

# ls /opt
DTT  SUNWmlib


http://www.sun.com/bigadmin/features/articles/nvm_boot.jsp


You pretty much answered the OP with this link. Thanks for
posting it!

Cheers -- Frank

 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] White box server for OpenSolaris

2009-09-25 Thread Frank Middleton

On 09/25/09 11:08 AM, Travis Tabbal wrote:

... haven't heard if it's a known
bug or if it will be fixed in the next version...


Out of courtesy to our host, Sun makes some quite competitive
X86 hardware. I have absolutely no idea how difficult it is
to buy Sun machines retail, but it seems they might be missing
out on an interesting market - robust and scalable SOHO servers
for the DIY gang - certainly OEMs like us recommend them,
although there doesn't seem to be a single-box file+application
server in the lineup which might be a disadvantage to some.

Also, assuming Oracle keeps the product line going, we plan to
give them a serious look when we finally have to replace those
sturdy old SPARCS. Unfortunately there aren't entry level SPARCs
in the lineup, but sadly there probably isn't a big enough market
to justify them and small developers don't need the big iron.

It would be interesting to hear from Sun if they have any specific
recommendations for the use of Suns for the DIY SOHO market; AFAIK
it is the profits from hardware that are going a long way to support
Sun's support of FOSS that we are all benefiting from, and there's
a good bet that OpenSolaris will run well on Sun hardware :-)

Cheers -- Frank
 



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread Frank Middleton

On 09/25/09 04:44 PM, Lori Alt wrote:


rpool
rpool/ROOT
rpool/ROOT/snv_124 (or whatever version you're running)
rpool/ROOT/snv_124/var (you might not have this)
rpool/ROOT/snv_121 (or whatever other BEs you still have)
rpool/dump
rpool/export
rpool/export/home
rpool/swap


Unless your machine is so starved for physical memory that
you couldn't possibly install anything, AFAIK you can always
boot without dump and swap, so even if your data pool can't
be mounted, you should be OK. I've done many a reboot and
pkg image-update with dump and swap inaccessible. Of course
with no dump, you won't get, well, a dump, after a panic...

Having /usr/local (IIRC this doesn't even exist in a straight
OpenSolaris install) in a shared space on your data pool is
quite useful if you have more than one machine unless you have
multiple architectures. Then it turns into the /opt problem.

Hiving off /opt does not seem to prevent booting, and having
it on a data pool doesn't seem to prevent upgrade installs.
The big problem with putting /opt on a shared pool is when
multiple hosts have different /opts. Using legacy mounts seems
to be the only way around this. Do the gurus have a technical
explanation why putting /opt in a different pool shouldn't work?
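By legacy mounts I mean something along these lines on each host
(host5-opt is only an example dataset name):

# zfs set mountpoint=legacy space/host5-opt
  and then in /etc/vfstab on host5:
space/host5-opt  -  /opt  zfs  -  yes  -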

/var/tmp is a strange beast. It can get quite large, and be a
serious bottleneck if mapped to a physical disk and used by any
program that synchronously creates and deletes large numbers of
files. I have had no problems mapping /var/tmp to /tmp. Hopefully
a guru will step in here and explain why this is a bad idea, but
so far no problems...

A 32GB SSD is marginal for a root pool, so shrinking it as much
as possible makes a lot of sense until bigger SSDS become cost
effective (not long from now I imagine). But if you already have
a 16GB or 32GB SSD, or a dedicated boot disk of 32GB or less, then
you can be SOL unless you are very careful to empty /var/pkg/download,
which doesn't seem to get emptied even if you set the magic flag.

HTH -- Frank


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] backup disk of rpool on solaris

2009-09-20 Thread Frank Middleton

On 09/20/09 03:20 AM, dick hoogendijk wrote:

On Sat, 2009-09-19 at 22:03 -0400, Jeremy Kister wrote:

I added a disk to the rpool of my zfs root:
# zpool attach rpool c1t0d0s0 c1t1d0s0
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0

I waited for the resilver to complete, then i shut the system down.

then i physically removed c1t0d0 and put c1t1d0 in it's place.

I tried to boot the system, but it panics:


Afaik you can't remove the first disk. You've created a mirror of two
disks from either which you may boot the system. BUT the second disk
must remain where it is. You can set the bios to boot from it if the
first disk fails, but you may not *swap* them.


That's my experience also. If you are trying to make a bootable
disk to keep on the shelf, there's an excellent example here:
http://forums.sun.com/thread.jspa?threadID=5345546

IMO this should go on the wiki. I think it's a great example of
the power of ZFS. I can't imagine doing anything like this
so easily with any legacy file system...

Cheers -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Incremental backup via zfs send / zfs receive

2009-09-20 Thread Frank Middleton

A while back I posted a script that does individual send/recvs
for each file system, sending incremental streams if the remote
file system exists, and regular streams if not.

The reason for doing it this way rather than a full recursive
stream is that there's no way to avoid sending certain file
systems such as swap, and it would be nice not to always send
certain properties such as mountpoint, and there might be file
systems you want to keep on the receiving end.

The problem with the regular stream is that most of the file
system properties (such as mountpoint) are not copied as they
are with a recursive stream. This may seem an advantage to some,
(e.g., if the remote mountpoint is already in use, the mountpoint
seems to default to legacy). However, did I miss anything in the
documentation, or would it be worth submitting an RFE for an
option to send/recv properties in a non-recursive stream?

Oddly, incremental non-recursive streams do seem to override
properties, such as mountpoint, hence the /opt problem. Am I
missing something, or is this really an inconsistency? IMO
non-recursive regular and incremental streams should behave the
same way and both have options to send or not send properties.
For my purposes the default behavior is reversed for what I
would like to do...

Thanks -- Frank

Latest version of the  script follows; suggestions for improvements
most welcome, especially the /opt problem where source and destination
hosts have different /opts (host6-opt and host5-opt here) - see
ugly hack below (/opt is on the data pool because the boot disks
- soon to be SSDs - are filling up):

#!/bin/bash
#
# backup is the alias for the host receiving the stream
# To start, do a full recursive send/receive and put the
# name of the initial snapshot in cur_snap, In case of
# disasters, the older snap name is saved in cur_snap_prev
# and there's an option not to delete any snapshots when done.
#
if test ! -e cur_snap; then echo "cur_snap not found"; exit; fi
P=`cat cur_snap`
mv -f cur_snap cur_snap_prev
T=`date +%Y-%m-%d:%H:%M:%S`
echo $T > cur_snap
echo "snapping to space@$T"
echo "Starting backup from space@$P to space@$T at `date`" >> snap_time
zfs snapshot -r space@$T
echo "snapshot done"
for FS in `zfs list -H | cut -f 1`
do
  # Does this file system already exist on the receiving side?
  RFS=`ssh backup zfs list -H $FS 2>/dev/null | cut -f 1`
  case $FS in
  space/filesystem-to-skip-here)   # placeholder: file systems to skip
    echo "skipping $FS"
    ;;
  *)
    if test "$RFS"; then
      if [ "$FS" = "space/swap" ]; then
        echo "skipping $FS"
      else
        echo "do zfs send -i $FS@$P $FS@$T | ssh backup zfs recv -vF $RFS"
        zfs send -i $FS@$P $FS@$T | ssh backup zfs recv -vF $RFS
      fi
    else
      echo "do zfs send $FS@$T | ssh backup zfs recv -v $FS"
      zfs send $FS@$T | ssh backup zfs recv -v $FS
    fi
    if [ "$FS" = "space/host5-opt" ]; then
      echo "do ssh backup zfs set mountpoint=legacy space/host5-opt"
      ssh backup zfs set mountpoint=legacy space/host5-opt
    fi
    ;;
  esac
done

echo "--Ending backup from space@$P to space@$T at `date`" >> snap_time

DOIT=1
while [ $DOIT -eq 1 ]
do
  read -p "Delete old snapshot y/n " REPLY
  REPLY=`echo $REPLY | tr '[:upper:]' '[:lower:]'`
  case $REPLY in
  y)
    ssh backup zfs destroy -r space@$P
    echo "Remote space@$P destroyed"
    zfs destroy -r space@$P
    echo "Local space@$P destroyed"
    DOIT=0
    ;;
  n)
    echo "Skipping:"
    echo "  ssh backup zfs destroy -r space@$P"
    echo "  zfs destroy -r space@$P"
    DOIT=0
    ;;
  *)
    echo "Please enter y or n"
    ;;
  esac
done



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Reboot seems to mess up all rpools

2009-09-14 Thread Frank Middleton

[Originally posted to indiana-discuss]

On certain X86 machines there's a hardware/software glitch
that causes odd transient checksum failures that always seem
to affect the same files even if you replace them. This has
been submitted as a bug:

Bug 11201 -  Checksum failures on mirrored drives - now
CR 6880994 P4 kernel/zfs Checksum failures on mirrored drives

We have SPARC based ZFS servers where we keep a copy of this
rpool so we can more easily replace the damaged files (usually
system libraries). In addition, to check the validity of the
zfs send stream of the ZFS server rpool, there's a copy of that
as well. For good reasons there might be several rpools in
this data pool at any given time.

When the ZFS server is rebooted, it tries to update the boot
archive of every rpool it can find, including the X86 archive,
which fails because it's the wrong architecture.

The ZFS server is currently at snv103, but the backup server has
an additional disk with snv111b on it, which was recently updated
to snv122. However, if you boot snv103 and then reboot, it will
also update the snv122 boot archive, rendering snv122 unbootable.
All versions up to and including snv122 exhibit this behavior.

I'm not sure why updating the boot archive would do this, but surely
this is a bug. Reboot should only update its own archive, and not
any ZFS archives at all if it is running from UFS. Before submitting
a bug report, I thought I'd check here to see if a) if this is has
already been reported, and b) if I have the terminology right.

Thanks -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reboot seems to mess up all rpools

2009-09-14 Thread Frank Middleton

Absent any replies to the list, submitted as a bug:

http://defect.opensolaris.org/bz/show_bug.cgi?id=11358

Cheers -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raid-Z Issue

2009-09-11 Thread Frank Middleton

On 09/11/09 03:20 PM, Brandon Mercer wrote:


They are so well known that simply by asking if you were using them
suggests that they suck.  :)  There are actually pretty hit or miss
issues with all 1.5TB drives but that particular manufacturer has had
a few more than others.


FWIW I have a few of them in mirrored pools and they have been
working flawlessly for several months now with LSI controllers.
The workload is bursty - mostly MDA driven code generation and
compilation of  1M KLoC applications and they work well enough
for that. Also by now probably a PetaByte of zfs send/recvs and
many scrubs, never a timeout and never a checksum error. They
are all rev CC1H. So your mileage may vary, as they say...

Cheers -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Using ZFS iscsi disk as ZFS disk

2009-09-09 Thread Frank Middleton

Is there any reason why an iscsi disk could not be used
to extend an rpool? It would be pretty amazing if it
could but I thought I'd try it anyway :-)

The 20GB disk I am using to try ZFS booting on SPARC
ran out of space doing an image update to snv122, so I
thought I'd try extending it with an iscsi disk on the
data pool (same machine, different disks).

After formatting the disk with an SMI label, trying to
add the new disk results in

# zpool add rpool c4t600144F04AA7AA68d0
cannot label 'c4t600144F04AA7AA68d0': EFI labeled devices are 
not supported on root pools.
#

Should it be possible to do this (SPARC snv103), and if so,
how to make it work? Use a different iscsi host maybe?
Perhaps I should have used a plain file, or could it be
impossible? Maybe I should split the UFS boot mirror and try
this on one of those disks instead :-(

Separately, I have succeeded in using an iscsi disk (same
hardware) as a ZFS disk in an AMD64 Virtualbox, so it is
possible, although /var/adm/messages is full of messages
like this:

Corrupt label; wrong magic number

even though the disk works just fine in the VM.

Any hints much appreciated

Thanks -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Incremental backup via zfs send / zfs receive

2009-09-08 Thread Frank Middleton

On 09/07/09 07:29 PM, David Dyer-Bennet wrote:


Is anybody doing this [zfs send/recv] routinely now on 2009-6
OpenSolaris, and if so can I see your commands?


Wouldn't a simple recursive send/recv work in your case? I
imagine all kinds of folks are doing it already. The only problem
with it, AFAIK, is when a new fs is created locally without also
being created on the backup disk (unless this now works with
zfs  V3). The following works with snv103. If it works there, it
should work with 2009-6. The script method may have the advantage
of not destroying file systems on the backup that don't exist
on the source, but I have not tested that.
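By a simple recursive send/recv I mean roughly the following one-liner
form (pool names are only examples, it is untested in exactly this form,
and recv -F will happily roll back whatever is on the receiving side, so
try it on scratch pools first):

# zfs snapshot -r space@today
# zfs send -R space@today | zfs recv -dF backuppool
  and on subsequent runs:
# zfs send -R -i space@yesterday space@today | zfs recv -dF backuppool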

ZFS send/recv is pretty cool, but at least with older versions, it
takes some tweaking to get right. Rather than send to a local drive,
I'm sending to a live remote system, which is some ways is more
complicated since there might be things like /opt and xxx/swap
that you might not want to even send. Finally, at least with ZFS
version 3, an incremental send of a filesystem that doesn't exist
on the far side doesn't work either, so one needs to test for that.

Given this, a simple send of a recursive snapshot AFAIK isn't going to
work. I am no bash expert, so this script probably can do with lots
of improvements, but it seems to do what I need it to do. You would
have to extensively modify it for your local needs; you would have
to remove the ssh backup and fix it to receive to your local disk. I
include it here in response to your request in the hope that it
might be useful. Note, as written, it will create space/swap but it
won't send updates.

The pool I'm backing up is called space and the target host is called
backup, an alias in /etc/hosts. When the machines switch roles, I
edit both /etc/hosts so the stream can go the other way. This script
probably won't work for rpools; there is lots of documentation about
that in previous posts to this list.

My solution to the rpool problem is to receive it locally to an
alternate root and then send that, but this works here if the
rpool isn't your only pool, of course.

If any zfs/bash gurus out there can suggest improvements, they
would be much appreciated, especially ways to deal with the /opt
problem (which probably relates to the general rpool question).
Currently the /opts for each host are set mountpoint=legacy,
but that is not a great solution :-(.

Cheers -- Frank

#!/bin/bash
P=`cat cur_snap`
rm -f cur_snap
T=`date +%Y-%m-%d:%H:%M:%S`
echo $T > cur_snap
echo "snapping to space@$T"
zfs snapshot -r space@$T
echo "snapshot done"
for FS in `zfs list -H | cut -f 1`
do
  # Does this file system already exist on the receiving side?
  RFS=`ssh backup zfs list -H $FS 2>/dev/null | cut -f 1`
  if test "$RFS"; then
    if [ "$FS" = "space/swap" ]; then
      echo "skipping $FS"
    else
      echo "do zfs send -i $FS@$P $FS@$T | ssh backup zfs recv -vF $RFS"
      zfs send -i $FS@$P $FS@$T | ssh backup zfs recv -vF $RFS
    fi
  else
    echo "do zfs send $FS@$T | ssh backup zfs recv -v $FS"
    zfs send $FS@$T | ssh backup zfs recv -v $FS
  fi
done

ssh backup zfs destroy -r space@$P
zfs destroy -r space@$P



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Yet another where did the space go question

2009-09-06 Thread Frank Middleton

An attempt to pkg image-update from snv111b to snv122 failed
miserably for a number of reasons which are probably out of
scope here. Suffice it to say that it ran out of disk space
after the third attempt.

Before starting, I was careful to make a baseline snapshot,
but rolling back to that snapshot has not freed up all the
space - this on a small disk dedicated to experimenting with
ZFS booting on SPARC. The disk is nominally 20GB.

After zfs rollback -rR rpool/ROOT/opensolaris@baseline from
a different BE  (snv103 booted from UFS)

# zpool list
NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
rpool  17.5G  10.1G  7.39G  57%  ONLINE  -
space  1.36T   314G  1.05T  22%  ONLINE  -

# zfs list -r -o space rpool
NAME                        AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rpool                       7.11G  10.1G         0     20K              0      10.1G
rpool/ROOT                  7.11G  10.1G         0     18K              0      10.1G
rpool/ROOT/opensolaris      7.11G  10.1G      942K   10.0G              0      68.6M
rpool/ROOT/opensolaris/opt  7.11G  68.6M         0   68.6M              0          0

Before the aborted pkg image-updates, the rpool took around 6GB,
so 4GB has vanished somewhere. Even if pkg put its updates
in a well hidden place (there are no hidden directories in / ),
surely the rollback should have deleted them.

# zfs list -t snapshot
NAME                                  USED  AVAIL  REFER  MOUNTPOINT
rpool@baseline                           0      -    20K  -
rpool/ROOT@baseline                      0      -    18K  -
rpool/ROOT/opensolaris@baseline       718K      -  10.0G  -
rpool/ROOT/opensolaris/opt@baseline      0      -  68.6M  -

The rollback obviously worked because afterwards even the
pkg set-publisher changes were gone, and other post-snapshot
files were deleted. If the worst come to the worst I could
obviously save the snapshot to a file and then restore it,
but it sure would be nice to know where the 4GB went.

BTW one image-update failure occurred because there was an X86
rpool mounted to an alternate root, and pkg somehow found it
and seemed to get confused about X86 vs. SPARC, insisting on
trying to create a menu.lst in /rpool/boot, which, of course,
doesn't exist on SPARC. I suppose this should be a bug...

Thanks -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yet another where did the space go question

2009-09-06 Thread Frank Middleton

Correction

On 09/06/09 12:00 PM, I wrote:


(there are no hidden directories in / ),


Well, there is .zfs, of course, but it is normally hidden,
apparently by default on SPARC rpool, but not on X86 rpool
or non-rpool pools on either. Hmmm. I don't recollect setting
the snapdir property on any pools, ever.
-
Arrg! It just failed again!

# pkg image-update --be-name=snv122
DOWNLOAD                       PKGS        FILES          XFER (MB)
Completed                 1486/1486  73091/73091    1520.59/1520.59

WARNING: menu.lst file /rpool/boot/menu.lst does not exist,
 generating a new menu.lst file
pkg: Unable to clone the current boot environment.

# BE_PRINT_ERR=true beadm create newbe
be_get_uuid: failed to get uuid property from BE root dataset user properties.
be_get_uuid: failed to get uuid property from BE root dataset user properties.
# zfs list -t snapshot | grep newbe
rpool/ROOT/opensolaris@newbe       30K  -  11.9G  -
rpool/ROOT/opensolaris/opt@newbe     0  -  68.6M  -

So it can create a new BE. So what happened this time?
I guess I'll try again with BE_PRINT_ERR=true...

Is the get uuid property failure fatal to pkg but not to beadm? Has
anyone managed to go from snv111b to snv122 on SPARC?

Thanks -- Frank


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yet another where did the space go question

2009-09-06 Thread Frank Middleton

Near Success! After 5 (yes, five) attempts, managed to do
an update of snv111b to snv122, until it ran out of space
again. Looks like I need to get a bigger disk...

Sorry about the monolog, but there might be someone on this
list trying to use pkg on SPARC who, like me, has been
unable to subscribe to the indiana list, so an update
might be useful to any such person... Perhaps someone
who can might forward this to the appropriate list --
the issues are known CR's, but don't seem to be mentioned
in the release notes.

On 09/06/09 04:55 PM, I wrote:


WARNING: menu.lst file /rpool/boot/menu.lst does not exist,
generating a new menu.lst file
pkg: Unable to clone the current boot environment.


1) If there isn't a directory /rpool/boot, pkg will fail
2) If you try again after mkdir /rpool/boot, it will
   create menu.lst. If it fails for any reason and
   you have to restart, then:
3) If there is a menu.lst containing "opensolaris-1"
   it will fail again even if you had used --be-name=.
4) If you delete menu.lst it will fail - touch it after
   deleting it (the CRs are ambiguous about this).

So to do this upgrade, you must do mkdir /rpool/boot
and touch /rpool/boot/menu.lst before you start. It
might just work if you do this, but only if you have
at least 11GB of space to spare (Google says 8GB).

BTW pkg always says /rpool/boot/menu.lst does not exist
even if it does.

http://defect.opensolaris.org/bz/show_bug.cgi?id=6744
says Fixed in source

http://defect.opensolaris.org/bz/show_bug.cgi?id=7880
says accepted. But the fix for 6744 messes up 7880.
This is making a SPARC upgrade really painful, especially
annoying since SPARC doesn't even use grub (or menu.lst).

Cheers -- Frank

PS My hat's off to the ZFS and pkg teams! An amazing
accomplishment and a few glitches are to be expected. I'm
sure there are fixes in the works, but it would seem upgrading
to snv122 isn't in the cards unless I get a bigger 3rd boot
disk...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool

2009-09-03 Thread Frank Middleton

It was someone from Sun that recently asked me to repost here
about the checksum problem on mirrored drives. I was reluctant
to do so because you and Bob might start flames again, and you
did! You both sound very defensive, but of course I would never
make an unsubstantiated speculation that you might have vulnerable
hardware :-). But in case you do, please don't shoot the
messenger...

Instead of being negative, how about some conjectures of your
own about this? Here's a summary of what is happening:

An old machine with mirrored drives and a suspect mobo (maybe
not checking PCI parity) gets checksum errors on reboot and scrub.
With copies=1 it fails to repair them. With copies=2 it apparently
fixes them, but zcksummon shows quite clearly that on a scrub,
zfs finds and repairs them again on every scrub, even though
scrub shows no errors. Typically these files are system
libraries and unless you actually replace them, they are
never truly repaired.

Although I really don't think this is caused by cosmic rays,
are you also saying that PCs without ECC on memory and/or buses
will *never* experience a glitch? You obviously don't play the
lottery :-) [ZFS errors due to memory hits seem far more likely
than winning a 6 ball lottery for typical retail consumer loads]

On 09/02/09 06:54 PM, Tim Cook wrote:


Define "more systems".  How many people do you think are on 121?  And of


Absolutely no idea. Enough, though.
 

those, how many are on the zfs mailing list?  And of those, how many


Probably - all of them (yes, this is an unsubstantiated speculation).


have done a scrub recently to see the checksum errors?  Do you have some
proof to validate your beliefs?


If you had read the thread carefully, you would note that a scrub actually
clears the errors (but zcksummon shows that they really aren't cleared). And
doesn't the guide tell us to run scrubs frequently? I am sure we all dutifully
do so :-). I'd be quite happy to send you the proof.


REGARDLESS, had you read all the posts to this thread, you'd know you've
already been proven wrong:


Wrong about what? Reading posts before they are posted?

I have read every post most carefully. Having experienced checksum
failures on mirrored drives for 4 months now (and there's a CR
against snv115 for a similar problem), what exactly do you think I
am trying to prove, or what beliefs? After 4 months of hearing the
hardware being blamed for the checksum problem (which is easy to
reproduce against snv111b), all I'm doing is agreeing that it is
likely triggered by some kind of soft hardware glitch, we just
don't know what the glitch might be. The SPoFs on this machine
are the disk controller, the PCI bus, and memory, (and cpu, of
course). Take your pick.

FWIW it always picks on SUNWcsl (libdlpi.so.1) - 3 or 4 times now,
and more recently, /usr/share/doc/SUNWmusicbrainz/COPYING.bz2.
I am skeptical that the disk controller is picking on certain
files, so that leaves memory and the bus. Take your pick. New
files get added to the list quite infrequently. But it could also
be a pure software bug - some kind of race condition, perhaps.


On Wed, Sep 2, 2009 at 11:15 AM, Brent Jones wrote:
I see this issue on each of my X4540's, 64GB of ECC memory, 1TB drives.
Rolling back to snv_118 does not reveal any checksum errors, only
snc_121

So, the commodity hardware here doesn't hold up, unless Sun isn't
validating their equipment (not likely, as these servers have had no
hardware issues prior to this build)


Exactly. My whole point. Glad to hear that Sun hardware is as reliable as
ever!  I hope Richard's new and improved zcksummon will shed more light
on this...

Cheers -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool

2009-09-02 Thread Frank Middleton

On 09/02/09 05:40 AM, Henrik Johansson wrote:


For those of us which have already upgraded and written data to our
raidz pools, are there any risks of inconsistency, wrong checksums in
the pool? Is there a bug id?


This may not be a new problem insofar as it may also affect mirrors.
As part of the ancient mirrored drives should not have checksum
errors thread, I used Richard Elling's amazing zcksummon script
http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
to help diagnose this (thanks, Richard, for all your help).

The bottom line is that hardware glitches (as found on cheap PCs
without ECC on buses and memory) can put ZFS into a mode where it
detects bogus checksum errors. If you set copies=2, it seems to
always be able to repair them, but they are never actually repaired.
Every time you scrub, it finds a checksum error on the affected file(s)
and it pretends to repair it (or may fail if you have copies=1 set).

Note: I have not tried this on raidz, only mirrors, where it is
highly reproducible. It would be really interesting to see if
raidz gets results similar to the mirror case when running zcksummon.
Note I have NEVER had this problem on SPARC, only on certain
bargain-basement PCs (used as X-Terminals) which as it turns out
have mobos notorious for not detecting bus parity errors.

If this is the same problem, you can certainly mitigate it by
setting copies=2 and actually copying the files (e.g., by
promoting a snapshot, which I believe will do this - can someone
confirm?). My guess is that snv121 has done something to make
the problem more likely to occur, but the problem itself is
quite old (predates snv100). Could you share with us some details
of your hardware, especially how much memory and if it has ECC
or bus parity?
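To be clear, copies=2 only applies to blocks written after the property is
set, so the existing files have to be rewritten before they really have two
copies. One crude way to do that is to copy everything into a fresh dataset
created with copies=2 (dataset names are only examples, and you obviously
don't want to do this to libraries that are in use):

# zfs create -o copies=2 space/data2
# cd /space/data && find . -print | cpio -pdmu /space/data2

After that space/data2 has two copies of every block and the original can
be kept or destroyed as you see fit.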

Cheers -- Frank

On 09/02/09 05:40 AM, Henrik Johansson wrote:

Hi Adam,


On Sep 2, 2009, at 1:54 AM, Adam Leventhal wrote:


Hi James,

After investigating this problem a bit I'd suggest avoiding deploying
RAID-Z
until this issue is resolved. I anticipate having it fixed in build 124.





Regards

Henrik
http://sparcv9.blogspot.com/




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool

2009-09-02 Thread Frank Middleton

On 09/02/09 10:01 AM, Gaëtan Lehmann wrote:


I see the same problem on a workstation with ECC RAM and disks in mirror.
The host is a Dell T5500 with 2 cpus and 24 GB of RAM.


Would you know if it has ECC on the buses? I have no idea if or what
Solaris does on X86 to check or correct bus errors, but I vaguely
remember seeing a thread about it. Asking, because it really does
seem to require a hardware problem to make this happen.

Did you try zcksummon?

Cheers -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool

2009-09-02 Thread Frank Middleton

On 09/02/09 10:34 AM, Simon Breden wrote:

I too see checksum errors occurring for the first time using OpenSolaris 2009.06 
on the /dev package repository at version snv_121.

I see the problem occur within a mirrored boot pool (rpool) using SSDs.

Hardware is AMD BE-2350 (ECC) processor with 4GB ECC memory on MCP55 chipset, 
although SATA is using mpt driver on a SuperMicro AOC-USAS-L8i controller card.

More here:
http://breden.org.uk/2009/09/02/home-fileserver-handling-pool-errors/


Boy, that looks familiar. Did you try zcksummon to see if the checksums are
really being fixed? If it is the same problem I encountered, then they are
not, even though the scrub says no errors (and the problem goes back before
snv100). Your hardware seems pretty beefy, though. Note that iostat -Ene
never reported any hard errors in my case even though the mobo was known to
have problems, so hard errors do not explain the problem.

Cheers -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Expanding a raidz pool?

2009-09-02 Thread Frank Middleton



On Sep 2, 2009, at 7:14 PM, rarok wrote:


I'm just a casual ZFS user, but you want something that doesn't exist yet.
Most consumers want this, but Sun is not interested in that
market. Being able to grow an existing RAIDZ just by adding more disks to it
would be great, but at this moment there isn't anything like that.


Out of curiosity, what do the folks who want to grow their raidzs
do for backups? Is restoring a backup to a newly created enlarged
raidz any more dangerous than the rewriting involved in doing it
on the fly? Hardware is so cheap these days, why not make a backup
raidz server (power it up only to do backups, or better yet, switch
to it periodically to make sure it works), and when the time comes
to make the raidz bigger, just do it, one server at a time? You
can run off the backup whilst the new, larger server is resilvering
and have negligable downtime that way.

If you are really cheap, get a couple of huge USB drives and do
the backups there. Either way, they are important, and zfs
send/recv is such a great way of making verifiable backups.

Cheers -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool

2009-09-02 Thread Frank Middleton

On 09/02/09 12:31 PM, Richard Elling wrote:


I believe this is a different problem. Adam, was this introduced in b120?


Doubtless you are correct as usual. However, if this is a new problem,
how did it get through Sun's legendary testing process unless it is
(as you have always maintained) triggered by a hardware problem? If
so, I believe that any new CR would be regarded as a duplicate of
any CR that described the problem you and I researched, even if they
have different root causes. Of course this seems to be new as of snv121,
so one can only speculate that it might be a latent problem or
a new one. Do you think that there are separate mirror vs. raidz
issues?
 

There is more work that can be leveraged from zcksummon, perhaps
I'll get a few spare moments to test and update the procedure in the next
few days.


If you think it would be relevant, you know I can reproduce this at
will. I wonder if any Sun hardware users have experienced this problem.
So far IIRC the only reports are Asus and Dell. Does anyone else
recollect the thread about how Solaris does (or does not) do bus
error checking on x86?

Cheers -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How Virtual Box handles the IO

2009-07-31 Thread Frank Middleton

Great to hear a few success stories! We have been experimentally
running ZFS on really crappy hardware and it has never lost a
pool. Running on VB with ZFS/iscsi raw disks we have yet to see
any errors at all. On sun4u with lsi sas/sata it is really rock
solid. And we've been going out of our way to break it because of
bad experiences with ntfs, ext2 and UFS as well as many disk
failures (ever had fsck run amok?).

On 07/31/09 12:11 PM, Richard Elling wrote:


Making flush be a nop destroys the ability to check for errors
thus breaking the trust between ZFS and the data on medium.
-- richard


Can you comment on the issue that the underlying disks were,
as far as we know, never powered down? My understanding is
that disks usually try to flush their caches as quickly as
possible to make room for more data, so in this scenario
things were probably quiet after the guest crash, so likely
what ever was in the cache would have been flushed anyway,
certainly by the time the OP restarted VB and the guest.

Could you also comment on CR 6667683, which I believe is proposed
as a solution for recovery in this very rare case? I understand
that the ZILs are allocated out of the general pool. Is there
a ZIL for the ZILs, or does this make no sense?

As the one who started the whole ECC discussion, I don't think
anyone has ever claimed that lack of ECC caused this loss of
a pool or that it could. AFAIK lack of ECC can't be a problem
at all on RAIDZ vdevs, only with single drives or plain mirrors.
I've suggested an RFE for the mirrored case to double buffer
the writes in this case, but disabling checksums pretty much
fixes the problem if you don't have ECC, so it isn't worth
pursuing. You can disable checksum per file system, so this
is an elegant solution if you don't have ECC memory but
you do mirror. No mirror IMO is suicidal with any file system.
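
For reference, it really is just a per-dataset property; a minimal
sketch with a made-up dataset name:

# zfs set checksum=off tank/media
# zfs get checksum tank/media

Note that the property only affects blocks written after the change;
anything already on disk keeps (and keeps verifying) its old checksums.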

Has anyone ever actually lost a pool on Sun hardware other than
by losing too many replicas or operator error? As you have so
eloquently pointed out, building a reliable storage system is
an engineering problem. There are a lot of folks out there who
are very happy with ZFS on decent hardware. On crappy hardware
you get what you pay for...

Cheers -- Frank (happy ZFS evangelist)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-27 Thread Frank Middleton

On 07/27/09 01:27 PM, Eric D. Mudama wrote:


Everyone on this list seems to blame lying hardware for ignoring
commands, but disks are relatively mature and I can't believe that
major OEMs would qualify disks or other hardware that willingly ignore
commands.


You are absolutely correct, but if the cache flush command never makes
it to the disk, then it won't see it. The contention is that by not
relaying the cache flush to the disk, VirtualBox caused the OP to lose
his pool.
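
For what it's worth, VirtualBox does document a per-disk knob for this.
A sketch, assuming the virtual disk hangs off the emulated PIIX3 IDE
controller as LUN 0 (the SATA/AHCI path uses a different device name),
with a made-up VM name:

# VBoxManage setextradata "osol-vm" \
    "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0

Setting IgnoreFlush to 0 makes VirtualBox pass the guest's cache-flush
requests through to the host instead of dropping them, which is the
default behaviour.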

IMO this argument is bogus because AFAIK the OP didn't actually power
his system down, so the data would still have been in the cache, and
presumably would eventually have been written. The out-of-order writes
theory is also somewhat dubious, since he was able to write 10TB without
VB relaying the cache flushes. This is all highly hardware dependent,
and AFAIK no one ever asked the OP what hardware he had, instead,
blasting him for running VB on MSWindows. Since IIRC he was using raw
disk access, it is questionable whether or not MS was to blame, but
in general it simply shouldn't be possible to lose a pool under
any conditions.

It does raise the question of what happens in general if a cache
flush doesn't happen if, for example, a system crashes in such a way
that it requires a power cycle to restart, and the cache never gets
flushed. Do disks with volatile caches attempt to flush the cache
by themselves if they detect power down? It seems that the ZFS team
recognizes this as a problem, hence the CR to address it.

It turns out that (at least on this almost 4 year old blog)
http://blogs.sun.com/perrin/entry/the_lumberjack that the ZILs
/are/ allocated recursively from the main pool.  Unless there is
a ZIL for the ZILs, ZFS really isn't fully journalled, and this
could be the real explanation for all lost pools and/or file
systems. It would be great to hear from the ZFS team that writing
a ZIL, presumably a transaction in its own right, is protected
somehow (by a ZIL for the ZILs?).

Of course the ZIL isn't a journal in the traditional sense, and
AFAIK it has no undo capability the way that a DBMS usually has,
but it needs to be structured so that bizarre things that happen
when something as robust as Solaris crashes don't cause data loss.
The nightmare scenario is when one disk of a mirror begins to
fail and the system comes to a grinding halt where even stop-a
doesn't respond, and a power cycle is the only way out. Who
knows what writes may or may not have been issued or what the
state of the disk cache might be at such a time.

-- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-26 Thread Frank Middleton

On 07/25/09 04:30 PM, Carson Gaspar wrote:


No. You'll lose unwritten data, but won't corrupt the pool, because
the on-disk state will be sane, as long as your iSCSI stack doesn't
lie about data commits or ignore cache flush commands. Why is this so
difficult for people to understand? Let me create a simple example
for you.


Are you sure about this example? AFAIK metadata refers to things like
the file's name, atime, ACLs, etc., etc. Your example seems to be more
about how a journal works, which has little to do with metatdata other
than to manage it.


Now if you were too lazy to bother to follow the instructions properly,
we could end up with bizarre things. This is what happens when storage
lies and re-orders writes across boundaries.


On 07/25/09 07:34 PM, Toby Thain wrote:


The problem is assumed *ordering*. In this respect VB ignoring flushes
and real hardware are not going to behave the same.


Why? An ignored flush is ignored. It may be more likely in VB, but it
can always happen. It mystifies me that VB would in some way alter
the ordering. I wonder if the OP could tell us what actual disks and
controller he used to see if the hardware might actually have done
out-of-order writes despite the fact that ZFS already does write
optimization. Maybe the disk didn't like the physical location of
the log relative to the data so it wrote the data first? Even then
it isn't obvious why this would cause the pool to be lost.

A traditional journalling file system should survive the loss of a flush.
Either the log entry was written or it wasn't. Even if the disk, for
some bizarre reason, writes some of the actual data before writing the
log, the repair process should undo that.

If written properly, it will use the information in the most current
complete journal entry to repair the file system. Doing synchs are
devastating to performance so usually there's an option to disable
them, at the known risk of losing a lot more data. I've been using
SPARCs and Solaris from the beginning. Ever since UFS supported
journalling, I've never lost a file unless the disk went totally bad,
and none since mirroring. Didn't miss fsck either :-)

Doesn't ZIL effectively make ZFS into a journalled file system (in
another thread, Bob Friesenhahn says it isn't, but I would submit
that the general opinion is correct that it is; log and journal
have similar semantics). The evil tuning guide is pretty emphatic
about not disabling it!

My intuition (and this is entirely speculative) is that the ZFS ZIL
either doesn't contain everything needed to restore the superstructure,
or that if it does, the recovery process is ignoring it. I think I read
that the ZIL is per-file system, but one hopes it doesn't rely on the
superstructure recursively, or this would be impossible to fix (maybe
there's a ZIL for the ZILs :) ).

On 07/21/09 11:53 AM, George Wilson wrote:


We are working on the pool rollback mechanism and hope to have that
soon. The ZFS team recognizes that not all hardware is created equal and
thus the need for this mechanism. We are using the following CR as the
tracker for this work:

6667683 need a way to rollback to an uberblock from a previous txg


so maybe this discussion is moot :-)

-- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-25 Thread Frank Middleton

On 07/25/09 02:50 PM, David Magda wrote:


Yes, it can be affected. If the snapshot's data structure / record is
underneath the corrupted data in the tree then it won't be able to be
reached.


Can you comment on if/how mirroring or raidz mitigates this, or tree
corruption in general? I have yet to lose a pool even on a machine
with fairly pathological problems, but it is mirrored (and copies=2).

I was also wondering if you could explain why the ZIL can't
repair such damage.

Finally, a number of posters blamed VB for ignoring a flush, but
according to the evil tuning guide, without any application syncs,
ZFS may wait up to 5 seconds before issuing a synch, and there
must be all kinds of failure modes even on bare hardware where
it never gets a chance to do one at shutdown. This is interesting
if you do ZFS over iscsi because of the possibility of someone
tripping over a patch cord or a router blowing a fuse. Doesn't
this mean /any/ hardware might have this problem, albeit with much
lower probability?

Thanks

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] The importance of ECC RAM for ZFS

2009-07-24 Thread Frank Middleton

On 07/24/09 04:35 PM, Bob Friesenhahn wrote:

 Regardless, it [VirtualBox] has committed a crime.


But ZFS is a journalled file system! Any hardware can lose a flush;
it's just more likely in a VM, especially when anything Microsoft
is involved, and the whole point of journalling is to prevent things
like this happening. However the issue is moot since CR 6667683 is
being addressed. Here's a related thought - does it make sense to
mirror ZFS on iscsi if the host drives are themselves ZFS mirrors?

The whole question of the requirement for ECC depends on your
tolerance for loss of files vs. errors in files. As Richard
Elling points out, there are other sources of error (e.g.,
no checking of PCI parity). But that isn't relevant to the ECC
on main memory question. You can disable checksumming, and then
ZFS is no worse in this regard than any other file system; bad
files get read and you either notice or you don't, but you won't
lose any because of fatal checksum errors and you still have all
the other great features of ZFS.

If you don't mirror, all bets are off. You should set copies=2 or
higher and cross your fingers. You should also disable file
checksumming in ZFS and in that sense degenerate to the behavior
of lesser file systems. However mirroring doesn't buy you much
here because it evidently doesn't double buffer the write before
calculating the checksum, so a stray bitflip can cause metadata or
data corruption, causing a mirrored file to have an unrecoverable
checksum failure (of course there are many other reasons to mirror).

The real question is - what is the probability of this occurring?
IMO the typical SOHO user has a 1 in 10 to 1 in 100 chance of this
happening in a year of reasonably constant operation (a few dozen
writes/day). I believe that this can be mitigated by setting
copies=2, a good idea anyway if you have biggish disks since, as
Richard Elling has pointed out in his excellent blogs, if you need
to resilver after a disk failure you have a rather large possibility
of a disk read error causing file loss and copies=2 also mitigates
this. Note that hopefully fixing CR 6667683 should eliminate any
possibility of losing an entire mirrored or raidz pool.
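
A minimal sketch of that setting, with a placeholder dataset name:

# zfs set copies=2 tank/export/home
# zfs get copies tank/export/home

As with other ZFS properties, copies=2 only applies to blocks written
after it is set, so existing files have to be rewritten (or restored
from a send stream) before they gain the extra copy.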

So, it seem to me ZFS has a definite dependency on ECC for reliable
operation. However, for non-commercial uses (i.e., less than an
hour or so a day of writes) the probability of losing a file is
fairly small and can be mitigated still further by setting copies=2.
But to eliminate the possibility entirely, you must have ECC. You
should also make sure that the buses have at least parity if not
ECC and that this is actually checked - maybe Richard can comment
on this since I believe he thinks this is a more likely source
of errors.

HTH -- Frank







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-23 Thread Frank Middleton

On 07/21/09 01:21 PM, Richard Elling wrote:


I never win the lottery either :-)


Let's see. Your chance of winning a 49 ball lottery is apparently
around 1 in 14*10^6, although it's much better than that because of
submatches (smaller payoffs for matches on less than 6 balls).

There are about 32*10^6 seconds in a year. If ZFS saves its writes
for 30 seconds and batches them out, that means 1 write leaves the
buffer exposed for roughly one millionth of a year. If you have 4GB
of memory, you might get 50 errors a year, but you say ZFS uses only
1/10 of this for writes, so that memory could see 5 errors/year. If
your single write was 1/70th of that (say around 6 MB), your chance
of a hit is around (5/70)*10^-6, or 1 in 14*10^6, so you are correct!

So if you do one 6MB write/year, your chances of a hit in a year are
about the same as that of winning a grand slam lottery. Hopefully
not every hit will trash a file or pool, but odds are that you'll
do many more writes than that, so on the whole I think a ZFS hit
is quite a bit more likely than winning the lottery each year :-).
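
The same back-of-envelope arithmetic, spelled out (the inputs are the
assumptions above, not measurements):

# awk 'BEGIN {
    errs_per_year  = 5               # bit flips/year in the memory used for writes
    write_fraction = 1 / 70          # one ~6MB write as a fraction of that memory
    exposure = 30 / (365*24*3600)    # 30 seconds in the buffer, as a fraction of a year
    p = errs_per_year * write_fraction * exposure
    printf "per-write odds: %.1e, about 1 in %.1f million\n", p, 1/(p*1e6)
  }'

With those numbers p comes out near 7e-08, i.e. roughly 1 in 14-15
million per write, which is where the lottery comparison comes from.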

Conversely, if you average one big write every 3 minutes or so (20%
occupancy), odds are almost certain that you'll get one hit a year.
So some SOHO users who do far fewer writes won't see any hits (say)
over a 5 year period. But some will, and they will be most unhappy --
calculate your odds and then make a decision! I daresay the PC
makers have done this calculation, which is why PCs don't have ECC,
and hence IMO make for insufficiently reliable servers.

Conclusions from what I've gleaned from all the discussions here:
if you are too cheap to opt for mirroring, your best bet is to
disable checksumming and set copies=2. If you mirror but don't
have ECC then at least set copies=2 and consider disabling checksums.
Actually, set copies=2 regardless, so that you have some redundancy
if one half of the mirror fails and you have a 10 hour resilver,
in which time you could easily get a (real) disk read error.

It seems to me some vendor is going to cotton onto the SOHO server
problem and make a bundle at the right price point. Sun's offerings
seem unfortunately mostly overkill for the SOHO market, although the
X4140 looks rather interesting... Shame there aren't any entry
level SPARCs any more :-(. Now what would doctors' front offices do
if they couldn't blame the computer for being down all the time?
 

It is quite simple -- ZFS sent the flush command and VirtualBox
ignored it. Therefore the bits on the persistent store are consistent.


But even on the most majestic of hardware, a flush command could be
lost, could it not? An obvious case in point is ZFS over iscsi and
a router glitch. But the discussion seems to be moot since CR
6667683 is being addressed. Now about those writes to mirrored disks :)

Cheers -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-20 Thread Frank Middleton

On 07/19/09 06:10 PM, Richard Elling wrote:


Not that bad. Uncommitted ZFS data in memory does not tend to
live that long. Writes are generally out to media in 30 seconds.


Yes, but memory hits are instantaneous. On a reasonably busy
system there may be buffers in queue all the time. You may have
a buffer in memory for 100uS but it only takes 1nS for that buffer
to be clobbered. If that happened to be metadata about to be written
to both sides of a mirror then you are toast. Good thing this
never happens, right :-)
 

Beware, if you go down this path of thought for very long, you'll soon
be afraid to get out of bed in the morning... wait... most people actually
die in beds, so perhaps you'll be afraid to go to bed instead :-)


Not at all. As with any rational business, my servers all have ECC,
and getting up and out isn't a problem :-). Maybe I've had too many
disks go bad, so I have ECC, mirrors, and backup to a system with
ECC and mirrors (and copies=2, as well). Maybe I've read too many
of your excellent blogs :-).


Sun doesn't even sell machines without ECC. There's a reason for that.



Yes, but all of the discussions in this thread can be classified as
systems engineering problems, not product design problems.


Not sure I follow. We've had this discussion before. OSOL+ZFS lets
you build enterprise class systems on cheap hardware that has errors.
ZFS gives the illusion of being fragile because it, uniquely, reports
these errors. Running OSOL as a VM in VirtualBox using MSWanything
as a host is a bit like building on sand, but there's nothing in
documentation anywhere to even warn folks that they shouldn't rely
on software to get them out of trouble on cheap hardware. ECC is
just one (but essential) part of that.

On 07/19/09 08:29 PM, David Magda wrote:


It's a nice-to-have, but at some point we're getting into the tinfoil
hat-equivalent of data protection.


But it is going to happen! Sun sells only machines with ECC because
that is the only way to ensure reliability. Someone who spends weeks
building a media server at home isn't going to be happy if they lose
one media file let alone a whole pool. At least they should be warned
that without ECC at some point they will lose files. I'm not convinced
that there is any reasonable scenario for losing an entire pool though,
which was the original complaint in this thread.

Even trusty old SPARCs occasionally hang without a panic (in my
experience especially when a disk is about to go bad). If this
happens, and you have to power cycle because even stop-A doesn't
respond, are you all saying that there is a risk of losing a pool
at that point? Surely the whole point of a journalled file system
is that it is pretty much proof against any catastrophe, even the
one described initially.

There have been a couple of (to me) unconvincing explanations of
how this pool was lost. Surely if there is a mechanism whereby
unflushed i/os can cause fatal metadata corruption, this should
be a high priority bug since this can happen on /any/ hardware; it
is just more likely if the foundations are shaky, so the explanation
must require more than that if it isn't a bug.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-19 Thread Frank Middleton

On 07/19/09 05:00 AM, dick hoogendijk wrote:


(i.e. non ECC memory should work fine!) / mirroring is a -must- !


Yes, mirroring is a must, although it doesn't help much if you
have memory errors (see several other threads on this topic):

http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction
 
Tests give widely varying error rates, but about 10^-12
error/bit·h is typical, roughly one bit error, per month, per
gigabyte of memory.

That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS
hit, that's one/year per user on average. Some get more, some get
less. That sounds like pretty bad odds...

In most computers used for serious scientific or financial computing
and as servers, ECC is the rule rather than the exception, as can be
seen by examining manufacturers' specifications. Sun doesn't even
sell machines without ECC. There's a reason for that.

IMO you'd be nuts to run ZFS on a machine without ECC unless
you don't care about losing some or all of the data. Having
said that, we have yet to lose an entire pool - this is pretty
hard to do! I should add that since setting copies=2 and forcing
the files to be copied, there have been no more unrecoverable
errors on a particularly low end machine that was plagued with
them even with mirrors (and a UPS with a bad battery :-) ).
 
On 19-Jul-09, at 7:12 AM, Russel wrote:



As this was not clear to me. I use VB like others use vmware
etc to run solaris because its the ONLY way I can,


Given that PC hardware is so cheap these days (used SPARCS
even cheaper), surely it makes far more sense to build a nice
robust OSOL/ZFS based file server *with* ECC. Then you can use
iscsi for your VirtualBox VMs and solve all kinds of interesting
problems. But you still need to do backups. My solution for
that is to replicate the server and backup to it using zfs
send/recv. If a disk fails, you switch to the backup and no
worries about the second disk of the mirror failing during a
resilver.  A small price to pay for peace of mind.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] no pool_props for OpenSolaris 2009.06 with old SPARC hardware

2009-06-12 Thread Frank Middleton

On 06/03/09 09:10 PM, Aurélien Larcher wrote:


PS: for the record I roughly followed the steps of this blog entry =  
http://blogs.sun.com/edp/entry/moving_from_nevada_and_live


Thanks for posting this link! Building pkg with gcc 4.3.2 was an
interesting exercise, but it worked, with the additional step of
making the packages and pkgadding them. Curious as to why pkg
isn't available as a pkgadd package. Is there any reason why
someone shouldn't make them available for download? It would
make it much less painful for those of us who are OBP version
deprived - but maybe that's the point :-)

During the install cycle, ran into this annoyance (doubtless this
is documented somewhere):

# zpool create rpool c2t2d0
creates a good rpool that can be exported and imported. But it
seems to create an EFI label, and, as documented, attempting to boot
results in a bad magic number error. Why does zpool silently create
an apparently useless disk configuration for a root pool? Anyway,
it was a good opportunity to test zfs send/recv of a root pool (it
worked like a charm).
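
For the record, the root pool move was nothing exotic; roughly along
these lines (names are placeholders, and the target slice needs an SMI
label rather than EFI):

# zfs snapshot -r rpool@move
# zfs send -R rpool@move | zfs recv -Fd rpool2
# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c2t2d0s0

The installboot step writes the SPARC ZFS boot block onto the new slice
so OBP has something to boot from.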

Using format -e to relabel the disk so that slice 0 and slice 2
both have the whole disk resulted in this odd problem:

# zpool create -f  rpool c2t2d0s0
# zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
rpool  18.6G  73.5K  18.6G     0%  ONLINE  -
space  1.36T   294G  1.07T    21%  ONLINE  -
# zpool export rpool
# zpool import rpool
cannot import 'rpool': no such pool available
# zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
space  1.36T   294G  1.07T    21%  ONLINE  -

# zdb -l /dev/dsk/c2t2d0s0
lists 3 perfectly good looking labels.
Format says:
...
selecting c2t2d0
[disk formatted]
/dev/dsk/c2t2d0s0 is part of active ZFS pool rpool. Please see zpool(1M).
/dev/dsk/c2t2d0s2 is part of active ZFS pool rpool. Please see zpool(1M).

However this disk boots ZFS OpenSolaris just fine and this inability to
import an exported pool isn't a problem. Just wondering if any ZFS guru
had a comment about it. (This is with snv103 on SPARC). FWIW this is
an old ide drive connected to a sas controller via a sata/pata adapter...

Cheers -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rpool mirroring

2009-06-04 Thread Frank Middleton

On 06/04/09 06:44 PM, cindy.swearin...@sun.com wrote:

Hi Noz,

This problem was reported recently and this bug was filed:

6844090 zfs should be able to mirror to a smaller disk


Is this filed on bugs or defects? I had the exact same problem,
and it turned out to be a rounding error in Solaris format/fdisk.
The only way I could fix it was to use Linux (well, Fedora) sfdisk
to make both partitions exactly the same number of bytes. The
alternates partition seems to be hard wired on older disks  and
AFAIK there's no way to use that space. sfdisk is on the Fedora
live CD if you don't have a handy Linux system to get it from.
BTW the disks were nominally the same size but had different
geometries.
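
Roughly what that looked like from the Fedora live CD (device names are
whatever Linux assigned; sda/sdb here are placeholders):

# sfdisk -d /dev/sda > good.layout
# sfdisk /dev/sdb < good.layout

sfdisk -d dumps the partition table in sectors, so replaying it on the
second disk gives byte-identical partitions regardless of what geometry
the two drives report.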

Since I can't find 6844090, I have no idea what it says, but this
really seems to be a bug in fdisk, not ZFS, although I would think
ZFS should be able to mirror to a disk that is only a tiny bit
smaller...

-- Frank
 

I believe slice 9 (alternates) is an older method for providing
alternate disk blocks on x86 systems. Apparently, it can be removed by
using the format -e command. I haven't tried this though.

I don't think removing slice 9 will help though if these two disks
are not identical, hence the bug.

You can work around this problem by attaching a slightly larger disk.

Cindy


noz wrote:

I've been playing around with zfs root pool mirroring and came across
some problems.

I have no problems mirroring the root pool if I have both disks
attached during OpenSolaris installation (installer sees 2 disks).

The problem occurs when I only have one disk attached to the system
during install. After OpenSolaris installation completes, I attach the
second disk and try to create a mirror but I cannot.

Here are the steps I go through:
1) install OpenSolaris onto 16GB disk
2) after successful install, shutdown, and attach second disk (also 16GB)
3) fdisk -B
4) partition
5) zfs attach

Step 5 fails, giving a disk too small error.

What I noticed about the second disk is that it has a 9th partition
called alternates that takes up about 15MBs. This partition doesn't
exist in the first disk and I believe is what's causing the problem. I
can't figure out how to delete this partition and I don't know why
it's there. How do I mirror the root pool if I don't have both disks
attached during OpenSolaris installation? I realize I can just use a
disk larger than 16GBs, but that would be a waste.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-29 Thread Frank Middleton

On 05/26/09 13:07, Kjetil Torgrim Homme wrote:

also thank you, all ZFS developers, for your great job :-)


I'll second that! A great achievement - puts Solaris in a league of
its own, so much so, you'd want to run it on all your hardware,
however crappy the hardware might be ;-)

There are too many branches in this thread now. Going to summarize here
without responding to some of the less than helpful comments, although
death and taxes seems an ironic metaphor in the current climate :-)

In some ways this isn't a technical issue. This much maligned machine
and its ilk are running Solaris and ZFS quite happily and the users
are pleased with the stability and performance. But their applications
are running on machines (via xdmcp) with ECC, and ZFS mirror/raidz
doesn't have a problem there.

Picture a new convert with enthusiasm for ZFS, who has a less than
perfect PC which has otherwise been apparently quite reliable. Perhaps
it already has mirrored drives. He/she installs Solaris from the live
CD (and finds that the installer doesn't support mirroring). The
install fails, or worse, afterwards he/she loses that movie of
Aunt Minnie playing golf, because a checksum error makes the file
unrecoverable. This could be very frustrating and make the blogosphere
go crazy, especially if the PC passes every diagnostic. Be even
worse if a file is lost on a mirror.

Unrecoverable files on mirrored drives simply shouldn't happen. What
kind of hardware error (other than a rare bit flip) could conceivably
cause 5 out of 15 checksum errors to be unrecoverable when mirrored
during the write of around 20*10^10 bits? ZFS has both a larger spatial
and temporal footprint than other file systems, so it is slightly more
vulnerable to the once-a-month on average bit flip that will afflict
many a PC with 4GB of memory.

Perhaps someone with a statistical bent could step in and actually
calculate the probability of random errors, perhaps assuming that
half of available memory is used to queue writes, that there is
a 95% chance of one bit flip per month per 4GB, and there is a
(say) 10% duty cycle over say a period of a year. Alternatively,
the chance of a 1 bit flip over a period of 6 hours at a 100% duty
cycle repeated 1461 times (1461 installs per year at 100%). Seems
to me intuitively that 6 out of 1461 installs will fail due to an
unrecoverable checksum failure, but I'm not a statistician.

Multiply that failure rate by the number of Live CD installs
you expect over the next year (noting that *all* checksum
failures are unrecoverable without mirroring) and you'll count
quite a few frustrated would-be installers. Maybe ZFS without ECC
and no mirroring should disable checksumming by default - it
would be a little worse than UFS and ext3 (due to its larger
spatial and temporal footprints) but still provide all the other
great features.

Proposed RFE #1

Add option to make files with unrecoverable checksum failures readable
and to pass the best image possible back to the application. [How
much do you bet most folks would select this option?]

If both sides of the mirror could be read, it might help to diagnose
the problem, which obviously must be in the hardware somewhere. If
both images are identical, then it surely must be memory. If they
differ, then what could it be?

Proposed RFE#2

Add an option for machines with mirrored drives but without ECC to
double buffer and only then calculate the checksums (for those
who are reasonably paranoid about cosmic rays).

Proposed RFE#3 (or is this a bug report?)

Add diagnostics to the ZFS recv to help understand why a perfectly
good ZFS send can't be received when the same machine can successfully
compute a md5sum over the same stream. Even something like recv failed
at block nnn would help. For example, it seems to fail suspiciously
close to 2GB on a 32 bit machine.

Proposed RFE #4

Disable checksumming by default if no mirroring and no ECC is
detected. (Of course this assumes an install-to-mirror option).
If it could still checksum, but make it a warning instead of an
error, this could turn into a great feature for cheapskates with
machines that have no ECC.

---

#1 and #2 above could be fixed in the documentation. Random memory
bit flips can theoretically cause unrecoverable checksum failures,
even if the data is mirrored. Either disable the checksum feature
or only run ZFS on systems with ECC memory if you have any data you
don't want to risk losing [even with a 1 bit error].

None of this is meant as a criticism of ZFS, just suggestions to help
make a merely superb file system into the unbeatable one it should be.
(I suppose it really is a system of file systems, but ZFS it is...)

Regards -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Frank Middleton

On 05/23/09 10:21, Richard Elling wrote:

<preface>
This forum is littered with claims of zfs checksums are broken where
the root cause turned out to be faulty hardware or firmware in the data
path.
</preface>

I think that before you should speculate on a redesign, we should get to
the root cause.


The hardware is clearly misbehaving. No argument. The questions is - how
far out of reasonable behavior is it?

Redesign? I'm not sure I can conceive an architecture that would make
double buffering difficult to do. It is unclear how faulty hardware or
firmware could be responsible for such a low error rate (1 in 4*10^10).
Just asking if an option for machines with no ecc and their inevitable
memory errors is a reasonable thing to suggest in an RFE.


The checksum occurs in the pipeline prior to write to disk.
So if the data is damaged prior to checksum, then ZFS will
never know. Nor will UFS. Neither will be able to detect
this. In Solaris, if the damage is greater than the ability
of the memory system and CPU to detect or correct, then
even Solaris won't know. If the memory system or CPU
detects a problem, then Solaris fault management will kick
in and do something, preempting ZFS.


Exactly. My whole point. And without ECC there's no way of knowing.
But if the data is damaged /after/ checksum but /before/ write, then
you have a real problem...


Memory diagnostics just test memory. Disk diagnostics just test disks.


This is not completely accurate. Disk diagnostics also test the
data path. Memory tests also test the CPU. The difference is the
amount of test coverage for the subsystem.


Quite. But the disk diagnostic doesn't really test memory beyond what
it uses to run itself. Likewise it may not test the FPU, for example.


ZFS keeps disks pretty busy, so perhaps it loads the power supply
to the point where it heats up and memory glitches are more likely.


In general, for like configurations, ZFS won't keep a disk any more
busy than other file systems. In fact, because ZFS groups transactions,
it may create less activity than other file systems, such as UFS.


That's a point in its favor, although not really relevant. If the disks
are really busy they will load the PSU more and that could drag the supply
down which in turn might make errors occur that otherwise wouldn't.


Ironically, the Open Solaris installer does not allow for ZFS
mirroring at install time, one time where it might be really important!
Now that sounds like a more useful RFE, especially since it would be
relatively easy to implement. Anaconda does it...


This is not an accurate statement. The OpenSolaris installer does
support mirrored boot disks via the Automated Installer method.
http://dlc.sun.com/osol/docs/content/2008.11/AIinstall/index.html
You can also install Solaris 10 to mirrored root pools via JumpStart.


Talking about the live CD here. I prefer to install via jumpstart, but
AFAIK Open Solaris (indiana) isn't available as an installable DVD. But
most consumers are going to be installing from the live CD and they
are the ones with the low end hardware without ECC. There was recently
a suggestion on another thread about an RFE to add mirroring as an
install option.
 

I think a better test would be to md5 the file from all systems
and see if the md5 hashes are the same. If they are, then yes,
the finger would point more in the direction of ZFS. The
send/recv protocol hasn't changed in quite some time, but it
is arguably not as robust as it could be.


Thanks! md5 hash is exactly the kind of test I was looking for.
md5sum on SPARC 9ec4f7da41741b469fcd7cb8c5040564 (local ZFS)
md5sum on X86   9ec4f7da41741b469fcd7cb8c5040564 (remote NFS)


ZFS send/recv use fletcher4 for the checksums. ZFS uses fletcher2
for data (by default) and fletcher4 for metadata. The same fletcher
code is used. So if you believe fletcher4 is broken for send/recv,
how do you explain that it works for the metadata? Or does it?
There may be another failure mode at work here...
(see comment on scrubs at the end of this extended post)

[Did you forget the scrubs comment?]

Never said it was broken. I assume the same code is used for both SPARC
and X86, and it works fine on SPARC. It would seem that this machine
gets memory errors so often (even though it passes the Linux memory
diagnostic) that it can never get to the end of a 4GB recv stream. Odd
that it can do the md5sum, but as mentioned, perhaps doing the i/o
puts more strain on the machine and stresses it to where more memory
faults occur. I can't quite picture a software bug that would cause
random failures on specific hardware and I am happy to give ZFS the
benefit of the doubt.


It would have been nice if we were able to recover the contents of the
file; if you also know what was supposed to be there, you can diff and
then we can find out what was wrong.


file on those files resulted in bus error. Is there a way to actually
read a file reported by ZFS as unrecoverable to do just that (and to
separately retrieve the copy from each half of the mirror)?

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Frank Middleton

On 05/26/09 03:23, casper@sun.com wrote:


And where exactly do you get the second good copy of the data?


From the first. And if it is already bad, as noted previously, this
is no worse than the UFS/ext3 case. If you want total freedom from
this class of errors, use ECC.
 

If you copy the data you've just doubled your chance of using bad memory.
The original copy can be good or bad; the second copy cannot be better
than the first copy.


The whole point is that the memory isn't bad. About once a month, 4GB
of memory of any quality can experience 1 bit being flipped, perhaps
more or less often. If that bit happens to be in the checksummed buffer
then you'll get an unrecoverable error on a mirrored drive. And if I
understand correctly, ZFS keeps data in memory for a lot longer than
other file systems and uses more memory doing so. Good features, but
makes it more vulnerable to random bit flips. This is why decent
machines have ECC. To argue that ZFS should work reliably on machines
without ECC flies in the face of statistical reality and the reason
for ECC in the first place.


You can disable the checksums if you don't care.


But I do care. I'd like to know if my files have been corrupted, or at
least as much as possible. But there are huge classes of files for
which the odd flipped bit doesn't matter and the loss of which would
be very painful. Email archives and videos come to mind. An easy
workaround is to simply store all important stuff on a machine with
ECC. Problem solved...


One broken bit may not have caused serious damage; most things work.


Exactly.


Absolutely, memory diags are essential. And you certainly run them if
you see unexpected behaviour that has no other obvious cause.

Runs for days, as noted.


Doesn't prove anything.


Quite. But nonetheless, the unrecoverable errors did occur on mirrored
drives and it seems to defeat the whole purpose of mirroring, which is
AFAIK, keeping two independent copies of every file in case one gets lost.
Writing both images from one buffer appears to violate the premise. I
can think of two RFEs

1) Add an option to buffer writes on machines without ECC memory to
   avoid the possibility of random memory flips causing unrecoverable
   errors with mirrored drives.

2) An option to read files even if they have failed checksums.

1) could be fixed in the documentation - ZFS should be used with caution
on machines with no ECC since random bit flips can cause unrecoverable
checksum failures on mirrored drives. Or ZFS isn't supported on
machines with memory that has no ECC.

Disabling checksums is one way of working around 2). But it also disables
a cool feature. I suppose you could optionally change checksum failure
from an error to a warning, but ideally it would be file by file...

Ironically, I wonder if this is even a problem with raidz? But grotty
machines like these can't really support 3 or more internal drives...

Cheers -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-25 Thread Frank Middleton

On 05/22/09 21:08, Toby Thain wrote:

Yes, the important thing is to *detect* them, no system can run reliably
with bad memory, and that includes any system with ZFS. Doing nutty
things like calculating the checksum twice does not buy anything of
value here.


All memory is bad if it doesn't have ECC. There are only varying
degrees of badness. Calculating the checksum twice on its own would
be nutty, as you say, but doing so on a separate copy of the data
might prevent unrecoverable errors after writes to mirrored drives.
You can't detect memory errors if you don't have ECC. But you can
try to mitigate them. Without doing so makes ZFS less reliable than
the memory it is running on. The problem is that ZFS makes any file
with a bad checksum inaccessible, even if one really doesn't care
if the data has been corrupted. A workaround might be a way to allow
such files to be readable despite the bad checksum...

In hindsight I probably should have merely reported the problem and
left those with more knowledge to propose a solution. Oh well.
 

If the memory is this bad then applications will be dying all over the
place, compilers will be segfaulting, and databases will be writing bad
data even before it reaches ZFS.


But it isn't. Applications aren't dying, compilers are not segfaulting
(it was even possible to compile GCC 4.3.2 with the supplied gcc); gdm
is staying up for weeks at a time... And I wouldn't consider running a
non-trivial database application on a machine without ECC.


Absolutely, memory diags are essential. And you certainly run them if
you see unexpected behaviour that has no other obvious cause.


Runs for days, as noted.
 

Your logic is rather tortuous. If the hardware is that crappy then
there's not much ZFS can do about it.


Well, it could. For example, it could make copies of the data before
checksumming so that one memory hit doesn't result in an unrecoverable
file on a mirrored drive. Either that or there's a bug in ZFS. I am
more inclined to blame the memory, especially since the failure rate
isn't much higher than the expected rate as reported elsewhere.


Maybe this should be a new thread, but I suspect the following
proves that the problem must be memory, and that begs the question
as to how memory glitches can cause fatal ZFS checksum errors.


Of course they can; but they will also break anything else on the machine.


But they don't. Checksum errors are reasonable, but not unrecoverable
ones on mirrors.
 

How can a machine with bad memory work fine with ext3?


It does. It works fine with ZFS too. Just really annoying unrecoverable
files every now and then on mirrored drives. This shouldn't happen even
with lousy memory and wouldn't (doesn't) with ECC. If there was a way
to examine the files and their checksums, I would be surprised if they
were different (If they were, it would almost certainly be the controller
or the PCI bus itself causing the problem). But I speculate that it is
predictable memory hits.

-- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-22 Thread Frank Middleton

There have been a number of threads here on the reliability of ZFS in the
face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC)
hardware, but isn't it reasonable to expect it to run well on something
less well engineered? I am a real ZFS fan, and I'd hate to see folks
trash it because it appears to be unreliable.

In an attempt to bolster the proposition that there should at least be
an option to buffer the data before checksumming and writing, we've
been doing a lot of testing on presumed flaky (cheap) hardware, with a
peculiar result - see below.

On 04/21/09 12:16, casper@sun.com wrote:
 

And so what?  You can't write two different checksums; I mean, we're
mirroring the data so it MUST BE THE SAME.  (A different checksum would be
wrong: I don't think ZFS will allow different checksums for different
sides of a mirror)


Unless it does a read after write on each disk, how would it know that
the checksums are the same? If the data is damaged before the checksum
is calculated then it is no worse than the ufs/ext3 case. If data +
checksum is damaged whilst the (single) checksum is being calculated,
or after, then the file is already lost before it is even written!
There is a significant probability that this could occur on a machine
with no ecc. Evidently memory concerns /are/ an issue - this thread
http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests
including a memory diagnostic with the distribution CD (Fedora already
does so).

Memory diagnostics just test memory. Disk diagnostics just test disks.
ZFS keeps disks pretty busy, so perhaps it loads the power supply
to the point where it heats up and memory glitches are more likely.
It might also explain why errors don't really begin until ~15 minutes
after the busy time starts.

You might argue that this problem could only affect systems doing a
lot of disk i/o and such systems probably have ecc memory. But doing
an o/s install is the one time where a consumer grade computer does
a *lot* of disk i/o for quite a long time and is hence vulnerable.
Ironically,  the Open Solaris installer does not allow for ZFS
mirroring at install time, one time where it might be really important!
Now that sounds like a more useful RFE, especially since it would be
relatively easy to implement. Anaconda does it...

A Solaris install writes almost 4*10^10 bits. Quoting Wikipedia, look
at Cypress on ECC, see http://www.edn.com/article/CA454636.html.
Possibly, statistically likely random memory glitches could actually
explain the error rate that is occurring.


You are assuming that the error is the memory being modified after
computing the checksums; I would say that that is unlikely; I think it's a
bit more likely that the data gets corrupted when it's handled by the disk
controller or the disk itself.  (The data is continuously re-written by
the DRAM controller)


See below for an example where a checksum error occurs without the
disk subsystem being involved. There seems to be no other plausible
explanation other than an improbable bug in X86 ZFS itself.


It would have been nice if we were able to recover the contents of the
file; if you also know what was supposed to be there, you can diff and
then we can find out what was wrong.


file on those files resulted in bus error. Is there a way to actually
read a file reported by ZFS as unrecoverable to do just that (and to
separately retrieve the copy from each half of the mirror)?

Maybe this should be a new thread, but I suspect the following
proves that the problem must be memory, and that begs the question
as to how memory glitches can cause fatal ZFS checksum errors.

Here is the peculiar result (same machine)

After several attempts, succeeded in doing a zfs send to a file
on a NFS mounted ZFS file system on another machine (SPARC)
followed by a ZFS recv of that file there. But every attempt to
do a ZFS recv of the same snapshot (i.e., from NFS) on the local
machine (X86) has failed with a checksum mismatch. Obviously,
the file is good, since it was possible to do a zfs recv from it.
You can't blame the IDE drivers (or the bus, or the disks) for
this. Similarly, piping the snapshot though SSH fails, so you
can't blame NFS either. Something is happening to cause checksum
failures between after when the data is received by the PC and
when ZFS computes its checksums. Surely this is either a highly
repeatable memory glitch, or (most unlikely) a bug in X86 ZFS.
ZFS recv to another SPARC over SSH to the same physical disk
(accessed via a sata/pata adapter) was also successful.
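
For the record, the failing test was essentially (dataset, snapshot and
path names are placeholders):

# zfs send space/data@test > /net/sparcbox/export/test.stream
# zfs recv space/restore < /net/sparcbox/export/test.stream

The send always completes; the recv on the x86 box always fails with a
checksum mismatch, while the same stream received on the SPARC box,
whether from the NFS-mounted file or piped over ssh, restores cleanly.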

Does this prove that the data+checksum is being corrupted by
memory glitches? Both NFS and SSH over TCP/IP provide reliable
transport (via checksums), so the data is presumably received
correctly. ZFS then calculates its own checksum and it fails.
Oddly, it /always/ fails, but not at the same point, and far
into the stream when both disks have been very busy for a while.

It would be interesting to see if the 

Re: [zfs-discuss] Errors on mirrored drive

2009-04-21 Thread Frank Middleton

On 04/17/09 12:37, casper@sun.com wrote:

I'd like to submit an RFE suggesting that data + checksum be copied for
mirrored writes, but I won't waste anyone's time doing so unless you
think there is a point. One might argue that a machine this flaky should
be retired, but it is actually working quite well, and perhaps represents
not even the extreme of bad hardware that ZFS might encounter.


I think it's a stupid idea.  If you get two checksums, what can you do?
The second copy is most likely suspect and you double your chance that you
use bad memory.

Casper


If there were permanently bad memory locations, surely the diagnostics
would reveal them. Here's an interesting paper on memory errors:
http://www.ece.rochester.edu/~mihuang/PAPERS/hotdep07.pdf
Given the inevitability of relatively frequent transient memory
errors, I would think it behooves the file system to minimize the
effects of such errors. But I won't belabor the point except to
suggest that the cost of adding the suggested step would not be
very expensive (either to implement or run).

Memory diagnostics ran for a full 12 hours with no errors. Same goes
for both disks, using Solaris format/ana/verify. So far, after
creating 400,000 files, two files had permanent, apparently truly
unrecoverable errors and could not be read by anything.

Now it gets really funky. I detached one of the disks, and then found
it couldn't be reattached. Turns out there is a rounding problem with
Solaris fdisk (run from format) that can cause identical partitions on
identical disks to have different sizes. I used the Linux sfdisk
utility to repair the MBR and fix the Solaris2 partition sizes. Then
it was possible to reattach the disk. Unfortunately it wasn't possible
to boot from the result, but a reinstall went perfectly with no ZFS
errors being reported at all. So it appears that the problem may be
with the OpenSolaris fdisk. Is this worth reporting as a bug? It is
likely to be quite hard to reproduce...



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Errors on mirrored drive

2009-04-15 Thread Frank Middleton

Experimenting with OpenSolaris on an elderly PC with equally
elderly drives, zpool status shows errors after a pkg image-update
followed by a scrub. It is entirely possible that one of these
drives is flaky, but surely the whole point of a zfs mirror is
to avoid this? It seems unlikely that both drives failed at the
same time. Could someone explain how this can happen? Another
question (perhaps for the indiana folks) is how to restore these
files?

# zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h24m with 2 errors on Wed Apr 15 09:15:40 2009
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0    69
          mirror    ONLINE       0     0   144
            c3d0s0  ONLINE       0     0   145  128K repaired
            c3d1s0  ONLINE       0     0   151  168K repaired

errors: Permanent errors have been detected in the following files:

//lib/amd64/libsec.so.1
//lib/libdlpi.so.1
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-04-15 Thread Frank Middleton

On 04/15/09 14:30, Bob Friesenhahn wrote:

On Wed, 15 Apr 2009, Frank Middleton wrote:

zpool status shows errors after a pkg image-update
followed by a scrub.



If a corruption occured in the main memory, the backplane, or the disk
controller during the writes to these files, then the original data
written could be corrupted, even though you are using mirrors. If the
system experienced a physical shock, or power supply glitch, while the
data was written, then it could impact both drives.


Quite. Sounds like an architectural problem. This old machine probably
doesn't have ecc memory (AFAIK still rare on most PCs), but it is on
a serial UPS and isolated from shocks, and this has happened more
than once. These drives on this machine recently passed both the purge
and verify cycles (format/analyze) several times. Unless the data is
written to both drives from the same buffer and checksum (surely not!),
it is still unclear how it could get written to *both* drives with a
bad checksum. It looks like the files really are bad - neither of
them can be read - unless ZFS sensibly refuses to allow possibly good
files with bad checksums to be read (cannot read: I/O error).

BTW fmdump -ev doesn't seem to report any disk errors  at all.

So my question remains - even with the grottiest hardware, how can
several files get written with bad checksums to mirrored drives? ZFS
has so many cool features this would be easy to live with if there
was a reasonably simple way to get copies of these files to restore
them, short of getting the source and recompiling, or pkg uninstall
followed by install (if you can figure out which pkg(s) the bad files
are in), but it seems to defeat the purpose of software mirroring...
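
As a rough sketch of the detective work (exact pkg behaviour varies by
build, and the second command's argument is whatever package the search
actually reports, not a given):

# pkg search -l libdlpi.so.1
# pkg verify -v <package-from-the-search>

pkg search -l asks the installed image which package delivers the file,
and pkg verify at least confirms which delivered files no longer match
their manifests before resorting to uninstall/install.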







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] jigdo or lofi can crash nfs+zfs

2009-04-06 Thread Frank Middleton

These problems both occur when accessing a ZFS dataset from
Linux (FC10) via NFS.

Jigdo is a fairly new bit-torrent-like downloader. It is not
entirely bug free, and the one time I tried it, it recursively
downloaded one directory's worth until ZFS eventually sort
of died. It put all the disks into error, and even the (UFS)
root disks became unreadable. It took a reboot to free everything
up and some twiddling to get ZFS going again. I really don't
want to even try to reproduce this! With 4GB physical, 10GB swap,
and almost 3TB of raidz, it probably didn't run out of memory
or disk space. There wasn't room on the boot disks to save the
crash dump after halt, sync. Is there any point in submitting
a bug report, and if so, what would you call it?

Is there a practical way to force the crash dump to go to a ZFS
dataset instead of the UFS boot disks?
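
On builds new enough to support dump devices on zvols, something like
this should work (size and pool name are made up):

# zfs create -V 4g space/dump
# dumpadm -d /dev/zvol/dsk/space/dump

That points the system's dump device at a zvol in the big pool instead
of the cramped UFS boot disks.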

Also, there is a reasonably reproducible problem that causes
a panic doing an NFS network install when the DVD image is copied
to a ZFS dataset on snv103. I submitted this as a bug report to
bugs.opensolaris.org, and it was acknowledged, but then it vanished.
This is actually an NFS/ZFS problem, so maybe it was applied
against the wrong group, or perhaps this was a transition issue.
I wasn't able to get a crash core saved because there wasn't
enough space on the boot (UFS) disks. I do have the panic traces
for the 3 times I reproduced this. Should this be resubmitted to
defect.opensolaris.org, and if so, against what? This problem
doesn't happen of the DVD image is itself mounted via NFS, or
is on a UFS partition.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can this be done?

2009-03-29 Thread Frank Middleton

On 03/29/09 11:58, David Magda wrote:

On Mar 29, 2009, at 00:41, Michael Shadle wrote:


Well I might back up the more important stuff offsite. But in theory
it's all replaceable. Just would be a pain.


And what is the cost of the time to replace it versus the price of a
hard disk? Time ~ money.


So what is best if you get a 4th drive for a 3 drive raidz? Is it
better to keep it separate and use it for backups of the replaceable
data (perhaps on a different machine), as a hot spare, second parity,
or something else? Seems so un-green to have it spinning uselessly :-)
LTO-4 tape drives at $200 just for the media? I guess not...
 

There used to be a time when I like fiddling with computer parts. I now
have other, more productive ways of wasting my time. :)


Quite.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Growing a zpool mirror breaks on Adaptec 1205sa PCI

2009-03-28 Thread Frank Middleton

On 03/28/09 20:01, Harry Putnam wrote:


Finding a sataII card is proving to be very difficult.  The reason is
that I only have PCI no PCI express.  I haven't see a single one
listed as SATAII compatible and have spent a bit time googling.


It's even worse if you have an old SPARC system. We've had great results
with some LSI LOGIC SAS3041XL-S cards we got on E-Bay in conjunction
with 3x1.5TB Seagate drives, for 2.7TiB of raidz. The combination proved
faster than mirrored 10,000 RPM SCSI disks using UFS, in an unscientific
benchmark (bonnie). I don't think this LSI controller is SATA II, but
it has no problems with the 1.5TB Seagates; they are Sun branded cards
and worked right out of the box in 3.3V 66MHz PCI slots. 2.7TiB of raidz
for around $400. Amazing... And ZFS is just plain incredible - makes
every other file system look so antiquated :-)

Hope this helps -- Frank
 
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss