[zfs-discuss] zpool scrub bad block list

2011-11-08 Thread Didier Rebeix
Hi list,

    From the ZFS documentation it is unclear to me whether a zpool
scrub will blacklist any bad blocks it finds so that they won't be
used anymore.

I know NetApp's WAFL scrub reallocates bad blocks and marks them as
unusable. Does ZFS have this kind of strategy?

Thanks.

-- 
Didier


Re: [zfs-discuss] zpool scrub bad block list

2011-11-08 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Didier Rebeix
 
  From the ZFS documentation it is unclear to me whether a zpool
 scrub will blacklist any bad blocks it finds so that they won't be used
 anymore.

If there are any physically bad blocks, such that the hardware (hard disk)
will return an error every time that block is used, then the disk should be
replaced.  All disks have a certain amount of error detection/correction
built in, and they remap bad blocks internally, silently, behind the scenes,
transparently to the OS.  So if any blocks are regularly reported bad to the
OS, it means there is a growing problem inside the disk.  Offline the disk
and replace it.

It is OK to get an occasional cksum error, say once a year, because the
occasional cksum error will be re-read, and as long as the data is correct
the second time there is no problem.
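
For reference, this is roughly how you would check and act on those
counters (a minimal sketch; the pool name "tank" and the device names are
only placeholders):

  zpool status -v tank                # per-device READ/WRITE/CKSUM error counters
  zpool clear tank c1t3d0             # reset the counters once you've investigated
  zpool offline tank c1t3d0           # take a suspect disk out of service
  zpool replace tank c1t3d0 c1t4d0    # resilver onto a replacement disk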



Re: [zfs-discuss] zpool scrub bad block list

2011-11-08 Thread Andrew Gabriel
ZFS detects far more errors than traditional filesystems, which will simply 
miss them. This means that many of the possible causes of those errors will 
be something other than a genuinely bad block on the disk. As Edward said, 
the disk firmware should automatically remap genuinely bad blocks, so if ZFS 
blacklisted them too, we would merely avoid reusing the remapped block, which 
is probably fine by then. For other errors, there's nothing wrong with the 
physical block on the disk - the cause is firmware, a driver, cache 
corruption, or something else, so blacklisting the block will not solve the 
issue. Also, with some types of disk (SSDs), block numbers are moved around 
to achieve wear leveling, so blacklisting a block number won't stop you 
reusing that physical block.


--
Andrew Gabriel (from mobile)



Re: [zfs-discuss] zpool scrub bad block list

2011-11-08 Thread Didier Rebeix
Very interesting... I didn't know disk firmware was responsible for
automagically relocating bad blocks. Knowing this, it makes no sense for
a filesystem to try to deal with this kind of error.

For now, any disk with detected read/write errors will be discarded
from my filers and replaced...

Thanks!


-- 
Didier REBEIX
Universite de Bourgogne
Direction des Systèmes d'Information
BP 27877
21078 Dijon Cedex
Tel: +33 380395205




[zfs-discuss] zfs sync=disabled property

2011-11-08 Thread Evaldas Auryla

 Hi all,

I'm trying to evaluate the risks of running an NFS share of a zfs 
dataset with the sync=disabled property. The clients are VMware hosts in our 
environment and the server is a SunFire X4540 Thor system. The general 
recommendation is not to do this, but after testing performance with the 
default setting and with sync=disabled it's night and day, so it's really 
tempting to set sync=disabled! Thanks for any suggestions.


Best regards,



Re: [zfs-discuss] zfs sync=disabled property

2011-11-08 Thread Garrett D'Amore

On Nov 8, 2011, at 6:38 AM, Evaldas Auryla wrote:

 Hi all,
 
 I'm trying to evaluate the risks of running an NFS share of a zfs dataset 
 with the sync=disabled property. The clients are VMware hosts in our environment 
 and the server is a SunFire X4540 Thor system. The general recommendation 
 is not to do this, but after testing performance with the default setting and 
 with sync=disabled it's night and day, so it's really tempting to set 
 sync=disabled! Thanks for any suggestions.

The risk is that any changes your software clients expect to have been written 
to disk -- after having received confirmation that they were written -- might 
not actually be on disk if the server crashes or loses power for some reason.

You should consider a high performance low-latency SSD (doesn't have to be very 
big) as an SLOG… it will do a lot for your performance without having to give 
up the commit guarantees that you lose with sync=disabled.
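
For reference, attaching a log device is a one-line operation (a sketch
only; the pool name "tank" and the SSD device names are placeholders):

  zpool add tank log mirror c2t0d0 c2t1d0   # mirrored SLOG on two small, fast SSDs
  zpool status tank                         # the devices appear under a "logs" section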

Of course, if the data isn't precious to you, then running with sync=disabled 
is probably ok.  But if you love your data, don't do it.

- Garrett



Re: [zfs-discuss] zfs sync=disabled property

2011-11-08 Thread David Magda
On Tue, November 8, 2011 09:38, Evaldas Auryla wrote:
 I'm trying to evaluate the risks of running an NFS share of a zfs
 dataset with the sync=disabled property. The clients are VMware hosts in our
 environment and the server is a SunFire X4540 Thor system. The general
 recommendation is not to do this, but after testing performance with the
 default setting and with sync=disabled it's night and day, so it's really
 tempting to set sync=disabled! Thanks for any suggestions.

You may want to look into getting some good SSDs and attaching them as
(mirrored?) slog devices instead:


http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices

You probably want zpool version 22 or later to do this, as from that
point onward it becomes possible to remove the slog device(s) if
desired. Prior to that, once you add them you're stuck with them.
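
A minimal sketch of what that looks like (pool and device names are
placeholders; the exact vdev name to remove is whatever 'zpool status'
reports for the log mirror):

  zpool get version tank                    # confirm the pool version first
  zpool add tank log mirror c2t0d0 c2t1d0   # attach a mirrored slog
  zpool remove tank mirror-1                # later, remove it again if desired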

Some interesting benchmarks on offloading the ZIL can be found at:

https://blogs.oracle.com/brendan/entry/slog_screenshots

Your SSD(s) don't have to be that large either: by default the ZIL can be
at most 50% of RAM, so if your server has (say) 48 GB of RAM, then an
SSD larger than 24 GB would really be a bit of a waste (though you could
perhaps use the 'extra' space as L2ARC). Given that, it's probably better
value to get a faster SLC SSD that's smaller, rather than a 'cheaper' MLC
that's larger.

Past discussions on zfs-discuss have favourably mentioned devices based on
the SandForce SF-1500 and SF-2500/2600 chipsets (they often come with
supercaps and such). Intel's 311 could be another option.




Re: [zfs-discuss] zpool scrub bad block list

2011-11-08 Thread Paul Kraus
On Tue, Nov 8, 2011 at 9:14 AM, Didier Rebeix
didier.reb...@u-bourgogne.fr wrote:

 Very interesting... I didn't know disk firmware was responsible for
 automagically relocating bad blocks. Knowing this, it makes no sense for
 a filesystem to try to deal with this kind of error.

In the dark ages, hard drives came with bad block lists taped to
them so you could load them into the device driver for that drive. New
bad blocks would be mapped out by the device driver. All that
functionality was moved into the drive a long time ago (at least 10-15
years).

    Under Solaris, you can see the size of the bad block lists through
the format(1M) utility: FORMAT -> DEFECT -> PRIMARY will give you the
size of the list from the factory, and FORMAT -> DEFECT -> GROWN will
give you the blocks added since the drive left the factory. I tend to
open a support case to have a drive replaced if the GROWN list is much
above 0 or is growing.

Keep in mind that any type of hardware RAID should report back 0
for both to the OS.
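
A rough sketch of the interactive session being described (menu wording
may differ slightly between Solaris releases):

  format          # pick the disk from the menu
  format> defect
  defect> primary # reports the size of the factory (primary) defect list
  defect> grown   # reports the size of the grown defect list
  defect> quit
  format> quit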

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, RPI Players


[zfs-discuss] Couple of questions about ZFS on laptops

2011-11-08 Thread Jim Klimov

Hello all,

  I am thinking about a new laptop. I see that there are
a number of higher-performance models (incidentally, they
are also marketed as gamer ones) which offer two 2.5"
SATA bays and an SD flash card slot. Vendors usually
position the two-bay option as either "get lots of
capacity with RAID0 over two HDDs" or "get some capacity
and some performance by mixing one HDD with one SSD".
Some vendors go as far as suggesting the highest
performance with RAID0 over two SSDs.

  Now, if I were to use this for work with ZFS on an
OpenSolaris-descendant OS, and I like my data enough
to want it mirrored, but still I want an SSD performance
boost (i.e. to run VMs in real-time), I seem to have
a number of options:

1) Use a ZFS mirror of two SSDs
   - seems too pricey
2) Use a HDD with redundant data (copies=2 or mirroring
   over two partitions), and an SSD for L2ARC (+maybe ZIL)
   - possible unreliability if the only HDD breaks
3) Use a ZFS mirror of two HDDs
   - lowest performance
4) Use a ZFS mirror of two HDDs and an SD card for L2ARC.
   Perhaps add another built-in flash card with PCMCIA
   adapters for CF, etc.

Now, there are a couple of question points for me here.

One was raised in my recent questions about CF ports in a
Thumper. The general reply was that even high-performance
CF cards are aimed at linear read/write patterns and may
be slower than HDDs for the random access an L2ARC needs,
so flash cards may actually lower overall system
performance. I wonder if the same is true of SD cards,
and/or whether anyone has encountered (and can recommend)
CF/SD cards with good random access performance (better
than HDD random IOPS). Perhaps an extra I/O path can be
beneficial even if random performance is on the same
scale - the HDDs would have less work anyway and could
perform better at their other tasks?

On the other hand, how would current ZFS behave if someone
ejected an L2ARC device (flash card) and replaced it with
another, unsuspecting card, e.g. one from a photo camera?
Would ZFS automatically take over the new card as the
L2ARC device and kill the photos, or would the cache
simply be disabled, with no fatal implications for the
pool or for the other card? Ultimately, when the ex-L2ARC
card gets plugged back in, would ZFS automagically attach
it as the cache device, or does this have to be done
manually?


Second question regards single-HDD reliability: I can
do ZFS mirroring over two partitions/slices, or I can
configure copies=2 for the datasets. Either way I
think I can get protection from bad blocks of whatever
nature, as long as the spindle spins. Can these two
methods be considered equivalent, or is one preferred
(and for what reason)?


Also, how do other list readers weigh and settle these
trade-offs with their OpenSolaris-based laptops? ;)

Thanks,
//Jim Klimov



[zfs-discuss] Single-disk rpool with inconsistent checksums, import fails

2011-11-08 Thread Jim Klimov

Hello all,

I have an oi_148a PC with a single root disk, and recently
it has stopped booting - it hangs after the copyright
message whichever GRUB menu option I use.

Booting from an oi_148a LiveUSB I have had around since
installation, I ran some zdb traversals over the rpool
and made some zpool import attempts. The imports fail by
running the kernel out of RAM (as recently discussed on
the list in connection with Paul Kraus's problems).

However, in my current case the rpool has just 11.2 GB
allocated with 8.7 GB available. So almost all of it
could fit in the 8 GB of RAM in this computer (no more
can be placed into the motherboard). And I don't believe
there is so much metadata as to exhaust the RAM during
an import attempt.

I have also tried rollback imports with -F, but they
have also failed so far.

I am not ready to copypaste the zdb/zpool outputs here
(I have to get text files off that box), but in short:

1) zdb -bsvL -e rpool-GUID showed that there are some
problems:
* deferred free block count is not zero, although small
  (144 blocks amounting to 1.4Mbytes), and it remained at
  this value over several import attempts.
  I had removed a swap volume some time before the failure,
  so this might be its leftovers.
* It had also output this line:
block traversal size 11986202624 != alloc 11986203136 (unreachable 512)
  I believe this refers to the allocated data size in bytes,
  and that one sector (512b) is deemed unreachable. Is that
  so fatal?

2) zdb -bsvc -e rpool-GUID showed that there are some
consistency problems. Namely, five blocks had mismatching
checksums. They were named plain file blocks with no
further details (like what files they might be parts of).
But I hope that this means no metadata was hurt so far.

3) I've tried importing the pool in several ways (including
normal and rollback mounts, readonly and -n), but so far
all attempts have led to the computer hanging within a
minute (vmstat 1 shows free RAM plummeting towards the
zero mark).
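
For reference, the sort of invocations meant above (a sketch only; the
altroot path is a placeholder):

  zpool import -f -R /a rpool            # normal import under an altroot
  zpool import -f -F -n rpool            # dry-run rollback import (-F with -n)
  zpool import -f -o readonly=on rpool   # read-only import attempt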

I've tried preparing the system tunables as well:

:; echo aok/W 1 | mdb -kw
:; echo zfs_recover/W 1 | mdb -kw

and sometimes adding:
:; echo zfs_vdev_max_pending/W0t5 | mdb -kw
:; echo zfs_resilver_delay/W0t0 | mdb -kw
:; echo zfs_resilver_min_time_ms/W0t2 | mdb -kw
:; echo zfs_txg_synctime/W0t1 | mdb -kw


In this case I am not too hesitant to recreate the rpool
and reinstall the OS - it was mostly needed to serve the
separate data pool. However, this option is not always
acceptable, so I wonder if anything can be done to
repair an inconsistent non-redundant pool - at least to
make it importable again in order to evacuate some of the
settings and tunings that I've made over time.

//Jim



Re: [zfs-discuss] Couple of questions about ZFS on laptops

2011-11-08 Thread Bob Friesenhahn

On Tue, 8 Nov 2011, Jim Klimov wrote:


Second question regards single-HDD reliability: I can
do ZFS mirroring over two partitions/slices, or I can
configure copies=2 for the datasets. Either way I
think I can get protection from bad blocks of whatever
nature, as long as the spindle spins. Can these two
methods be considered equivalent, or is one preferred
(and for what reason)?


Using two partitions on the same disk seems to give you most of the 
headaches associated with more disks without much of the benefit.  If 
there is any minor issue, you will see zfs resilvering partitions and 
resilvering will be slow due to the drive heads flailing back and 
forth between partitions. There is also the issue that the block 
allocation is not likely to be very efficient in terms of head 
movement if two partitions are used.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Single-disk rpool with inconsistent checksums, import fails

2011-11-08 Thread Jim Klimov

2011-11-08 22:30, Jim Klimov wrote:

Hello all,

I have an oi_148a PC with a single root disk, and recently
it has stopped booting - it hangs after the copyright
message whichever GRUB menu option I use.


Thanks to my wife's sister, who is my hands and eyes near
the problematic PC, here's some ZDB output from this rpool:

# zpool import
  pool: rpool
id: 17995958177810353692
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

rpool   ONLINE
  c4t1d0s0  ONLINE


So here it is - a single-device rpool.
There are some on-disk errors, so some zdb walks fail:


root@openindiana:~# time zdb -bb -e 17995958177810353692

Traversing all blocks to verify nothing leaked ...
Assertion failed: ss->ss_start <= start (0x79e22600 <= 0x79e1dc00), file ../../../uts/common/fs/zfs/space_map.c, line 173

Abort (core dumped)

real    0m12.184s
user    0m0.367s
sys     0m0.474s

root@openindiana:~# time zdb -bsvc -e 17995958177810353692

Traversing all blocks to verify checksums and verify nothing leaked ...
Assertion failed: ss->ss_start <= start (0x79e22600 <= 0x79e1dc00), file ../../../uts/common/fs/zfs/space_map.c, line 173

Abort (core dumped)

real    0m12.019s
user    0m0.360s
sys     0m0.458s



However, both -bsvL and -bsvcL (the latter with checksum checks) do finish;
the results of the latter, more complete, run are listed below:



root@openindiana:~# time zdb -bsvcL -e 17995958177810353692

Traversing all blocks to verify checksums ...

zdb_blkptr_cb: Got error 50 reading 182, 19177, 0, 1 DVA[0]=0:a8c8e600:2 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique single size=2L/2P birth=82L/82P fill=1 cksum=3401f5fe522b:109ee10ba48ed38c:e7f49c220f7b8bc:ff405ef051b91e65 -- skipping
zdb_blkptr_cb: Got error 50 reading 182, 19202, 0, 1 DVA[0]=0:a9030a00:2 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique single size=2L/2P birth=82L/82P fill=1 cksum=11c4c738b0ba:7bb81bce3313913:8f85a7abf1b9e34:58e8746d63119393 -- skipping
zdb_blkptr_cb: Got error 50 reading 182, 24924, 0, 0 DVA[0]=0:b1aaec00:14a00 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique single size=14a00L/14a00P birth=85L/85P fill=1 cksum=270679cd905d:6119a969a134566:6f0f7da64c4d2d90:3ab86aa985abef02 -- skipping
zdb_blkptr_cb: Got error 50 reading 182, 24944, 0, 0 DVA[0]=0:b1cdf000:10800 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique single size=10800L/10800P birth=85L/85P fill=1 cksum=1ebb4d1ae9f5:3cf5f42afa9a332:757613fc2d2de7b3:5f197017333a4f89 -- skipping

zdb_blkptr_cb: Got error 50 reading 493, 947, 0, 165 DVA[0]=0:b3efc200:2 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique single size=2L/2P birth=26691L/26691P fill=1 cksum=2cdc2ae22d10:b33d31bcbc0d8da:f1571c9975e151b0:a037073594569635 -- skipping


Error counts:

errno  count
   50  5
block traversal size 11986202624 != alloc 11986203136 (unreachable 512)

bp count:  405927
bp logical:15030449664  avg:  37027
bp physical:   12995855872  avg:  32015 compression:   1.16
bp allocated:  13172434944  avg:  32450 compression:   1.14
bp deduped:1186232320ref1:  12767   deduplication:   1.09
SPA allocated: 11986203136 used: 56.17%

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      4K   12.0K   6.00K    8.00     0.00  object directory
     3  1.50K   1.50K   4.50K   1.50K    1.00     0.00  object array
     1    16K   1.50K   4.50K   4.50K   10.67     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
   197  24.2M   1.87M   5.61M   29.2K   12.92     0.04  bpobj
     -      -       -       -       -       -        -  bpobj header
     -      -       -       -       -       -        -  SPA space map header
 1.27K  6.79M   3.25M    9.8M   7.70K    2.09     0.08  SPA space map
     8   144K    144K    144K   18.0K    1.00     0.00  ZIL intent log
 26.6K   426M   91.1M    182M   6.86K    4.67     1.45  DMU dnode
    75   150K   39.0K   80.0K   1.07K    3.85     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
    23  12.0K   11.5K   34.5K   1.50K    1.04     0.00  DSL directory child map
    21  11.5K   10.5K   31.5K   1.50K    1.10     0.00  DSL dataset snap map
    49   707K   79.5K    239K   4.87K    8.89     0.00  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
  321K  12.0G   10.5G   10.5G   33.4K    1.14    85.46  ZFS plain file
 26.8K  41.5M   19.1M   38.2M   1.42K

Re: [zfs-discuss] Couple of questions about ZFS on laptops

2011-11-08 Thread Jim Klimov

2011-11-08 23:36, Bob Friesenhahn wrote:

On Tue, 8 Nov 2011, Jim Klimov wrote:


Second question regards single-HDD reliability: I can
do ZFS mirroring over two partitions/slices, or I can
configure copies=2 for the datasets. Either way I
think I can get protection from bad blocks of whatever
nature, as long as the spindle spins. Can these two
methods be considered equivalent, or is one preferred
(and for what reason)?


Using two partitions on the same disk seems to give you most of the
headaches associated with more disks without much of the benefit. If
there is any minor issue, you will see zfs resilvering partitions and
resilvering will be slow due to the drive heads flailing back and forth
between partitions. There is also the issue that the block allocation is
not likely to be very efficient in terms of head movement if two
partitions are used.


Thanks, Bob, I figured so...
And would copies=2 save me from problems of data loss and/or
inefficient resilvering? Does all required data and metadata
get duplicated this way, so that any broken sector can be
repaired? I read on this list recently that some metadata is
already copies=2 or =3. To what extent? Can the trunk of the
ZFS block tree be expected to always be protected, even on
one disk?

Thanks,
//Jim Klimov


Re: [zfs-discuss] Couple of questions about ZFS on laptops

2011-11-08 Thread Bob Friesenhahn

On Wed, 9 Nov 2011, Jim Klimov wrote:


Thanks, Bob, I figured so...
And would copies=2 save me from problems of data loss and/or
inefficient resilvering? Does all required data and metadata
get duplicated this way, so that any broken sector can be
repaired? I read on this list recently that some metadata is
already copies=2 or =3. To what extent? Can the trunk of the
ZFS block tree be expected to always be protected, even on
one disk?


With only one disk partition in a vdev, there will be no 
resilvering since there is nothing to resilver.  Metadata has always 
been stored with at least two copies.  It is always possible to lose 
the whole pool if the device does not work according to specification 
(or you drop the laptop on the ground).  Using copies=2 and doing a 
'zpool scrub' at least once after bulk data has been written should 
help avoid media read errors.  ZFS will still repair blocks which 
failed to read, as long as there is a redundant copy.
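
A minimal sketch of that workflow (dataset and pool names are just
examples; note that copies=2 only applies to data written after the
property is set):

  zfs set copies=2 rpool/export/home   # duplicate user data blocks on the single disk
  zpool scrub rpool                    # verify checksums; repairs from the extra copy
  zpool status -v rpool                # check the results of the scrub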


If you do want to increase reliability then you should mirror between 
disks, even if you feel that this will be slow.  It will still be 
faster (for reads) than using just one disk.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Solaris Based Systems Lock Up - Possibly ZFS/memory related?

2011-11-08 Thread Lachlan Mulcahy
Hi All,


On Wed, Nov 2, 2011 at 5:24 PM, Lachlan Mulcahy
lmulc...@marinsoftware.comwrote:

 Now trying another suggestion sent to me by a direct poster:

 *   Recommendation from Sun (Oracle) to work around a bug:
 *   6958068 - Nehalem deeper C-states cause erratic scheduling
 behavior
 set idle_cpu_prefer_mwait = 0
 set idle_cpu_no_deep_c = 1

 Was apparently the cause of a similar symptom for them and we are using
 Nehalem.

 At this point I'm running out of options, so it can't hurt to try it.


 So far the system has been running without any lock ups since very late
 Monday evening -- we're now almost 48 hours on.

 So far so good, but it's hard to be certain this is the solution, since I
 could never prove it was the root cause.

 For now I'm just continuing to test and build confidence level. More time
 will make me more confident. Maybe a week or so


We're now over a week running with C-states disabled and have not
experienced any further system lock ups. I am feeling much more confident
in this system now -- it will probably see at least another week or two in
addition to more load/QA testing and then be pushed into production.

Will update if I see the issue crop up again, but for anyone else
experiencing a similar symptom, I'd highly recommend trying this as a
solution.

So far it seems to have worked for us.
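
For anyone wanting to double-check that the settings actually took effect
after a reboot, something like this should work (a sketch; variable names
as in the /etc/system lines quoted above):

  echo "idle_cpu_prefer_mwait/D" | mdb -k   # expect 0
  echo "idle_cpu_no_deep_c/D" | mdb -k      # expect 1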

Regards,
-- 
Lachlan Mulcahy
Senior DBA,
Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office : +1 (415) 671 6080


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Jim Klimov

Hello all,

  A couple of months ago I wrote up some ideas about clustered
ZFS with shared storage, but the idea was generally disregarded
as not something to be done in the near term due to technological
difficulties.

  Recently I stumbled upon a Nexenta+Supermicro report [1] about
cluster-in-a-box with shared storage boasting an active-active
cluster with transparent failover. Now, I am not certain how
these two phrases fit in the same sentence, and maybe it is some
marketing-people mixup, but I have a couple of options:

1) The shared storage (all 16 disks are accessible to both
   motherboards) is split into two ZFS pools, each mounted
   by one node normally. If a node fails, another imports
   the pool and continues serving it.

2) All disks are aggregated into one pool, and one node
   serves it while another is in hot standby.

   Ideas (1) and (2) may possibly contradict the claim that
   the failover is seamless and transparent to clients.
   A pool import usually takes some time, maybe long if
   fixups are needed; and TCP sessions are likely to get
   broken. Still, maybe the clusterware solves this...


3) Nexenta did implement a shared ZFS pool with both nodes
   accessing all of the data instantly and cleanly.
   Can this be true? ;)


If this is not a deeply-kept trade secret, can the Nexenta
people elaborate in technical terms how this cluster works?

[1] http://www.nexenta.com/corp/sbb?gclid=CIzBg-aEqKwCFUK9zAodCSscsA

Thanks,
//Jim Klimov


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Daniel Carosone
On Wed, Nov 09, 2011 at 03:52:49AM +0400, Jim Klimov wrote:
   Recently I stumbled upon a Nexenta+Supermicro report [1] about
 cluster-in-a-box with shared storage boasting an active-active
 cluster with transparent failover. Now, I am not certain how
 these two phrases fit in the same sentence, and maybe it is some
 marketing-people mixup,

One way they need not be in conflict is if each host normally owns 8
disks and is active for those, and on standby for the other 8 disks.

Not sure if this is what the solution in question is doing, just
saying.

--
Dan.




Re: [zfs-discuss] zfs sync=disabled property

2011-11-08 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Evaldas Auryla
 
 I'm trying to evaluate the risks of running an NFS share of a zfs
 dataset with the sync=disabled property. The clients are VMware hosts in our
 environment and the server is a SunFire X4540 Thor system. The general
 recommendation is not to do this, but after testing performance with the
 default setting and with sync=disabled it's night and day, so it's really
 tempting to set sync=disabled! Thanks for any suggestions.

I know a lot of people will say don't do it, but that's only a partial
truth.  The real truth is:

At all times, if there's a server crash, ZFS will come back along at next
boot or mount, and the filesystem will be in a consistent state, that was
indeed a valid state which the filesystem actually passed through at some
moment in time.  So as long as all the applications you're running can
accept the possibility of going back in time as much as 30 sec, following
an ungraceful ZFS crash, then it's safe to disable ZIL (set sync=disabled).

In your case, you have vm's inside the ZFS filesystem.  In the event ZFS
crashes ungracefully, you don't want the VM disks to go back in time while
the VM's themselves are unaware anything like that happened.  If you run
with sync=disabled, you want to ensure your ZFS / NFS server doesn't come
back up automatically.  If ZFS crashes, you want to force the guest VM's to
crash.  Force power down the VM's, then bring up NFS, remount NFS, and
reboot the guest VM's.  All the guest VM's will have gone back in time, by
as much as 30 sec.  This is generally acceptable for things like web servers
and file servers and windows VMs in a virtualized desktop environment etc.
It's also acceptable for things running databases, as long as all the DB
clients can go back in time (reboot them whatever).  It is NOT acceptable if
you're processing credit card transactions, or if you're running a
mailserver and you're unwilling to silently drop any messages, or ... stuff
like that.

Long story short, if you're willing to allow your server and all of the
dependent clients to go back in time as much as 30 seconds, and you're
willing/able to reboot everything that depends on it, then you can accept
sync=disabled.

That's a lot of thinking, and a lot of faith or uncertainty.  And in your
case it's kind of inconvenient, needing to manually start your NFS share
every time you reboot your ZFS server.

The safer/easier thing to do is add dedicated log devices to the server
instead.  It's not as fast as running with ZIL disabled, but it's much
faster than running without a dedicated log.

When choosing a log device, focus on FAST.  You really don't care about
size.  Even 4G is usually all you need. 
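
For reference, the sync property is per-dataset, so you could also limit
the exposure to just the VM datastore (a sketch; dataset names are
placeholders):

  zfs set sync=disabled tank/vmstore   # only this dataset loses sync semantics
  zfs get sync tank/vmstore            # verify the setting
  zfs set sync=standard tank/vmstore   # revert to the default behaviour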



Re: [zfs-discuss] Couple of questions about ZFS on laptops

2011-11-08 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 1) Use a ZFS mirror of two SSDs
 - seems too pricey
 2) Use a HDD with redundant data (copies=2 or mirroring
 over two partitions), and an SSD for L2ARC (+maybe ZIL)
 - possible unreliability if the only HDD breaks
 3) Use a ZFS mirror of two HDDs
 - lowest performance
 4) Use a ZFS mirror of two HDDs and an SD card for L2ARC.
 Perhaps add another built-in flash card with PCMCIA
 adapters for CF, etc.

The performance of a SSD or flash drive or SD card is almost entirely
dependent on the robustness/versatility of the built-in controller circuit.
You can rest assured that no SD card and no USB device is going to have
performance even remotely close to a decent SSD, except under the conditions
that are specifically optimized for that device.  The manufacturers, of
course, will publish their maximum specs, and the real world usage of the
device might be an order of magnitude lower.

A little while back, I performed an experiment - I went out and bought the
best rated, most expensive USB3 flash drives I could find, and I benchmarked
them against the cheapest USB2 hard drives I could find.  The hard drives
won by a clear margin, like 4x to 8x faster, except when running large
sequential dd to/from the raw flash device on the first boot, in which
case the flash won by a small margin (like 10%).

Given your hardware limitations, the only way to go fast is to use a SSD,
and the only way to go fast with redundancy is to use a mirror of two SSD's.

If you don't go for the SSD's, then your HDD's will be the second fastest
option.  Do not put any SD card into the mix.  It will only hurt you.
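
On the earlier L2ARC question: cache devices can be added and dropped
online without risking the pool, so experimenting is cheap (a sketch; pool
and device names are placeholders):

  zpool add tank cache c3t0d0   # attach a cache (L2ARC) device
  zpool remove tank c3t0d0      # detach it again; only cached copies are lost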


 Second question regards single-HDD reliability: I can
 do ZFS mirroring over two partitions/slices, or I can
 configure copies=2 for the datasets. Either way I
 think I can get protection from bad blocks of whatever
 nature, as long as the spindle spins. Can these two
 methods be considered equivalent, or is one preferred
 (and for what reason)?

I would opt for the copies=2 method, because it's reconfigurable if you
want, and it's designed to work within a single pool, so it more closely
resembles your actual usage.  If you mirror across two partitions on the
same disk, there may be unintended performance consequences because nobody
expected you to do that when they wrote the code.


 Also, how do other list readers weigh and settle these
 trade-offs with their OpenSolaris-based laptops? ;)

I'm sorry to say, there is no ZFS-based OS and no laptop hardware that I
consider to be a reliable combination.  Of course I haven't tested them all,
but I don't believe in any of them because it's unintended, uncharted,
untested, unsupported.  I think you'll find the best support for this
subject on the openindiana mailing lists.

After Oracle acquired Sun, most of the home users and laptop users left the
opensolaris mailing lists in favor of the openindiana lists.  The people
who remain here are primarily focused on enterprise and servers.



Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Matt Breitbach
This is accomplished with the Nexenta HA cluster plugin.  The plugin is
written by RSF, and you can read more about it here:
http://www.high-availability.com/

You can do either option 1 or option 2 that you put forth.  There is some
failover time, but the latest version of Nexenta (3.1.1) includes some
additional tweaks that bring the failover time down significantly.
Depending on pool configuration and load, failover can be done in under 10
seconds, based on some of my internal testing.
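
Mechanically, the failover in option 1 amounts to the survivor doing a
forced import of the failed node's pool - conceptually something like the
following (a sketch only; the fencing, heartbeats and IP takeover are what
the plugin automates):

  # on the surviving node
  zpool import -f tank   # take over the pool the dead node had imported
  zfs share -a           # re-export its NFS/CIFS shares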

-Matt Breitbach





[zfs-discuss] Data distribution not even between vdevs

2011-11-08 Thread Ding Honghui
Hi list,

My zfs write performance is poor and I need your help.

I created the zpool with 2 raidz1 vdevs. When the space was about to be
used up, I added another 2 raidz1 vdevs to extend the zpool.
After some days the zpool was almost full, so I removed some old data.

But now, as shown below, the first 2 raidz1 vdevs' usage is about 78% and
the last 2 raidz1 vdevs' usage is about 93%.

I have this line in /etc/system:

set zfs:metaslab_df_free_pct=4

So the performance degradation will happen when vdev usage is above 90%.

All my files are small, about 150KB each.

Now the questions are:
1. Should I balance the data between the vdevs by copying the data and
removing the data that sits on the last 2 vdevs?
2. Is there any method to automatically re-balance the data?
or
Is there any better solution to this problem?
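
There is no built-in rebalancer, so the usual approach is to rewrite the
data, since newly written blocks favour the emptier vdevs. A sketch of
doing that with send/receive (dataset names are placeholders; it needs
enough free space for a second copy while it runs):

  zfs snapshot datapool/data@move
  zfs send -R datapool/data@move | zfs receive datapool/data-new   # rewritten copy spreads across all vdevs
  zfs rename datapool/data datapool/data-old
  zfs rename datapool/data-new datapool/data
  zfs destroy -r datapool/data-old                                 # frees the blocks on the full vdevs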

root@nas-01:~# zpool iostat -v
                                        capacity     operations    bandwidth
pool                                   used  avail   read  write   read  write
-------------------------------------  -----  -----  -----  -----  -----  -----
datapool                               21.3T  3.93T     26     96  81.4K  2.81M
  raidz1                               4.93T  1.39T      8     28  25.7K   708K
    c3t600221900085486703B2490FB009d0      -      -      3     10   216K   119K
    c3t600221900085486703B4490FB063d0      -      -      3     10   214K   119K
    c3t6002219000852889055F4CB79C10d0      -      -      3     10   214K   119K
    c3t600221900085486703B8490FB0FFd0      -      -      3     10   215K   119K
    c3t600221900085486703BA490FB14Fd0      -      -      3     10   215K   119K
    c3t6002219000852889041C490FAFA0d0      -      -      3     10   215K   119K
    c3t600221900085486703C0490FB27Dd0      -      -      3     10   214K   119K
  raidz1                               4.64T  1.67T      8     32  24.6K   581K
    c3t600221900085486703C2490FB2BFd0      -      -      3     10   224K  98.2K
    c3t6002219000852889041F490FAFD0d0      -      -      3     10   222K  98.2K
    c3t60022190008528890428490FB0D8d0      -      -      3     10   222K  98.2K
    c3t60022190008528890422490FB02Cd0      -      -      3     10   223K  98.3K
    c3t60022190008528890425490FB07Cd0      -      -      3     10   223K  98.3K
    c3t60022190008528890434490FB24Ed0      -      -      3     10   223K  98.3K
    c3t6002219000852889043949100968d0      -      -      3     10   224K  98.2K
  raidz1                               5.88T   447G      5     17  16.0K  67.7K
    c3t6002219000852889056B4CB79D66d0      -      -      3     12   215K  12.2K
    c3t600221900085486704B94CB79F91d0      -      -      3     12   216K  12.2K
    c3t600221900085486704BB4CB79FE1d0      -      -      3     12   214K  12.2K
    c3t600221900085486704BD4CB7A035d0      -      -      3     12   215K  12.2K
    c3t600221900085486704BF4CB7A0ABd0      -      -      3     12   216K  12.2K
    c3t6002219000852889055C4CB79BB8d0      -      -      3     12   214K  12.2K
    c3t600221900085486704C14CB7A0FDd0      -      -      3     12   215K  12.2K
  raidz1                               5.88T   441G      4      1  14.9K  12.4K
    c3t6002219000852889042B490FB124d0      -      -      1      1   131K  2.33K
    c3t600221900085486704C54CB7A199d0      -      -      1      1   132K  2.33K
    c3t600221900085486704C74CB7A1D5d0      -      -      1      1   130K  2.33K
    c3t600221900085288905594CB79B64d0      -      -      1      1   133K  2.33K
    c3t600221900085288905624CB79C86d0      -      -      1      1   132K  2.34K
    c3t600221900085288905654CB79CCCd0      -      -      1      1   131K  2.34K
    c3t600221900085288905684CB79D1Ed0      -      -      1      1   132K  2.33K
  c3t6B8AC6FF837605864DC9E9F1d0            0   928G      0     16    289  1.47M
-------------------------------------  -----  -----  -----  -----  -----  -----

root@nas-01:~#


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Daniel Carosone
On Wed, Nov 09, 2011 at 11:09:45AM +1100, Daniel Carosone wrote:
 On Wed, Nov 09, 2011 at 03:52:49AM +0400, Jim Klimov wrote:
Recently I stumbled upon a Nexenta+Supermicro report [1] about
  cluster-in-a-box with shared storage boasting an active-active
  cluster with transparent failover. Now, I am not certain how
  these two phrases fit in the same sentence, and maybe it is some
  marketing-people mixup,
 
 One way they need not be in conflict is if each host normally owns 8
 disks and is active for those, and on standby for the other 8 disks.

Which, now that I reread it more carefully, is your case 1. 

Sorry for the noise.

--
Dan.



Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Nico Williams
To some people active-active means all cluster members serve the
same filesystems.

To others active-active means all cluster members serve some
filesystems and can serve all filesystems ultimately by taking over
failed cluster members.

Nico
--