Re: [zfs-discuss] ZFS best practice for FreeBSD?

2012-10-12 Thread andy thomas

On Thu, 11 Oct 2012, Freddie Cash wrote:


On Thu, Oct 11, 2012 at 2:47 PM, andy thomas a...@time-domain.co.uk wrote:

According to a Sun document called something like 'ZFS best practice' I read
some time ago, best practice was to use the entire disk for ZFS and not to
partition or slice it in any way. Does this advice hold good for FreeBSD as
well?


Solaris disabled the disk cache if the disk was partitioned, thus the
recommendation to always use the entire disk with ZFS.

FreeBSD's GEOM architecture allows the disk cache to be enabled
whether you use the full disk or partition it.

Personally, I find it nicer to use GPT partitions on the disk.  That
way, you can start the partition at 1 MB (gpart add -b 2048 on 512B
disks, or gpart add -b 512 on 4K disks), leave a little wiggle-room
at the end of the disk, and use GPT labels to identify the disk (using
gpt/label-name for the device when adding to the pool).
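
For anyone wanting to see that end to end, a minimal sketch (the device
name da0, the label disk-a1 and the size are only placeholders - adjust
for the real hardware and leave some slack at the end of the disk):

   # Sketch only: da0 and the disk-a1 label are hypothetical
   gpart create -s gpt da0
   gpart add -b 2048 -s 930g -t freebsd-zfs -l disk-a1 da0
   # second disk prepared the same way with label disk-b1, then:
   zpool create tank mirror gpt/disk-a1 gpt/disk-b1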


This is apparently what had been done in this case:

gpart add -b 34 -s 6000000 -t freebsd-swap da0
gpart add -b 6000034 -s 1947525101 -t freebsd-zfs da1
gpart show

(stuff relating to a compact flash/SATA boot disk deleted)

=>        34  1953525101  da0  GPT  (932G)
          34     6000000    1  freebsd-swap  (2.9G)
     6000034  1947525101    2  freebsd-zfs   (929G)

=>        34  1953525101  da2  GPT  (932G)
          34     6000000    1  freebsd-swap  (2.9G)
     6000034  1947525101    2  freebsd-zfs   (929G)

=>        34  1953525101  da1  GPT  (932G)
          34     6000000    1  freebsd-swap  (2.9G)
     6000034  1947525101    2  freebsd-zfs   (929G)


Is this a good scheme? The server has 12 GB of memory (upped from 4 GB last 
year after it kept crashing with out-of-memory reports on the console 
screen) so I doubt the swap would actually be used very often. Running 
Bonnie++ on this pool comes up with some very good results for sequential 
block reads, but the latency of over 43 seconds for sequential block writes 
is terrible and is obviously impacting performance as a mail server, as shown 
here:


Version  1.96   --Sequential Output-- --Sequential Input- --Random-
Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hsl-main.hsl.of 24G    63  67 80584  20 70568  17   314  98 554226  60 410.1  13
Latency 77140us   43145ms   28872ms 171ms 212ms 232ms
Version  1.96   --Sequential Create-- Random Create
hsl-main.hsl.office -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16 19261  93 +++++ +++ 18491  97 21542  92 +++++ +++ 20691  94
Latency 15399us 488us 226us   27733us 103us 138us


The other issue with this server is that it needs to be rebooted every 8-10 
weeks, as disk I/O slows to a crawl over time and the server becomes 
unusable. After a reboot, it's fine again. I'm told ZFS 13 on FreeBSD 8.0 
has a lot of problems, so I was planning to rebuild the server with FreeBSD 
9.0 and ZFS 28, but I didn't want to make any basic design mistakes in 
doing this.



Another point about the Sun ZFS paper - it mentioned optimum performance
would be obtained with RAIDz pools if the number of disks was between 3 and
9. So I've always limited my pools to a maximum of 9 active disks plus
spares but the other day someone here was talking of seeing hundreds of
disks in a single pool! So what is the current advice for ZFS in Solaris and
FreeBSD?


You can have multiple disks in a vdev.  And you can have multiple vdevs in
a pool.  Thus, you can have hundreds of disks in a pool.  :)  Just
split the disks up into multiple vdevs, where each vdev has under 9
disks.  :)  For example, we have 25 disks in the following pool,
but only 6 disks in each vdev (plus log/cache):


[root@alphadrive ~]# zpool list -v
NAME               SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH    ALTROOT
storage           24.5T  20.7T  3.76T    84%  3.88x  DEGRADED  -
  raidz2          8.12T  6.78T  1.34T      -
    gpt/disk-a1       -      -      -      -
    gpt/disk-a2       -      -      -      -
    gpt/disk-a3       -      -      -      -
    gpt/disk-a4       -      -      -      -
    gpt/disk-a5       -      -      -      -
    gpt/disk-a6       -      -      -      -
  raidz2          5.44T  4.57T   888G      -
    gpt/disk-b1       -      -      -      -
    gpt/disk-b2       -      -      -      -
    gpt/disk-b3       -      -      -      -
    gpt/disk-b4       -      -      -      -
    gpt/disk-b5       -      -      -      -
    gpt/disk-b6       -      -      -      -
  raidz2          5.44T  4.60T   863G      -
    gpt/disk-c1

Re: [zfs-discuss] ZFS best practice for FreeBSD?

2012-10-12 Thread andy thomas

On Thu, 11 Oct 2012, Richard Elling wrote:


On Oct 11, 2012, at 2:58 PM, Phillip Wagstrom phillip.wagst...@gmail.com 
wrote:



On Oct 11, 2012, at 4:47 PM, andy thomas wrote:


According to a Sun document called something like 'ZFS best practice' I read 
some time ago, best practice was to use the entire disk for ZFS and not to 
partition or slice it in any way. Does this advice hold good for FreeBSD as 
well?


My understanding of the best practice was that Solaris, prior to ZFS, 
disabled the volatile disk cache.


This is not quite correct. If you use the whole disk, ZFS will attempt to
enable the write cache. To understand why, remember that UFS (and ext, by
default) can die a horrible death (+fsck) if there is a power outage and
cached data is not flushed to disk. So Sun shipped some disks with the write
cache disabled by default. Non-Sun disks are most often shipped with the
write cache enabled, and the most popular file systems (NTFS) properly issue
cache flush requests as needed (for the same reason ZFS issues cache flush
requests).


Out of interest, how do you enable the write cache on a disk? I recently 
replaced a failing Dell-branded disk on a Dell server with an HP-branded 
disk (both disks were the identical Seagate model) and on running the EFI 
diagnostics just to check all was well, it reported the write cache was 
disabled on the new HP disk but enabled on the remaining Dell disks in the 
server. I couldn't see any way of enabling the cache from the EFI diags so 
I left it as it was - probably not ideal.
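
For reference, the usual knobs for this look roughly as follows (device
names are placeholders, and the exact menus and mode pages depend on the
controller and disk, so treat this as a sketch rather than a recipe):

   # Solaris: expert mode of format, then cache -> write_cache -> enable
   format -e

   # FreeBSD: edit SCSI mode page 0x08 (Caching) and set WCE to 1
   camcontrol modepage da1 -m 0x08 -e

   # Linux, for comparison: set and persist the Write Cache Enable bit
   sdparm --set=WCE --save /dev/sdb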



With ZFS, the disk cache is used, but after every transaction a cache-flush 
command is issued to ensure that the data made it to the platters.


The write cache is flushed after uberblock updates and for ZIL writes. This
is important for uberblock updates, so the uberblock doesn't point to a
garbaged MOS. It is important for ZIL writes, because they must be guaranteed
to be on media before the write is acknowledged.


Thanks for the explanation, that all makes sense now.

Andy


 If you slice the disk, enabling the disk cache for the whole disk is
dangerous, because other file systems (meaning UFS) wouldn't do the cache
flush and there was a risk of data loss should the cache fail due to, say, a
power outage. I can't speak to how BSD deals with the disk cache.


I looked at a server earlier this week that was running FreeBSD 8.0 and had 2 x 
1 TB SAS disks in a ZFS 13 mirror with a third identical disk as a spare. Large 
file I/O throughput was OK but the mail jail it hosted had periods when it was 
very slow when accessing lots of small files. All three disks (the two in the 
ZFS mirror plus the spare) had been partitioned with gpart so that partition 1 
was a 6 GB swap and partition 2, of type 'freebsd-zfs', filled the rest of the 
disk. It was these second partitions that were part of the mirror.

This doesn't sound like a very good idea to me, as surely disk seeks for swap 
and for ZFS file I/O are bound to clash, aren't they?


It surely would make a slow, memory-starved, swapping system even 
slower.  :)


Another point about the Sun ZFS paper - it mentioned optimum performance would 
be obtained with RAIDz pools if the number of disks was between 3 and 9. So 
I've always limited my pools to a maximum of 9 active disks plus spares but the 
other day someone here was talking of seeing hundreds of disks in a single 
pool! So what is the current advice for ZFS in Solaris and FreeBSD?


That number was drives per vdev, not per pool.

-Phil
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--

richard.ell...@richardelling.com
+1-760-896-4422

-
Andy Thomas,
Time Domain Systems

Tel: +44 (0)7866 556626
Fax: +44 (0)20 8372 2582
http://www.time-domain.co.uk
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS best practice for FreeBSD?

2012-10-12 Thread Jim Klimov

2012-10-12 11:11, andy thomas wrote:

Great, thanks for the explanation! I didn't realise you could have a
sort of 'stacked pyramid' vdev/pool structure.


Well, you can - the layers are pool -> top-level VDEVs -> leaf
VDEVs, though on trivial pools like single-disk ones, the layers
kinda merge into one or two :) This should be described in the
manpage in greater detail.

So the pool stripes over Top-Level VDEVs (TLVDEVs), roughly by
round-robining whole logical blocks upon write, and then each
tlvdev, depending on its redundancy configuration, forms the sectors
to be written onto its component leaf vdevs (low-level disks,
partitions or slices, LUNs, files, etc.). Since full-stripe writes
are not required by ZFS, smaller blocks can consume fewer sectors
than there are leaves (disks) in a tlvdev, but this does not result
in wasted holes of space, nor in read-modify-write cycles like on
full-stripe RAID systems. If there's a free run of contiguous
logical addressing (roughly, striped across leaf vdevs within the
tlvdev) where the userdata sectors (after optional compression) plus
the redundancy sectors fit, it will be used.

I guess it is because of this contiguous addressing that a tlvdev
with raidzN cannot (currently) change the number of component disks,
and a pool cannot decrease the number of tlvdevs. If you add new
tlvdevs to an existing pool, the ZFS algorithms will try to put
more load on the emptier tlvdevs and balance the writes, although
according to discussions, this can still lead to imbalance and
performance problems on particular installations.
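
As a rough sketch of what that looks like in practice (pool and disk
names are made up), a pool is built from several tlvdevs at creation
time and can later be grown by adding another one, but never shrunk:

   # two raidz2 top-level vdevs at creation time
   zpool create tank raidz2 da0 da1 da2 da3 da4 da5 \
                     raidz2 da6 da7 da8 da9 da10 da11

   # later: add a third tlvdev (there is no way to remove it again)
   zpool add tank raidz2 da12 da13 da14 da15 da16 da17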

In fact, you can (although it is not recommended, for balancing
reasons) have tlvdevs of mixed size (like in Freddie's example) and
even of different structure (i.e. mixing raidz and mirrors, or even
single LUNs) by forcing the disk attachment.

Note however that the loss of a tlvdev kills your whole pool, so
don't stripe important data over single disks/LUNs ;)

And you don't have control over what gets written where, so you'd
also get an averaged performance mix of raidz and mirrors, with
unpredictable performance for any particular userdata block.

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS best practice for FreeBSD?

2012-10-12 Thread Peter Jeremy
On 2012-Oct-12 08:11:13 +0100, andy thomas a...@time-domain.co.uk wrote:
This is apparently what had been done in this case:

   gpart add -b 34 -s 6000000 -t freebsd-swap da0
   gpart add -b 6000034 -s 1947525101 -t freebsd-zfs da1
   gpart show

Assuming that you can be sure that you'll keep 512B-sector disks,
that's OK, but I'd recommend that you align both the swap and ZFS
partitions on at least 4KiB boundaries for future-proofing (i.e.
you can safely stick the same partition table onto a 4KiB disk
in future).
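
Something along these lines, for example (sizes and the device name are
only illustrative, and -a needs a reasonably recent gpart):

   gpart create -s gpt da0
   gpart add -a 4k -s 6g -t freebsd-swap -l swap0 da0
   gpart add -a 4k       -t freebsd-zfs  -l disk0 da0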

Is this a good scheme? The server has 12 G of memory (upped from 4 GB last 
year after it kept crashing with out of memory reports on the console 
screen) so I doubt the swap would actually be used very often.

Having enough swap to hold a crashdump is useful.  You might consider
using gmirror for swap redundancy (though 3-way is overkill).  (And
I'd strongly recommend against swapping to a zvol or ZFS - FreeBSD has
issues with that combination).
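
A sketch of the gmirror-for-swap idea (partition and label names are
made up; the geom_mirror module has to be loaded at boot):

   gmirror label -v -b round-robin swap da0p1 da1p1
   echo 'geom_mirror_load="YES"' >> /boot/loader.conf
   # /etc/fstab entry:  /dev/mirror/swap  none  swap  sw  0  0
   swapon /dev/mirror/swap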

The other issue with this server is it needs to be rebooted every 8-10 
weeks as disk I/O slows to a crawl over time and the server becomes 
unusable. After a reboot, it's fine again. I'm told ZFS 13 on FreeBSD 8.0 
has a lot of problems

Yes, it does - and your symptoms match one of the problems.  Does
top(1) report lots of inactive and cache memory and very little free
memory and a high kstat.zfs.misc.arcstats.memory_throttle_count once
I/O starts slowing down?
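
A quick way to check, roughly (the first sysctl is the one mentioned
above; the others are just the usual ARC size counters on FreeBSD):

   sysctl kstat.zfs.misc.arcstats.memory_throttle_count
   sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
   top -b | head -8    # compare Free against Inact/Cache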

 so I was planning to rebuild the server with FreeBSD 
9.0 and ZFS 28 but I didn't want to make any basic design mistakes in 
doing this.

I'd suggest you test 9.1-RC2 (just released) with a view to using 9.1,
rather than installing 9.0.

Since your questions are FreeBSD specific, you might prefer to ask on
the freebsd-fs list.

-- 
Peter Jeremy


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Building an On-Site and Off-Site ZFS server, replication question

2012-10-12 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Richard Elling [mailto:richard.ell...@gmail.com]
 
 Pedantically, a pool can be made in a file, so it works the same...

A pool can only be made in a file by a system that is able to create a pool.  The point is, his 
Point is, his receiving system runs linux and doesn't have any zfs; his 
receiving system is remote from his sending system, and it has been suggested 
that he might consider making an iscsi target available, so the sending system 
could zpool create and zfs receive directly into a file or device on the 
receiving system, but it doesn't seem as if that's going to be possible for him 
- he's expecting to transport the data over ssh.  So he's looking for a way to 
do a zfs receive on a linux system, transported over ssh.  Suggested answers 
so far include building a VM on the receiving side, to run openindiana (or 
whatever) or using zfs-fuse-linux. 

He is currently writing his zfs send datastream into a series of files on the 
receiving system, but this has a few disadvantages as compared to doing zfs 
receive on the receiving side.  Namely, increased risk of data loss and less 
granularity for restores.  For these reasons, it's been suggested to find a way 
of receiving via zfs receive and he's exploring the possibilities of how to 
improve upon this situation.  Namely, how to zfs receive on a remote linux 
system via ssh, instead of cat'ing or redirecting into a series of files.
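
In other words, the difference is roughly the following (hostnames,
pool and snapshot names are placeholders):

   # what he does today: the stream lands in a flat file on the Linux box
   zfs send -i tank/data@monday tank/data@tuesday | \
       ssh backup-host 'cat > /backups/data-tuesday.zfs'

   # what he'd like: the stream is received into a pool on the far side
   zfs send -i tank/data@monday tank/data@tuesday | \
       ssh backup-host zfs receive backup/data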

There, I think I've recapped the whole thread now.   ;-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS best practice for FreeBSD?

2012-10-12 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of andy thomas
 
 According to a Sun document called something like 'ZFS best practice' I
 read some time ago, best practice was to use the entire disk for ZFS and
 not to partition or slice it in any way. Does this advice hold good for
 FreeBSD as well?

I'm not going to address the FreeBSD question.  I know others have made some 
comments on the best practice on solaris, but here goes:

There are two reasons for the best practice of not partitioning.  And I 
disagree with them both.

First, by default, the on-disk write cache is disabled.  But if you use the 
whole disk in a zpool, then zfs enables the cache.  If you partition a disk and 
use it only for zpools, then you might want to manually enable the cache 
yourself.  This is a fairly straightforward scripting exercise.  You may use 
this if you want:  (No warranty, etc, it will probably destroy your system if 
you don't read and understand and rewrite it yourself before attempting to use 
it.)
https://dl.dropbox.com/u/543241/dedup%20tests/cachecontrol/cachecontrol.zip

If you do that, you'll need to re-enable the cache once on each boot (or zfs 
mount).

The second reason is that when you zpool import, it doesn't automatically 
check all the partitions of all the devices - it only scans devices.  So if you 
are forced to move your disks to a new system, you try to import, you get an 
error message, you panic and destroy your disks.  To overcome this problem, you 
just need to be good at remembering that the disks were partitioned - perhaps 
you should make a habit of partitioning *all* of your disks, so you'll *always* 
remember.  On zpool import, you need to specify the partitions to scan for 
zpools.  I believe this is the zpool import -d option.
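
For example (the paths are illustrative - /dev/gpt for GPT-labelled
partitions on FreeBSD, or whatever directory holds the partition device
nodes on your system):

   zpool import -d /dev/gpt          # scan for pools on labelled partitions
   zpool import -d /dev/gpt tank     # then import the one you want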

And finally - 

There are at least a couple of solid reasons *in favor* of partitioning.

#1  It seems common, at least to me, that I'll build a server with, let's say, 
12 disk slots, and we'll be using 2T disks or something like that.  The OS 
itself only takes like 30G, which means that if I don't partition, I'm wasting 
1.99T on each of the first two disks.  As a result, when installing the OS, I 
always partition rpool down to ~80G or 100G, and I will always add the second 
partitions of the first disks to the main data pool.
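
As a sketch, with Solaris-style device names that are entirely made up:
the installer puts rpool on a ~100G slice of the first two disks, and the
leftover ~1.9T slices later join the data pool as one more mirror vdev:

   zpool add data mirror c0t0d0s3 c0t1d0s3
   zpool status data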

#2  A long time ago, there was a bug where you couldn't attach a mirror unless 
the two devices had precisely the same geometry.  That was addressed in a 
bugfix a couple of years ago.  (I had a failed SSD mirror, and Sun shipped me a 
new SSD with a different firmware rev, and the size of the replacement device 
was off by 1 block, so I couldn't replace the failed SSD.)  After the bugfix, a 
mirror can be attached if there's a little bit of variation in the sizes of the 
two devices.  But it's not quite enough - as recently as 2 weeks ago, I tried 
to attach two devices that should have been precisely the same size, but 
couldn't because of a size difference.  One of them was a local device, and the 
other was an iSCSI target.  So I guess iSCSI must require a little bit of 
space, and that was enough to make the devices un-mirror-able without 
partitioning.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Building an On-Site and Off-Site ZFS server, replication question

2012-10-12 Thread Jim Klimov
On 2012-10-12 16:50, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:


 So he's looking for a way to do a zfs receive on a linux system, 
transported over ssh.  Suggested answers so far include building a VM on 
the receiving side, to run openindiana (or whatever) or using 
zfs-fuse-linux.



For completeness, if an iSCSI target on the receiving host or another
similar solution is implemented, the secure networking part of
zfs send over ssh (locally sending into a pool backed by an iSCSI
target) can be handled by a VPN, e.g. OpenVPN, which uses the same
OpenSSL encryption.
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Building an On-Site and Off-Site ZFS server, replication question

2012-10-12 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Jim, I'm trying to contact you off-list, but it doesn't seem to be working.  
Can you please contact me off-list?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS best practice for FreeBSD?

2012-10-12 Thread Freddie Cash
On Fri, Oct 12, 2012 at 3:28 AM, Jim Klimov jimkli...@cos.ru wrote:
 In fact, you can (although not recommended due to balancing reasons)
 have tlvdevs of mixed size (like in Freddie's example) and even of
 different structure (i.e. mixing raidz and mirrors or even single
 LUNs) by forcing the disk attachment.

My example shows 4 raidz2 vdevs, with each vdev having 6 disks, along
with a log vdev and a cache vdev.  I'm not sure where you're seeing an
imbalance.  Maybe it's because the pool is currently resilvering a
drive, thus making it look like one of the vdevs has 7 drives?

My home file server ran with mixed vdevs for a while (a 2-disk IDE
mirror vdev with a 3-disk SATA raidz1 vdev), as it was built using
scrounged parts.

But all my work file servers have matched vdevs.

-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Building an On-Site and Off-Site ZFS server, replication question

2012-10-12 Thread Richard Elling
On Oct 12, 2012, at 5:50 AM, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: Richard Elling [mailto:richard.ell...@gmail.com]
 
 Pedantically, a pool can be made in a file, so it works the same...
 
 Pool can only be made in a file, by a system that is able to create a pool.  

You can't send a pool; you can only send a dataset. Whether you receive the 
dataset into a pool or a file is a minor nit - the send stream itself is 
consistent.

 Point is, his receiving system runs linux and doesn't have any zfs; his 
 receiving system is remote from his sending system, and it has been suggested 
 that he might consider making an iscsi target available, so the sending 
 system could zpool create and zfs receive directly into a file or device 
 on the receiving system, but it doesn't seem as if that's going to be 
 possible for him - he's expecting to transport the data over ssh.  So he's 
 looking for a way to do a zfs receive on a linux system, transported over 
 ssh.  Suggested answers so far include building a VM on the receiving side, 
 to run openindiana (or whatever) or using zfs-fuse-linux. 
 
 He is currently writing his zfs send datastream into a series of files on 
 the receiving system, but this has a few disadvantages as compared to doing 
 zfs receive on the receiving side.  Namely, increased risk of data loss and 
 less granularity for restores.  For these reasons, it's been suggested to 
 find a way of receiving via zfs receive and he's exploring the 
 possibilities of how to improve upon this situation.  Namely, how to zfs 
 receive on a remote linux system via ssh, instead of cat'ing or redirecting 
 into a series of files.
 
 There, I think I've recapped the whole thread now.   ;-)


Yep, and cat works fine.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS best practice for FreeBSD?

2012-10-12 Thread Ian Collins
On 10/13/12 02:12, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

There are at least a couple of solid reasons *in favor* of partitioning.

#1  It seems common, at least to me, that I'll build a server with let's say, 
12 disk slots, and we'll be using 2T disks or something like that.  The OS 
itself only takes like 30G which means if I don't partition, I'm wasting 1.99T 
on each of the first two disks.  As a result, when installing the OS, I always 
partition rpool down to ~80G or 100G, and I will always add the second 
partitions of the first disks to the main data pool.


How do you provision a spare in that situation?

--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss