Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-09 Thread Darren J Moffat

On 01/08/12 18:21, Bob Friesenhahn wrote:

Something else to be aware of is that even if you don't have a dedicated
ZIL device, zfs will create a ZIL using devices in the main pool so


Terminology nit:  The log device is a SLOG.  Every ZFS dataset has a 
ZIL.  Where the ZIL writes for a given dataset go (to a SLOG or to main 
pool devices) is determined by a combination of factors, including (but 
not limited to) the presence of a SLOG device, the logbias property and 
the size of the data.
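
For anyone who wants to check where a given dataset's ZIL writes would go,
the relevant bits are easy to inspect. A quick sketch, assuming a pool named
tank and a dataset tank/fs (both hypothetical names):

# does the pool have a separate log (SLOG) device? look for a "logs" section
zpool status tank

# per-dataset knobs that influence ZIL placement and behaviour
zfs get logbias,sync tank/fs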


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-09 Thread Jim Klimov

2012-01-08 5:45, Richard Elling wrote:

I think you will see a tradeoff on the read side of the mixed read/write 
workload.
Sync writes have higher priority than reads so the order of I/O sent to the disk
will appear to be very random and not significantly coalesced. This is the
pathological worst case workload for a HDD.


I guess this is what I'm trying to combat when thinking
about a dedicated ZIL (SLOG device) in order to reduce
the pool's fragmentation. It is my understanding (which may
be wrong, and often is) that without a dedicated SLOG:

1) Sync writes will land on disk randomly, into the nearest
available blocks (relative to the disk heads), in order to
have them committed ASAP;

2) Coalesced writes (at TXG sync) may have intermixed
data and metadata blocks, of which the metadata may soon
expire due to later updates, snapshots or deletions
involving the blocks this metadata references.
If this is true, then after a while there will be many
available cheese-holes from expired metadata among the
larger data blocks.

3) Now, this might be further complicated (or relieved)
if the metadata blocks are stored in separate groupings
from the bulk user-data, which I don't know about yet.
In that case it would be easier for ZFS to prefetch
metadata from disk in one IO (as we discussed in another
thread), as well as to effectively reuse the small
cheese-holes from freed older metadata blocks.

---

If any of the above is true, then it is my blind
expectation that a dedicated ZIL/SLOG area would
decrease fragmentation, at least the kind caused by
sync writes of metadata (and possibly of data) into
the nearest HDD locations. Again, this is based on my
possibly wrong understanding that the blocks committed
to a SLOG would be neatly recommitted to the main pool
during a TXG close, as coalesced writes.

I do understand the argument that if the SLOG is
carved out of a dedicated area on the same HDD, then
this would in fact slow down writes by creating
more random IO and extra seeks.
But as a trade-off I hope for faster, more linear
reads - including pool import, scrubbing and ZDB
walks - and less fragmented free space.

Is there any truth to these words? ;)

Thanks,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-09 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
 
 To put things in proper perspective, with 128K filesystem blocks, the
 worst case file fragmentation as a percentage is 0.39%
 (100*1/((128*1024)/512)).  On a Microsoft Windows system, the
 defragger might suggest that defragmentation is not warranted for this
 percentage level.

I don't think that's correct...
Suppose you write a 1G file to disk.  It is a database store.  Now you start
running your db server.  It starts performing transactions all over the
place.  It overwrites the middle 4k of the file, and it overwrites 512b
somewhere else, and so on.  Since this is COW, each one of these little
writes in the middle of the file will actually get mapped to unused sectors
of disk.  Depending on how quickly they're happening, they may be aggregated
as writes...  But that's not going to help the sequential read speed of the
file, later when you stop your db server and try to sequentially copy your
file for backup purposes.

In the pathological worst case, you would write a file that takes up half of
the disk.  Then you would snapshot it, and overwrite it in random order,
using the smallest possible block size.  Now your disk is 100% full, and if
you read that file, you will be performing worst case random IO spanning 50%
of the total disk space.  Granted, this is not a very realistic case, but it
is the worst case, and it's really really really bad for read performance.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-09 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 1) Sync writes will land on disk randomly into nearest
 (to disk heads) available blocks, in order to have them
 committed ASAP;

This is true - but you need to make the distinction - if you don't have a
dedicated slog, and you haven't disabled the ZIL, then the sync writes you're
talking about land in dedicated ZIL blocks on the disk.  This is
write-only space; consider it temporary.  The only time it will ever be read
is after an ungraceful system reboot, when the system will scan these blocks
to see if anything is there.

As soon as the sync writes are written to the zil, they become async writes,
which are buffered in memory with all the other async writes, and they will
be written *again* into permanent storage in the main pool.  At that point,
the previously written copy in zil becomes irrelevant.


 If any of the above is true, then it is my blind
 expectation that a dedicated ZIL/SLOG area would
 decrease fragmentation at least due to sync writes

Sync writes to the ZIL aren't causing fragmentation, because they're only
temporary writes as long as they're in sync mode.  Then they become async,
and they will be aggregated with all the other async writes.

This isn't saying fragmentation doesn't happen.  It's just saying there's no
special relationship between sync mode and fragmentation.
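
For completeness, the per-dataset knobs involved here are the sync and
logbias properties. A rough sketch, with tank/db as a hypothetical dataset
(not a recommendation - sync=disabled in particular trades data safety for
speed):

# where do this dataset's sync writes currently go, and how are they biased?
zfs get sync,logbias tank/db

# treat sync writes like async ones (dangerous: acknowledged writes can be
# lost on power failure or crash)
zfs set sync=disabled tank/db

# bias toward throughput: don't use the slog for this dataset's ZIL blocks
zfs set logbias=throughput tank/db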

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-09 Thread Richard Elling
On Jan 9, 2012, at 5:44 AM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
 
 To put things in proper perspective, with 128K filesystem blocks, the
 worst case file fragmentation as a percentage is 0.39%
 (100*1/((128*1024)/512)).  On a Microsoft Windows system, the
 defragger might suggest that defragmentation is not warranted for this
 percentage level.
 
 I don't think that's correct...
 Suppose you write a 1G file to disk.  It is a database store.  Now you start
 running your db server.  It starts performing transactions all over the
 place.  It overwrites the middle 4k of the file, and it overwrites 512b
 somewhere else, and so on.  

It depends on the database, but many (e.g. Oracle database) are COW and
write in fixed block sizes, so your example does not apply.

 Since this is COW, each one of these little
 writes in the middle of the file will actually get mapped to unused sectors
 of disk.  Depending on how quickly they're happening, they may be aggregated
 as writes...  But that's not going to help the sequential read speed of the
 file, later when you stop your db server and try to sequentially copy your
 file for backup purposes.

Those who expect to get sequential performance out of HDDs usually end up
being sad :-( Interestingly, if you run Oracle database on top of ZFS on top of
SSDs, then you have COW over COW over COW. Now all we need is a bull! :-)
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-09 Thread John Martin

On 01/08/12 20:10, Jim Klimov wrote:


Is it true or false that: ZFS might skip the cache and
go to disks for streaming reads?


I don't believe this was ever suggested.  Instead, the question
is: if data is not already in the file system cache and a
large read is made from disk, should the file system
put this data into the cache?

BTW, I chose the term "streaming" to mean a subset
of "sequential" where the access pattern is sequential but
paced at what appear to be artificial time intervals.
The suggested pre-read of the entire file would
be a simple sequential read done as quickly
as the hardware allows.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-09 Thread Jim Klimov

Thanks for the replies; some more questions follow.

Your answers below seem to contradict each other somewhat.
Is it true that:
1) VDEV cache before b70 used to contain a full copy
   of prefetched disk contents,

2) VDEV cache since b70 analyzes the prefetched sectors
   and only keeps metadata blocks,

3) VDEV cache since b148 is disabled by default?

So in fact currently we only have file-level intelligent
prefetching?

On my older systems I ran kstat -p zfs:0:vdev_cache_stats
and saw hit/miss ratios ranging from 30% to 70%. On the oi_148a
box I do indeed see all zeros.

While I do understand the implications of VDEV-caching lots
of disks on systems with inadequate RAM, I tend to find this
feature useful on smaller systems - like home NASes. It is
essentially free in terms of mechanical seeks, as well as
in RAM (what is 60-100MB for a small box at home?), and any
nonzero hit ratio that speeds up the system seems justifiable ;)

I've tried playing with the options on my oi_148a LiveUSB
repair boot, and got varying results:

The VDEV cache is indeed disabled by default, but it can be enabled.
My system is scrubbing now, so it's got a few cache hits
(about 10%) right away.

root@openindiana:~# echo zfs_vdev_cache_size/W0t1000 | mdb -kw
zfs_vdev_cache_size:0   =   0x989680

root@openindiana:~# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:classmisc
zfs:0:vdev_cache_stats:crtime   65.042318652
zfs:0:vdev_cache_stats:delegations  72
zfs:0:vdev_cache_stats:hits 11
zfs:0:vdev_cache_stats:misses   158
zfs:0:vdev_cache_stats:snaptime 114232.782154249
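
A one-liner to turn those counters into a ratio - just awk over the same
kstat output, nothing ZFS-specific about it:

kstat -p zfs:0:vdev_cache_stats | awk \
  '/:hits/ {h=$2} /:misses/ {m=$2} END {if (h+m > 0) printf("vdev cache hit ratio: %.1f%%\n", 100*h/(h+m))}'
# with the counters above: 100*11/(11+158), i.e. roughly 6.5%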

However, trying to increase the prefetch size hung my system
almost immediately (in a couple of seconds). I'm away from
it now, so I'll ask for a photo of the console screen :)

root@openindiana:~# echo zfs_vdev_cache_max/W0t16384 | mdb -kw
zfs_vdev_cache_max: 0x4000  =   0x4000
root@openindiana:~# echo zfs_vdev_cache_bshift/W0t20 | mdb -kw
zfs_vdev_cache_bshift:  0x10=   0x14
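
For the record, the settings can be put back to what mdb reported as the old
values above (assuming those old values were indeed the defaults on this
build); the writes are not persistent across a reboot in any case:

echo zfs_vdev_cache_bshift/W0t16 | mdb -kw     # was 0x10 = 16 (64K granularity)
echo zfs_vdev_cache_max/W0t16384 | mdb -kw     # was 0x4000 = 16384
echo zfs_vdev_cache_size/W0t0 | mdb -kw        # back to 0 = cache disabled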


So there are deeper questions:
1) As of Illumos bug #175 (as well as OpenSolaris b148 and,
   if known, Solaris 11), is the vdev prefetch feature
   *removed* from the codebase (not as of oi_148a - what about
   others?), or disabled by default (i.e. the limit is preset
   to 0, tune it yourself)?

2) If it is only disabled, are there solid plans to remove
   it, or can we vote to keep it for those interested? :)

3) If the feature is present and gets enabled, how would
   VDEV prefetch play along with file prefetch, again? ;)

4) Is there some tuneable (after b70) to enable prefetching
   and keeping of user-data as well (not only metadata)?
   Perhaps only so that I could test it with my use-patterns
   to make sure that caching generic sectors is useless for
   me, and I really should revert to caching only metadata?

5) Would it make sense to increase zfs_vdev_cache_bshift?
   For example, when I tried to set it to 20 and prefetch
   a whole 1MB of data, why would that cause the system
   to die? Could it increase cache hit ratios (if it works)?

6) Does the VDEV cache keep ZFS blocks or disk sectors?
   For example, on my 4k disks the blocks are 4k, even
   though there are a few hundred bytes worth of data in
   metadata blocks and 3+KB of slack space.

7) Modern HDDs often have 32-64MB of DRAM cache onboard.
   Is there any reason to match the VDEV cache size with that
   in any way (1:1, 2:1, etc.)?

Thanks again,
//Jim Klimov


2012-01-09 6:06, Richard Elling wrote:

On Jan 8, 2012, at 5:10 PM, Jim Klimov wrote:

2012-01-09 4:14, Richard Elling wrote:

On Jan 7, 2012, at 8:59 AM, Jim Klimov wrote:


I wonder if it is possible (currently or in the future as an RFE)
to tell ZFS to automatically read-ahead some files and cache them
in RAM and/or L2ARC?


See discussions of the ZFS intelligent prefetch algorithm. I think Ben
Rockwood's is the best general description:
http://www.cuddletech.com/blog/pivot/entry.php?id=1040

And a more engineer-focused description is at:
http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Intelligent_prefetch
  -- richard


Thanks for the pointers. While I've seen those articles
(in fact, one of the two non-spam comments in Ben's
blog was mine), rehashing the basics is always useful ;)

Still, how does VDEV prefetch play along with File-level
Prefetch?


Trick question… it doesn't. vdev prefetching is disabled in the OpenSolaris
b148, illumos, and Solaris 11 releases. The benefits of having the vdev cache
for large numbers of disks do not appear to justify the cost. See
http://wesunsolve.net/bugid/id/6684116
https://www.illumos.org/issues/175


For example, if ZFS prefetched 64K from disk
at the SPA level, and those sectors luckily happen to
contain the next blocks of a streaming-read file, would
the file-level prefetch take the data from the RAM cache
or still request it from the disk?


As of b70, vdev_cache only contains metadata. See

Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-09 Thread John Martin

On 01/08/12 10:15, John Martin wrote:


I believe Joerg Moellenkamp published a discussion
several years ago on how the L1ARC attempts to deal with the
pollution of the cache by large streaming reads, but I don't have
a bookmark handy (nor the knowledge of whether the
described behavior is still accurate).


http://www.c0t0d0s0.org/archives/5329-Some-insight-into-the-read-cache-of-ZFS-or-The-ARC.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-09 Thread Jim Klimov

2012-01-09 18:15, John Martin wrote:

On 01/08/12 20:10, Jim Klimov wrote:


Is it true or false that: ZFS might skip the cache and
go to disks for streaming reads?

  (The more I think
 about it, the more senseless this sentence seems, and
 I might have just confused it with ZIL writes of bulk
 data).


I don't believe this was ever suggested. Instead, the question
is: if data is not already in the file system cache and a
large read is made from disk, should the file system
put this data into the cache?


Hmmm... perhaps THIS is what I mistook it for...

Thus the correct version of the question goes like this:
is it true or false that some large reads from disk can
be deemed by ZFS too big and rare to cache in the ARC?
If yes, what conditions are checked to mark a read as
such? Can this behavior be disabled in order to try to
cache every read (still subject to normal eviction
due to MRU/MFU/memory pressure and other considerations)?

Thanks again,
//Jim Klimov


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-09 Thread Bob Friesenhahn

On Mon, 9 Jan 2012, Edward Ned Harvey wrote:


I don't think that's correct...


But it is! :-)


Suppose you write a 1G file to disk.  It is a database store.  Now you start
running your db server.  It starts performing transactions all over the
place.  It overwrites the middle 4k of the file, and it overwrites 512b
somewhere else, and so on.  Since this is COW, each one of these little
writes in the middle of the file will actually get mapped to unused sectors
of disk.  Depending on how quickly they're happening, they may be aggregated


Oops.  I see an error in the above.  Other than tail blocks, or due to 
compression, zfs will not write a COW data block smaller than the zfs 
filesystem blocksize.  If the blocksize was 128K, then updating just 
one byte in that 128K block results in writing a whole new 128K block. 
This is pretty significant write-amplification but the resulting 
fragmentation is still limited by the 128K block size. Remember that 
any fragmentation calculation needs to be based on the disk's minimum 
read (i.e. sector) size.


However, it is worth remembering that it is common to set the block 
size to a much smaller value than the default (e.g. 8K) if the filesystem 
is going to support a database.  In that case it is possible for there 
to be fragmentation for every 8K of data.  The worst-case 
fragmentation percentage for 8K blocks (and 512-byte sectors) is 6.25% 
(100*1/((8*1024)/512)).  That would be a high enough percentage that 
Microsoft Windows defrag would recommend defragging the disk.
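
Bob's arithmetic, spelled out for a few block sizes - the same
100 * 1 / (blocksize / sector) formula, assuming 512-byte sectors:

awk 'BEGIN { sector = 512;
             for (bs = 8192; bs <= 131072; bs *= 2)
               printf("%4dK blocks: worst-case fragmentation = %.2f%%\n",
                      bs / 1024, 100 / (bs / sector)) }'
# prints 6.25% for 8K down to 0.39% for 128K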


Metadata chunks cannot be any smaller than the disk's sector size 
(e.g. 512 bytes or 4KB).  Metadata can be seen as contributing to 
fragmentation, which is why it is so valuable to cache it.  If the 
metadata is not conveniently close to the data, then it may result in 
a big ugly disk seek (same impact as data fragmentation) to read it.


In summary, with zfs's default 128K block size, data fragmentation is 
not a significant issue.  If the zfs filesystem block size is reduced 
to a much smaller value (e.g. 8K) then it can become a significant 
issue.  As Richard Elling points out, a database layered on top of zfs 
may already be fragmented by design.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-09 Thread Jim Klimov

2012-01-09 19:14, Bob Friesenhahn wrote:


In summary, with zfs's default 128K block size, data fragmentation is
not a significant issue. If the zfs filesystem block size is reduced to
a much smaller value (e.g. 8K) then it can become a significant issue.
As Richard Elling points out, a database layered on top of zfs may
already be fragmented by design.


I THINK there is some fallacy in your discussion: I've seen 128K
referred to as the maximum filesystem block size, i.e. for large
streaming writes. For smaller writes ZFS adapts with smaller
blocks. I am not sure how it would rewrite a few bytes inside
a larger block - split it into many smaller ones or COW all 128K.

Intermixing variable-sized indivisible blocks can in turn lead
to more fragmentation than would otherwise be expected/possible ;)

Fixed block sizes are used (only?) for volume datasets.
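
For what it's worth, the knobs in question are the recordsize property on
filesystems (an upper bound, applied only to newly created files) and
volblocksize on zvols (fixed at creation time). A small sketch with made-up
dataset names:

zfs get recordsize tank/fs                      # default 128K, the *maximum* block size
zfs set recordsize=8K tank/db                   # affects newly created files only
zfs create -V 10G -o volblocksize=8K tank/vol   # zvols use one fixed block size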

 If the metadata is not conveniently close to the data, then it may
 result in a big ugly disk seek (same impact as data fragmentation)
 to read it.

Also, I'm not sure about this argument. If VDEV prefetch does not
slurp in data blocks, then by the time the metadata is discovered in
the blocks read from disk and the data block locations are determined,
the disk may have rotated away from the head, so at least one
rotational delay is incurred even if the metadata is immediately
followed by the data it references... no?

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thinking about spliting a zpool in system and data

2012-01-09 Thread Jesus Cea
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 07/01/12 13:39, Jim Klimov wrote:
 I have transitioned a number of systems roughly by the same
 procedure as you've outlined. Sadly, my notes are not in English so
 they wouldn't be of much help directly;

Yes, my Russian is rusty :-).

I have bitten the bullet and spent 3-4 days doing the migration. I
wrote the details here:

http://www.jcea.es/artic/solaris_zfs_split.htm

The page is written in Spanish, but the terminal transcriptions should
be useful for everybody.

In the process, maybe somebody finds this interesting too:

http://www.jcea.es/artic/zfs_flash01.htm

Sorry, Spanish only too.

 Overall, your plan seems okay and has more failsafes than we've had
 - because longer downtimes were affordable ;) However, when doing
 such low-level stuff, you should make sure that you have remote
 access to your systems (ILOM, KVM, etc.; remotely-controlled PDUs
 for externally enforced

Yes, the migration I did had plenty of safety points (you can go back if
something doesn't work) and, most of the time, the system was in a
state able to survive an accidental reboot. Downtime was minimal, less than
an hour in total (several reboots to validate configurations before
proceeding). I am quite pleased with the uneventful migration, but I
planned it quite carefully. I was worried about hitting bugs in Solaris/ZFS,
but in the end it was very smooth.

The machine is hosted remotely, but yes, I have remote KVM. I can't
boot from remote media, but I have an OpenIndiana release on the SSD,
with VirtualBox installed and the Solaris 10 Update 10 release ISO,
just in case :-).

The only suspicious thing is that I keep swap (32GB) and dump
(4GB) in the data zpool, instead of in the system pool. It seems to work OK.
Crossing my fingers for the next Live Upgrade :-).

I read your message only after I had migrated, but it was very
interesting. Thanks for taking the time to write it!

Have a nice 2012.

- -- 
Jesus Cea Avion _/_/  _/_/_/_/_/_/
j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
jabber / xmpp:j...@jabber.org _/_/_/_/  _/_/_/_/_/
.  _/_/  _/_/_/_/  _/_/  _/_/
Things are not so easy  _/_/  _/_/_/_/  _/_/_/_/  _/_/
My name is Dump, Core Dump   _/_/_/_/_/_/  _/_/  _/_/
El amor es poner tu felicidad en la felicidad de otro - Leibniz
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQCVAwUBTwuvNJlgi5GaxT1NAQLJ0wP9EgpQnUdYCiLOnlGK8UC2QodT9s8KuqMK
5F9YwlPLdZ3S1DfWGKgC3k9MLbCfYLihM+KqysblsHs5Jf9/HGYSGK5Ky5HlYB5c
4vO+KrDU2eT/BYIVrDmFCucj8Fh8CN0Ule+Z5JtvhdlN/5rQ+osRmLQXr3SqQm6F
w/ilYwB09+0=
=fGc3
-END PGP SIGNATURE-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss