Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-11 Thread Daniel Carosone
On Thu, Jan 12, 2012 at 03:05:32PM +1100, Daniel Carosone wrote:
> On Sun, Jan 08, 2012 at 06:25:05PM -0800, Richard Elling wrote:
> > ZIL makes zero impact on resilver.  I'll have to check to see if L2ARC is 
> > still used, but
> > due to the nature of the ARC design, read-once workloads like backup or 
> > resilver do 
> > not tend to negatively impact frequently used data.
> 
> This is true, in a strict sense (they don't help resilver itself) but
> it misses the point. They (can) help the system, when resilver is
> underway. 
> 
> ZIL helps reduce the impact busy resilvering disks have on other system

Well, since I'm being strict and picky, I should of course say ZIL-on-slog.

> operation (sync write syscalls and vfs ops by apps).  L2ARC, likewise
> for reads.  Both can hide the latency increases that resilvering iops
> cause for the disks (and which the throttle you mentioned also
> attempts to minimise). 

--
Dan.




Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-11 Thread Daniel Carosone
On Sun, Jan 08, 2012 at 06:25:05PM -0800, Richard Elling wrote:
> ZIL makes zero impact on resilver.  I'll have to check to see if L2ARC is 
> still used, but
> due to the nature of the ARC design, read-once workloads like backup or 
> resilver do 
> not tend to negatively impact frequently used data.

This is true, in a strict sense (they don't help resilver itself) but
it misses the point. They (can) help the system, when resilver is
underway. 

ZIL helps reduce the impact busy resilvering disks have on other system
operation (sync write syscalls and vfs ops by apps).  L2ARC, likewise
for reads.  Both can hide the latency increases that resilvering iops
cause for the disks (and which the throttle you mentioned also
attempts to minimise). 
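(Concretely, that means a dedicated slog and a cache device added
along these lines - pool and device names here are just placeholders:)

  zpool add tank log c4t0d0     # dedicated slog absorbs the sync writes
  zpool add tank cache c4t1d0   # L2ARC serves reads that miss the ARC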

--
Dan.




Re: [zfs-discuss] How many "rollback" TXGs in a ring for 4k drives?

2012-01-11 Thread Richard Elling
On Jan 11, 2012, at 5:01 AM, Jim Klimov wrote:

> Hello all, I found this dialog on the zfs-de...@zfsonlinux.org list,
> and I'd like someone to confirm-or-reject the discussed statement.
> Paraphrasing in my words and understanding:
>  "Labels, including Uberblock rings, are fixed 256KB in size each,
>   of which 128KB is the UB ring. Normally there is 1KB of data in
>   one UB, which gives 128 TXGs to rollback to. When ashift=12 is
>   used for 4k-sector disks, each UB is allocated a 4KB block, of
>   which 3KB is padding. And now we only have 32 TXGs of rollback."
> 
> Is this understanding correct?

Yes.
 -- richard

> That's something I did not think of
> previously, indeed...
> 
> Thanks,
> //Jim
> 
> 
> http://groups.google.com/a/zfsonlinux.org/group/zfs-devel/browse_thread/thread/182c92911950ccd6/aa2ad1fdaf7d7a07?pli=1
> 
> On Aug 1, 2011, at 2:17 PM, Brian Behlendorf wrote:
> > On Mon, 2011-08-01 at 10:21 -0700, Zachary Bedell wrote:
> >> Given that uberblocks are 1k in size normally
> >> and that they're supposed to be written out in single atomic
> >> operations, does setting ashift=12 cause ZFS to pad out the uberblock
> >> to 4k so that only one block is in each atomic write?  Assuming it
> >> does so, the label size must still be the normal 256k which would
> >> leave fewer uberblock slots in the ring (32 instead of 128)?
> 
> > Exactly right.  When ashift=12 then the uberblock size is padded out
> > to 4k.  That means only 32 uberblocks fit in the on-disk space
> > reserved for the ring.  It's one of the lesser known side effects
> > of increasing the ashift.

-- 

ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/


[zfs-discuss] Clarifications wanted for ZFS spec

2012-01-11 Thread Jim Klimov

I'm reading the "ZFS On-disk Format" PDF (dated 2006 -
are there newer releases?), and have some questions
regarding whether it is outdated:

1) On page 16 it has the following phrase (which I think
is in general invalid):
  The value stored in offset is the offset in terms of
  sectors (512 byte blocks). To find the physical block
  byte offset from the beginning of a slice, the value
  inside offset must be shifted over (<<) by 9 (2^9=512)
  and this value must be added to 0x400000 (size of two
  vdev_labels and boot block).

Is this calculation really done with a hard-coded 2^9,
or with the VDEV-dependent ashift value (i.e. 2^12 for
4k disks, 2^10 for default raidz, etc.)?
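(Taking the spec's formula at face value, the calculation would be -
with a made-up DVA offset value purely for illustration:)

  # physical offset = (DVA offset << 9) + 0x400000
  # (0x400000 = two 256KB labels + the 3.5MB boot block)
  printf '0x%x\n' $(( (0x1a40 << 9) + 0x400000 ))   # 0x1a40 is an example value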

2) Likewise, in Section 2.6 (block size entries) the
values of lsize/psize/asize are said to be represented
by the number of 512-byte sectors. Does this statement
hold true for ashift!=9 VDEVs/pools as well?

3) In Section 1.3 they discuss the format of VDEV labels.
As I'm researching this with the intent of repairing my
pool's label (core problem posted yesterday in thread
"Doublefree/doubledelete"), I wondered if the labels
are protected by any checksums. The document does not
state anything about it, so I guess the labels are only
protected by 4-way redundancy - that's it?..

4) As I asked today in thread "How many rollback TXGs
in a ring for 4k drives?", there was an understanding
by our Linux-ZFS comrades that each uberblock takes up
some amount of disk blocks, with minimal allocation
based on ashift value; thus on ashift=12 pools there
are only 32 rollback TXGs.

The PDF spec (section 1.3 overview) states that each
UB entry size is 1KB as part of the label structure;
does this mean that for ashift=12 pools there are 128
entries as well? If this is the case, I think the
Linux guys should be informed, to avoid incompatible
implementations ;)

5) The label contains an NVList of "related" VDEVs...
does this effectively limit the number of devices which
can comprise a (top-level) VDEV?

I have seen some blog entry (Eric Schrock's, I think)
where the author discussed the initial graph-based
VDEV indexation, with each VDEV referring to about
3 neighbors; during import/scan it was possible to
either find all devices or deduce that some (and which)
are missing. But due to some drawbacks of that ASCII
based implementation they moved to NVLists.

I wonder if such missing-device inference is still done
now, or whether there can be as many VDEVs as will fit
into the 112KB of NVList space?

That's about all I can ask for the first 10 pages
of the spec text ;)

Thanks,
//Jim Klimov




Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-11 Thread Jim Klimov

2012-01-11 20:40, Nico Williams wrote:

On Wed, Jan 11, 2012 at 9:16 AM, Jim Klimov  wrote:

I've recently had a sort of an opposite thought: yes,
ZFS redundancy is good - but also expensive in terms
of raw disk space. This is especially bad for hardware
space-constrained systems like laptops and home-NASes,
where doubling the number of HDDs (for mirrors) or
adding tens of percent of storage for raidZ is often
not practical for whatever reason.


Redundancy through RAID-Z and mirroring is expensive for home systems
and laptops, but mostly due to the cost of SATA/SAS ports, not the
cost of the drives.  The drives are cheap, but getting an extra disk
in a laptop is either impossible or expensive.  But that doesn't mean
you can't mirror slices or use ditto blocks.  For laptops just use
ditto blocks and either zfs send or external mirror that you
attach/detach.


Yes, basically that's what we do now, and it halves the
available disk space and increases latency (extra seeks) ;)

I get (and share) your concern about ECC entry size for
larger blocks. NOTE: I don't know the ECC algorithms
deeply enough to speculate about space requirements,
except that the per-word overhead used in networking/RAM
is fairly small (e.g. 8 check bits per 64-bit word for
SECDED RAM).

I'm reading the "ZFS On-disk Format" PDF (dated 2006 -
are there newer releases?), and on page 15 the blkptr_t
structure has 192 bits of padding before TXG. Can't that
be used for a reasonably large ECC code?
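(A very rough size check - assuming a plain single-error-correcting
Hamming code over the whole block, which is not anything ZFS actually
implements, just an illustration of scale:)

  m=$((128 * 1024 * 8))   # data bits in one 128KB block
  p=1
  while [ $(( 1 << p )) -lt $(( m + p + 1 )) ]; do p=$(( p + 1 )); done
  echo "$p parity bits"   # prints 21: easily fits in 192 bits of padding,
                          # but corrects only one flipped bit per 128KB block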

Besides, I see that blkptr_t is 128 bytes in size.
This leaves us with some slack space in a physical
sector, which can be "abused" without extra costs -
(512-128) or (4096-128) bytes worth of {ECC} data.
Perhaps the padding space (near TXG entry) could
be used to specify that the blkptr_t bytes are
immediately followed by ECC bytes (and their size,
probably dependent on data block length), so that
larger on-disk block pointer blocks could be used
on legacy systems as well (using several contiguous
512 byte sectors). After successful reads from disk,
this ECC data can be discarded to save space in
ARC/L2ARC allocation (especially if every byte of
memory is ECC protected anyway).

Even if the storage ideas above are not practical,
perhaps ECC codes can at least be used for smaller
blocks (i.e. {indirect} block pointer contents and
metadata might be "guaranteed" to be small enough).
If nothing else, this could save mechanical seeks
when a CKSUM error is detected during a normal ZFS
read but the ECC information carried in the referring
block is enough to repair the block in place. In that
case we don't need to re-request data from another
disk... and we gain some error-resiliency besides
ditto blocks (already enforced for metadata) or
raidz/mirrors. While it is (barely) possible that
all ditto replicas are broken, there's a non-zero
chance that at least one is recoverable :)





Current ZFS checksums allow us to detect errors, but
in order for recovery to actually work, there should be
a redundant copy and/or parity block available and valid.

Hence the question: why not put ECC info into ZFS blocks?


RAID-Zn *is* an error correction system.  But what you are asking for
is a same-device error correction method that costs less than ditto
blocks, with error correction data baked into the blkptr_t.  Are there
enough free bits left in the block pointer for error correction codes
for large blocks?  (128KB blocks, but eventually ZFS needs to support
even larger blocks, so keep that in mind.)  My guess is: no.  Error
correction data might have to get stored elsewhere.

I don't find this terribly attractive, but maybe I'm just not looking
at it the right way.  Perhaps there is a killer enterprise feature for
ECC here: stretching MTTDL in the face of a device failure in a mirror
or raid-z configuration (but if failures are typically of whole drives
rather than individual blocks, then this wouldn't help).  But without
a good answer for where to store the ECC for the largest blocks, I
don't see this happening.


Well, it is often mentioned that (by Murphy's Law if nothing
else) device failures in RAID are often not single-device
failures. Traditional RAID5 arrays tended to die while
rebuilding a dead disk onto a spare and then hitting a read
error on one of the remaining, no-longer-redundant disks.

Per-block ECC could be used in that case to recover from
bit-rot errors on the remaining live disks when RAID-Zn or
mirroring can't help, decreasing the chance that tape backup
is the only remaining recovery option...

//Jim Klimov


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-11 Thread Nico Williams
On Wed, Jan 11, 2012 at 9:16 AM, Jim Klimov  wrote:
> I've recently had a sort of an opposite thought: yes,
> ZFS redundancy is good - but also expensive in terms
> of raw disk space. This is especially bad for hardware
> space-constrained systems like laptops and home-NASes,
> where doubling the number of HDDs (for mirrors) or
> adding tens of percent of storage for raidZ is often
> not practical for whatever reason.

Redundancy through RAID-Z and mirroring is expensive for home systems
and laptops, but mostly due to the cost of SATA/SAS ports, not the
cost of the drives.  The drives are cheap, but getting an extra disk
in a laptop is either impossible or expensive.  But that doesn't mean
you can't mirror slices or use ditto blocks.  For laptops just use
ditto blocks and either zfs send or external mirror that you
attach/detach.

> Current ZFS checksums allow us to detect errors, but
> in order for recovery to actually work, there should be
> a redundant copy and/or parity block available and valid.
>
> Hence the question: why not put ECC info into ZFS blocks?

RAID-Zn *is* an error correction system.  But what you are asking for
is a same-device error correction method that costs less than ditto
blocks, with error correction data baked into the blkptr_t.  Are there
enough free bits left in the block pointer for error correction codes
for large blocks?  (128KB blocks, but eventually ZFS needs to support
even larger blocks, so keep that in mind.)  My guess is: no.  Error
correction data might have to get stored elsewhere.

I don't find this terribly attractive, but maybe I'm just not looking
at it the right way.  Perhaps there is a killer enterprise feature for
ECC here: stretching MTTDL in the face of a device failure in a mirror
or raid-z configuration (but if failures are typically of whole drives
rather than individual blocks, then this wouldn't help).  But without
a good answer for where to store the ECC for the largest blocks, I
don't see this happening.

Nico
--


[zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-11 Thread Jim Klimov

Hello all, I have a new "crazy idea" of the day ;)

  Some years ago there was an idea proposed in one of the ZFS
developers' blogs (maybe Jeff's? sorry, I can't find the link
now) that went somewhat along these lines:

   Modern disks have some ECC/CRC codes for each sector,
   and use them to check read-in data. If the disk fails
   to produce a sector correctly, it tries harder to read
   it and reallocates the LBA from a spare-sector region,
   if possible. This leads to some more random IO for
   linearly-numbered LBA sectors, as well as waste of
   disk space for spare sectors and checksums - at least
   in comparison to the better error-detection and
   redundancy of ZFS checksums. Besides, attempts to
   re-read a faulty sector may succeed or may produce
   undetected garbage, and take some time (maybe seconds)
   if the retries fail consistently. Then the block is
   marked bad and the data is lost.

   The article went on to suggest "let's get an OEM vendor
   to give us the same disks without those kludges, and we'll
   get (20%?) more platter speed and capacity, better used
   by ZFS's own error-detection and repair mechanisms".

I've recently had a sort of an opposite thought: yes,
ZFS redundancy is good - but also expensive in terms
of raw disk space. This is especially bad for hardware
space-constrained systems like laptops and home-NASes,
where doubling the number of HDDs (for mirrors) or
adding tens of percent of storage for raidZ is often
not practical for whatever reason.

Current ZFS checksums allow us to detect errors, but
in order for recovery to actually work, there should be
a redundant copy and/or parity block available and valid.

Hence the question: why not put ECC info into ZFS blocks?
IMHO, pluggable ECC (like pluggable compression or
varied checksums - in this case ECC algorithms allowing
recovery of 1 or 2 flipped bits, for example) would be
cheaper in disk space than full redundancy (a few %
instead of 25-50%), and would still allow recovery from
certain errors, such as on-disk or on-wire bit rot,
even in single-disk ZFS pools.

This could be an inheritable per-dataset attribute
like compression, encryption, dedup or checksum
algorithms.
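(Purely hypothetical syntax, mirroring how compression is set today -
the "ecc" property name and its values are invented for illustration;
no such property exists:)

  zfs set ecc=on  pool/important-data
  zfs set ecc=off pool/scratch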

Relocating recovered "faulted" blocks into currently
free space is already part of ZFS; what is new is that
it might have to track lists of "permanently bad blocks"
and the correspondingly reduced addressable space on
each leaf VDEV. There should also be a mechanism to
retest and clear such blocks, e.g. when a faulty drive
or LUN is replaced by a new one (perhaps by dd'ing the
old drive onto the new one and swapping it in while the
pool is offline) - probably as a special scrub-like
subcommand of zpool, also invoked during a scrub.

This may be combined with the wish for OEM disks that
lack hardware ECC/spare sectors in return for more
performance; although I'm not sure how good that would
be in practice - the drive maker's in-depth knowledge
of how to retry reading an initially "faulty" sector,
e.g. by changing voltages or platter speed or whatever,
may be invaluable and not replaceable by software.

What do you think? Doable? Useful? Why not, if not? ;)

Thanks,
//Jim Klimov


[zfs-discuss] How many "rollback" TXGs in a ring for 4k drives?

2012-01-11 Thread Jim Klimov

Hello all, I found this dialog on the zfs-de...@zfsonlinux.org list,
and I'd like someone to confirm-or-reject the discussed statement.
Paraphrasing in my words and understanding:
  "Labels, including Uberblock rings, are fixed 256KB in size each,
   of which 128KB is the UB ring. Normally there is 1KB of data in
   one UB, which gives 128 TXGs to rollback to. When ashift=12 is
   used for 4k-sector disks, each UB is allocated a 4KB block, of
   which 3KB is padding. And now we only have 32 TXGs of rollback."

Is this understanding correct? That's something I did not think of
previously, indeed...
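
(A quick back-of-the-envelope check of that arithmetic - the 128KB
ring size comes from the statement above, the rest is just shell:)

  ring=$((128 * 1024))           # uberblock ring inside each 256KB label
  echo $(( ring / 1024 ))        # 1KB per uberblock slot  -> 128 TXGs
  echo $(( ring / (1 << 12) ))   # ashift=12: 4KB per slot ->  32 TXGs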

Thanks,
//Jim


http://groups.google.com/a/zfsonlinux.org/group/zfs-devel/browse_thread/thread/182c92911950ccd6/aa2ad1fdaf7d7a07?pli=1

On Aug 1, 2011, at 2:17 PM, Brian Behlendorf wrote:
> On Mon, 2011-08-01 at 10:21 -0700, Zachary Bedell wrote:
>> Given that uberblocks are 1k in size normally
>> and that they're supposed to be written out in single atomic
>> operations, does setting ashift=12 cause ZFS to pad out the uberblock
>> to 4k so that only one block is in each atomic write?  Assuming it
>> does so, the label size must still be the normal 256k which would
>> leave fewer uberblock slots in the ring (32 instead of 128)?

> Exactly right.  When ashift=12 then the uberblock size is padded out
> to 4k.  That means only 32 uberblocks fit in the on-disk space
> reserved for the ring.  It's one of the lesser known side effects
> of increasing the ashift.




Re: [zfs-discuss] RFE: add an option/attribute to import ZFS pool without automounting/sharing ZFS datasets

2012-01-11 Thread Jim Klimov

2012-01-11 16:00, Darren J Moffat wrote:



On 01/11/12 11:48, Jim Klimov wrote:

I think about adding the following RFE to illumos bugtracker:
add an option/attribute to import ZFS pool without
automounting/sharing ZFS datasets

I wonder if something like this (like a tricky workaround)
is already in place?



-N

Import the pool without mounting any file systems.


If it isn't mounted it can't be shared.



Thanks!
Sounds good, except that I don't see this description in the
manpages (oi_148a LiveUSB I'm currently repair-booting from).
The flag is listed in the command-line help though (zpool -h).

Thanks, I'll try that the next boot ;)
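Presumably something along these lines (the pool name is just mine):

  zpool import -N pool    # import without mounting or sharing anything
  zpool scrub pool        # carry on with the repair/scrub work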
//Jim


Re: [zfs-discuss] RFE: add an option/attribute to import ZFS pool without automounting/sharing ZFS datasets

2012-01-11 Thread Darren J Moffat



On 01/11/12 11:48, Jim Klimov wrote:

I think about adding the following RFE to illumos bugtracker:
add an option/attribute to import ZFS pool without
automounting/sharing ZFS datasets

I wonder if something like this (like a tricky workaround)
is already in place?



 -N

 Import the pool without mounting any file systems.


If it isn't mounted it can't be shared.

--
Darren J Moffat


[zfs-discuss] RFE: add an option/attribute to import ZFS pool without automounting/sharing ZFS datasets

2012-01-11 Thread Jim Klimov

I think about adding the following RFE to illumos bugtracker:
  add an option/attribute to import ZFS pool without
  automounting/sharing ZFS datasets

I wonder if something like this (like a tricky workaround)
is already in place?

--
My rationale is the currently ongoing repairs and inspections
of my pool, which often require reboots and lengthy imports
of the pool in order to start/continue scrubbing. During this
time I don't benefit from automounting or sharing the datasets;
it only adds some 3+ minute delays to the import.

Actually, since the pool is in an uncertain state, inadvertent
writes into its datasets *might* potentially be harmful and
would best be avoided.

I currently do this with "zpool import pool; zfs umount -a".
However, skipping the mount step in the first place would
suit me better and be faster ;)
--

Thanks,
//Jim



Re: [zfs-discuss] Unable to allocate dma memory for extra SGL

2012-01-11 Thread Hung-Sheng Tsao (Lao Tsao 老曹) Ph. D.



On 1/10/2012 9:44 PM, Ray Van Dolson wrote:

On Tue, Jan 10, 2012 at 06:23:50PM -0800, Hung-Sheng Tsao (laoTsao) wrote:

How much RAM does it have, what is the zpool setup, and what are your
HBA and HDD sizes and types?

Hmm, actually this system has only 6GB of memory.  For some reason I
though it had more.

IMHO,  you will need more RAM
did you cap the ARC in /etc/system?
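
(e.g. a line of this sort in /etc/system - the 4GB figure below is
only an example; pick a value well below physical RAM:)

  * example only: cap the ARC at 4GB
  set zfs:zfs_arc_max = 0x100000000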



The controller is an LSISAS2008 (which oddly enough does not seem to be
recognized by lsiutil).

There are 23x1TB disks (SATA interface, not SAS unfortunately) in the
system.  Three RAIDZ2 vdevs of seven disks each plus one spare comprise
a single zpool with two zfs file systems mounted (no deduplication or
compression in use).

There are two internally mounted Intel X-25E's -- these double as the
rootpool and ZIL devices.

There is an 80GB X-25M mounted to the expander along with the 1TB
drives operating as L2ARC.
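
(So the pool is roughly this shape - the device names below are made
up, and whether the slog pair is mirrored isn't stated above:)

  zpool create tank \
    raidz2 c2t0d0  c2t1d0  c2t2d0  c2t3d0  c2t4d0  c2t5d0  c2t6d0 \
    raidz2 c2t7d0  c2t8d0  c2t9d0  c2t10d0 c2t11d0 c2t12d0 c2t13d0 \
    raidz2 c2t14d0 c2t15d0 c2t16d0 c2t17d0 c2t18d0 c2t19d0 c2t20d0 \
    spare c2t21d0 \
    log c3t0d0s3 c3t1d0s3 \
    cache c2t22d0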


On Jan 10, 2012, at 21:07, Ray Van Dolson  wrote:


Hi all;

We have a Solaris 10 U9 x86 instance running on Silicon Mechanics /
SuperMicro hardware.

Occasionally under high load (ZFS scrub for example), the box becomes
non-responsive (it continues to respond to ping but nothing else works
-- not even the local console).  Our only solution is to hard reset
after which everything comes up normally.

Logs are showing the following:

  Jan  8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
  Jan  8 09:44:08 prodsys-dmz-zfs2        MPT SGL mem alloc failed
  Jan  8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
  Jan  8 09:44:08 prodsys-dmz-zfs2        Unable to allocate dma memory for extra SGL.
  Jan  8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
  Jan  8 09:44:08 prodsys-dmz-zfs2        Unable to allocate dma memory for extra SGL.
  Jan  8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
  Jan  8 09:44:10 prodsys-dmz-zfs2        Unable to allocate dma memory for extra SGL.
  Jan  8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
  Jan  8 09:44:10 prodsys-dmz-zfs2        MPT SGL mem alloc failed
  Jan  8 09:44:11 prodsys-dmz-zfs2 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free

I am able to resolve the last error by adjusting upwards the duplicate
request cache sizes, but have been unable to find anything on the MPT
SGL errors.

Anyone have any thoughts on what this error might be?

At this point, we are simply going to apply patches to this box (we do
see an outstanding mpt patch):

147150 --<  01 R-- 124 SunOS 5.10_x86: mpt_sas patch
147702 --<  03 R--  21 SunOS 5.10_x86: mpt patch

But we have another identically configured box at the same patch level
(admittedly with slightly less workload, though it also undergoes
monthly zfs scrubs) which does not experience this issue.

Ray

Thanks,
Ray


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-11 Thread Jim Klimov

2012-01-11 1:26, Jim Klimov wrote:

To follow up on the subject of VDEV caching (even if
only of metadata): in oi_148a I have found the disabling
entry in /etc/system on the LiveUSB:

set zfs:zfs_vdev_cache_size=0


Now that I have the cache turned on and my scrub
continues, cache efficiency so far happens to be
75%. Not bad for a feature turned off by default:

# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class misc
zfs:0:vdev_cache_stats:crtime 60.67302806
zfs:0:vdev_cache_stats:delegations 22619
zfs:0:vdev_cache_stats:hits 32989
zfs:0:vdev_cache_stats:misses 10676
zfs:0:vdev_cache_stats:snaptime 39898.161717983

//Jim


And at this point I can only guess that the caching effect
becomes incredible (at least for a feature disabled by
default and dismissed as useless/harmful) - if I read the
numbers correctly, a 99+% cache hit ratio with just VDEV
prereads:

# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:classmisc
zfs:0:vdev_cache_stats:crtime   60.67302806
zfs:0:vdev_cache_stats:delegations  23398
zfs:0:vdev_cache_stats:hits 1309308
zfs:0:vdev_cache_stats:misses   11592
zfs:0:vdev_cache_stats:snaptime 89207.679698161
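
(The ratio can be read straight out of kstat, e.g. with a throwaway
one-liner like this - nothing here is pool-specific:)

  kstat -p zfs:0:vdev_cache_stats | awk '
      /:hits/   { h = $2 }
      /:misses/ { m = $2 }
      END { printf("hit ratio: %.1f%%\n", 100 * h / (h + m)) }'
  # with the numbers above: 1309308 / (1309308 + 11592) ~= 99.1%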

True, the task (scrubbing) is metadata-intensive :)
Still, in the future, when beginning a scrub the
system might auto-tune (or at least suggest enabling)
the VDEV prefetch, perhaps with larger strides...

BTW, what does the "delegations" field mean? ;)


--


Jim Klimov (Климов Евгений)
CTO, JSC COS&HT (ЗАО "ЦОС и ВТ")
+7-903-7705859 (cellular)   mailto:jimkli...@cos.ru
CC: ad...@cos.ru, jimkli...@gmail.com

