[zfs-discuss] ZFS Import failed: Request rejected: too large for CDB

2009-11-25 Thread Ortwin Herbst
Hello,

I am new to this list, but I have a big problem:

We have a Sun Fire V440 with a SCSI RAID system connected. I can see all the
devices and partitions.

After a failure in the UPS system, the zpool is not accessible anymore.

The zpool is a plain stripe over 4 partitions.

First we ran "zpool export Produktion" to keep the pool in order.

But now we cannot import the pool anymore, and we get the following error.

The command was: zpool import -f Produktion

Nov 25 10:41:06 MDSP scsi: WARNING: /p...@1f,70/s...@2,1/s...@0,0 (sd25):
Nov 25 10:41:06 MDSP    Request rejected: too large for CDB: lba:0x30022cbd2 len:0x0010
Nov 25 10:41:06 MDSP scsi: WARNING: /p...@1f,70/s...@2,1/s...@0,0 (sd25):
Nov 25 10:41:06 MDSP    Request rejected: too large for CDB: lba:0x30022cbd2 len:0x0010
Nov 25 10:41:06 MDSP scsi: WARNING: /p...@1f,70/s...@2,1/s...@0,0 (sd25):
Nov 25 10:41:06 MDSP    Request rejected: too large for CDB: lba:0x30022cdd2 len:0x0010
Nov 25 10:41:06 MDSP scsi: WARNING: /p...@1f,70/s...@2,1/s...@0,0 (sd25):
Nov 25 10:41:06 MDSP    Request rejected: too large for CDB: lba:0x30022cdd2 len:0x0010
Nov 25 10:41:06 MDSP scsi: WARNING: /p...@1f,70/s...@2,1/s...@0,0 (sd25):
Nov 25 10:41:06 MDSP    Request rejected: too large for CDB: lba:0x30022cdd2 len:0x0010

  pool: Produktion
    id: 64650935418607444
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        Produktion  FAULTED   corrupted data
          c7t0d1    ONLINE
          c7t0d2    ONLINE
          c7t0d3    ONLINE
          c7t0d4    ONLINE

 

Is there any chance to import the pool again?

Thanks in advance for any help.

Kind regards,

Ortwin Herbst
Certified Business IT Specialist (Gepr. Wirtschaftsinformatiker)

Rädler GmbH
IT systems for the graphic arts industry
Conradtystraße 43
90441 Nürnberg
Phone: 09 11 / 9 56 61 00
Fax: 09 11 / 9 56 61 80
Mobile: 01 72 / 8 64 68 33

Registered office: Landsberg/Lech
Court of registration: Augsburg
Commercial register number: HRB19775
Managing directors: Josef Jordan, Wolfgang Rädler

= For every problem there is a solution: the simple one, the quick one, and the
wrong one.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Import failed: Request rejected: too large for CDB

2009-11-25 Thread Chris Gerhard
Your pool is on a device that requires a 16-byte CDB to address the entire LUN;
that is, the LUN is more than 2 TB in size. However, the host bus adapter driver
that is being used does not support 16-byte CDBs.
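(A rough sanity check of that diagnosis, using my own arithmetic rather than
anything beyond the LBA in the log: a 10-byte READ/WRITE CDB carries a 32-bit
LBA, so with 512-byte sectors it can address at most 2^32 * 512 B = 2 TiB. The
rejected LBA above, 0x30022cbd2, is about 12.9 billion sectors, i.e. roughly
6 TiB into the LUN, well beyond what a 10-byte CDB can reach.)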

Quite how you got into this situation (i.e. how you could create the volume) I
don't know, unless you have grown the LUN since the pool was created, or the host
bus adapter driver has somehow been downgraded since the pool was created.

--chris
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128

2009-11-25 Thread Kjetil Torgrim Homme
Daniel Carosone d...@geek.com.au writes:

 you can fetch the cr_txg (cr for creation) for a
 snapshot using zdb,

 yes, but this is hardly an appropriate interface.

agreed.

 zdb is also likely to cause disk activity because it looks at many
 things other than the specific item in question.

I'd expect meta-information like this to fit comfortably in RAM over
extended amounts of time.  haven't tried, though.

 but the very creation of a snapshot requires a new
 txg to note that fact in the pool.

 yes, which is exactly what we're trying to avoid, because it requires
 disk activity to write.

you missed my point: you can't compare the current txg to an old cr_txg
directly, since the current txg value will be at least 1 higher, even if
no changes have been made.
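(for reference, the zdb incantations under discussion look roughly like this --
an untested sketch where "tank" and "tank/fs@snap" are placeholders and zdb
output formats vary between builds:

  zdb -d tank/fs@snap | grep cr_txg   # creation txg from the dataset header
  zdb -u tank | grep txg              # txg of the currently active uberblock

and, per the above, the uberblock txg will normally already be ahead of the
snapshot's cr_txg, so the two can't simply be compared for equality.)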

 if the snapshot is taken recursively, all snapshots will have the
 same cr_txg, but that requires the same configuration for all
 filesets.

 again, yes, but that's irrelevant - the important knowledge at this
 moment is that the txg has not changed since last time, and that thus
 there will be no benefit in taking further snapshots, regardless of
 configuration.

yes, that's what we're trying to establish, and it's easier when
all snapshots are committed in the same txg.
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread Paul Kraus
Richard,
First, thank you for the detailed reply ... (comments in line below)

On Tue, Nov 24, 2009 at 6:31 PM, Richard Elling
richard.ell...@gmail.com wrote:
 more below...

 On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote:

 On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
 richard.ell...@gmail.com wrote:

 Try disabling prefetch.

 Just tried it... no change in random read (still 17-18 MB/sec for a
 single thread), but sequential read performance dropped from about 200
 MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
 accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
 arcstat.pl shows that the vast majority (95%) of reads are missing
 the cache.

 hmmm... more testing needed. The question is whether the low
 I/O rate is because of zfs itself, or the application? Disabling prefetch
 will expose the application, because zfs is not creating additional
 and perhaps unnecessary read I/O.

The values reported by iozone are in pretty close agreement with what
we are seeing with iostat during the test runs. Compression is off on
zfs (the iozone test data compresses very well and yields bogus
results). I am looking for a good alternative to iozone for random
testing. I did put together a crude script to spawn many dd processes
accessing the block device itself, each with a different seek over the
range of the disk, and saw results much greater than the iozone
single-threaded random performance.

 Your data which shows the sequential write, random write, and
 sequential read driving actv to 35 is because prefetching is enabled
 for the read.  We expect the writes to drive to 35 with a sustained
 write workload of any flavor.

Understood. I tried tuning the queue size to 50 and observed that the
actv went to 50 (with very little difference in performance), so
returned it to the default of 35.

 The random read (with cache misses)
 will stall the application, so it takes a lot of threads (16?) to keep
 35 concurrent I/Os in the pipeline without prefetching.  The ZFS
 prefetching algorithm is intelligent so it actually complicates the
 interpretation of the data.

What bothers me is that iostat is showing the 'disk' device as
not being saturated during the random read test. I'll post iostat
output that I captured yesterday to http://www.ilk.org/~ppk/Geek/ You
can clearly see the various test phases (sequential write, rewrite,
sequential read, reread, random read, then random write).

 You're peaking at 658 256KB random IOPS for the 3511, or ~66
 IOPS per drive.  Since ZFS will max out at 128KB per I/O, the disks
 see something more than 66 IOPS each.  The IOPS data from
 iostat would be a better metric to observe than bandwidth.  These
 drives are good for about 80 random IOPS each, so you may be
 close to disk saturation.  The iostat data for IOPS and svc_t will
 confirm.

But ... if I am saturating the 3511 with one thread, then why do I get
many times that performance with multiple threads ?

 The T2000 data (sheet 3) shows pretty consistently around
 90 256KB IOPS per drive. Like the 3511 case, this is perhaps 20%
 less than I would expect, perhaps due to the measurement.

I ran the T2000 test to see if 10U8 behaved better and to make sure I
wasn't seeing an oddity of the 480 / 3511 case. I wanted to see if the
random read behavior was similar, and it was (in relative terms).

 Also, the 3511 RAID-5 configuration will perform random reads at
 around 1/2 IOPS capacity if the partition offset is 34.  This was the
 default long ago.  The new default is 256.

Our 3511's have been running 421F (latest) for a long time :-) We are
religious about keeping all the 3511 FW current and matched.

 The reason is that with
 a 34 block offset, you are almost guaranteed that a larger I/O will
 stride 2 disks.  You won't notice this as easily with a single thread,
 but it will be measurable with more threads. Double check the
 offset with prtvtoc or format.

How do I check the offset ... format -> verify output from one of the partitions is below:

format> ver

Volume name = <        >
ascii name  = <SUN-StorEdge 3511-421F-517.23GB>
bytes/sector    =  512
sectors = 1084710911
accessible sectors = 1084710878
Part      Tag    Flag     First Sector         Size         Last Sector
  0        usr    wm                256     517.22GB          1084694494
  1 unassigned    wm                  0            0                   0
  2 unassigned    wm                  0            0                   0
  3 unassigned    wm                  0            0                   0
  4 unassigned    wm                  0            0                   0
  5 unassigned    wm                  0            0                   0
  6 unassigned    wm                  0            0                   0
  8   reserved    wm         1084694495       8.00MB          1084710878

format>

 Writes are a completely different matter.  ZFS has a tendency to
 turn random writes into sequential writes, so it is pretty much
 useless to look at random write 

Re: [zfs-discuss] sharemgr

2009-11-25 Thread rwalists
On Nov 24, 2009, at 3:41 PM, dick hoogendijk wrote:

 I have a solution using zfs set sharenfs=rw,nosuid zpool, but I prefer
 to use the sharemgr command.

 Then you prefer wrong. ZFS filesystems are not shared this way.
 Read up on ZFS and NFS.

It can also be done with sharemgr.  Sharing via ZFS creates a sharemgr group
called 'zfs', but you can also share things directly via the sharemgr commands.
It is fairly well spelled out in the manpage:

http://docs.sun.com/app/docs/doc/819-2240/sharemgr-1m?a=view

Basically you want to create a group, set the group's properties and add a 
share to the group.
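Something along these lines (untested, from memory of the sharemgr(1M) manpage;
the group name, path and property values here are only examples to adapt):

sharemgr create -P nfs mygroup
sharemgr set -P nfs -p nosuid=true mygroup
sharemgr add-share -s /export/data -d "data share" mygroup
sharemgr show -vp mygroup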


--Ware


Re: [zfs-discuss] sharemgr

2009-11-25 Thread Kyle McDonald

dick hoogendijk wrote:
 glidic anthony wrote:
  I have a solution using zfs set sharenfs=rw,nosuid zpool, but I prefer
  to use the sharemgr command.

 Then you prefer wrong.

To each their own.

 ZFS filesystems are not shared this way.
They can be. I do it all the time. There's nothing technical that 
dictates that sharemgr can't be used on ZFS filesystems.
Just because ZFS provides an alternate way, that doesn't make it the 
only way, or even the 'one true way.'


About the only advantage I can see of using zfs share is inheritance.
If you don't need that, then sharemgr is just as good, and there are
cases where it may be simpler. For instance, I loopback-mount many
ISOs, and need to use sharemgr to share those anyway, so I find it much
more convenient to manage all my shares in one place with one tool.


If sharemgr could (optionally) manage inherited sharing on ZFS
filesystems, then I think it'd be cleaner to suggest that users use the
one system-wide sharing tool, rather than one that only works for one
filesystem. I can't remember them right now, but I think there are other
commands where ZFS seems to have done the same thing, and I can't figure
out why that's the trend. As great as ZFS is, it won't ever be the only
filesystem around; ISOs (at least) will be around for a long time
still.  Why start forcing users to learn new tools for each filesystem
type?

 Read up on ZFS and NFS.

What makes you think he didn't?

While the docs do describe how you can optionally use zfs share (which
he clearly read about, since he mentioned it), they don't prohibit using
sharemgr. I read his question as "How can I get sharemgr to set up
sharing so that it gets inherited on child filesystems?"


Apparently the answer to that question is "You can't." If you want to
set it up only once you need zfs share, and if you really want to use
sharemgr you need to share each filesystem separately. Maybe someday
that will change.


   -Kyle



Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread Paul Kraus
I posted baseline stats at http://www.ilk.org/~ppk/Geek/

baseline test was 1 thread, 3 GiB file, 64KiB to 512 KiB record size

480-3511-baseline.xls is an iozone output file

iostat-baseline.txt is the iostat output for the device in use (annotated)

I also noted an odd behavior yesterday and have not had a chance to
better qualify it. I was testing various combinations of vdev
quantities and mirror quantities.

As I changed the number of vdevs (stripes) from 1 through 8 (all
backed by partitions on the same logical disk on the 3511) there was
no real change in sequential write, random write, or random read
performance. Sequential read performance did show a drop from 216
MiB/sec at 1 vdev to 180 MiB/sec. at 8 vdevs. This was about as
expected.

As I changed the number of mirror components, things got interesting.
Keep in mind that I only have one 3511 for testing right now; I had to
use partitions from two other production 3511's to get three mirror
components on different arrays. As expected, as I went from 1 to 2 to
3 mirror components the write performance did not change, but the read
performance was interesting... see below:

read performance
mirrors  sequential  random
1  174 MiB/sec.  23 MiB/sec.
2  229 MiB/sec.  30 MiB/sec.
3  223 MiB/sec.  125 MiB/sec.

What the heck happened here? Going from 1 to 2 mirrors saw a large increase in
sequential read performance, and going from 2 to 3 mirrors showed a HUGE
increase in random read performance. It feels like the behavior of
the zfs code changed between 2 and 3 mirrors for the random read data.

Now to investigate further, I tried multiple mirror components on the
same array (my test 3511), not that you would do this in production,
but I was curious what would happen. In this case the throughput
degraded across the board as I added mirror components, as one would
expect. In the random read case the array was delivering less overall
performance than it was when it was one part of the earlier test (16
MiB/sec. combined vs. 1/3 of 125 MiB/sec.) See sheet 7 of
http://www.ilk.org/~ppk/Geek/throughput-summary.ods for these test
results. Sheet 8 is the last test I did last night, using the NRAID
logical disk type to try to get the 3511 to pass a disk through to
zfs, but get the advantage of the cache on the 3511. I'm not sure what
to read into those numbers.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, Lunacon 2010 (http://www.lunacon.org/)
- Technical Advisor, RPI Players


Re: [zfs-discuss] heads-up: dedup=fletcher4,verify was broken

2009-11-25 Thread Karl Rossing
When will SXCE 129 be released since 128 was passed over? There used to 
be a release calendar on opensolaris.org but I can't find it anymore.



Jeff Bonwick wrote:

And, for the record, this is my fault.  There is an aspect of endianness
that I simply hadn't thought of.  When I have a little more time I will
blog about the whole thing, because there are many useful lessons here.

Thank you, Matt, for all your help with this.  And my apologies to
everyone else for the disruption.

Jeff

On Mon, Nov 23, 2009 at 09:15:48PM -0800, Matthew Ahrens wrote:
  
We discovered another, more fundamental problem with 
dedup=fletcher4,verify. I've just putback the fix for:


6904243 zpool scrub/resilver doesn't work with cross-endian 
dedup=fletcher4,verify blocks


The same instructions as below apply, but in addition, the 
dedup=fletcher4,verify functionality has been removed.  We will investigate 
whether it's possible to fix these issues and re-enable this functionality.


--matt


Matthew Ahrens wrote:

If you did not do "zfs set dedup=fletcher4,verify <fs>" (which is
available in build 128 and nightly bits since then), you can ignore this
message.


We have changed the on-disk format of the pool when using 
dedup=fletcher4,verify with the integration of:


   6903705 dedup=fletcher4,verify doesn't byteswap correctly, has lots 
of hash collisions


This is not the default dedup setting; pools that only used zfs set 
dedup=on (or =sha256, or =verify, or =sha256,verify) are unaffected.


Before installing bits with this fix, you will need to destroy any 
filesystems that have had dedup=fletcher4,verify set on them.  You can 
preserve your existing data by running:


   zfs set dedup=<any other setting> <old fs>
   zfs snapshot -r <old fs>@snap
   zfs create <new fs>
   zfs send -R <old fs>@snap | zfs recv -d <new fs>
   zfs destroy -r <old fs>

Simply changing the setting from dedup=fletcher4,verify to another 
setting is not sufficient, as this does not modify existing data.


You can verify that your pool isn't using dedup=fletcher4,verify by running
   zdb -D <pool> | grep DDT-fletcher4
If there are no matches, your pool is not using dedup=fletcher4,verify, 
and it is safe to install bits with this fix.


Build 128 will be respun to include this fix.

Sorry for the inconvenience,

-- team zfs

  







Re: [zfs-discuss] zfs-raidz - simulate disk failure

2009-11-25 Thread Richard Elling

On Nov 24, 2009, at 2:51 PM, Daniel Carosone wrote:

Those are great, but they're about testing the zfs software.   
There's a small amount of overlap, in that these injections include  
trying to simulate the hoped-for system response (e.g, EIO) to  
various physical scenarios, so it's worth looking at for scenario  
suggestions.


However, for most of us, we generally rely on Sun's (generally  
acknowledged as excellent) testing of the software stack.


I suspect the OP is more interested in verifying on his own  
hardware, that physical events and problems will be connected to the  
software fault injection test scenarios. The rest of us running on  
random commodity hardware have largely the same interest, because  
Sun hasn't qualified the hardware parts of the stack as well. We've  
taken on that responsibility ourselves (both individually, and as a  
community by sharing findings).


Agree 110%.


For example, for the various kinds of failures that might happen:
* Does my particular drive/controller/chipset/bios/etc combination  
notice the problem and result in the appropriate error from the  
driver upwards?
* How quickly does it notice? Do I have to wait for some long  
timeout or other retry cycle, and is that a problem for my usage?
* Does the rest of the system keep working to allow zfs to recover/ 
react, or is there some kind of follow-on failure (bus hangs/resets,  
etc) that will have wider impact?


Yanking disk controller and/or power cables is an easy and obvious  
test.  Testing scenarios that involve things like disk firmware  
behaviour in response to bad reads is harder - though apparently  
yelling at them might be worthwhile :-)


The problem is that yanking a disk tests the failure mode of yanking a  
disk.
If this is the sort of failure you expect to see, then perhaps you  
should look
at a mechanical solution. If you wish to test the failure modes you  
are likely
to see, then you need a more sophisticated test rig that will emulate  
a device

and inject the sorts of faults you expect.

Finding ways to dial up the load up your psu (or drop voltage/limit  
current to a specific device with an inline filter) might be an  
idea, since overloaded power supplies seem to be implicated in  
various people's reports of trouble.  Finding ways to generate EMF  
or cosmic rays to induce other kinds of failure is left as an  
exercise.


Many parts of the stack have software fault injection capabilities.  Whether
you do this with something like zinject or the wansimulator, the principle is
the same.  For example, you could easily add wansimulator to an iSCSI
rig to inject packet corruption in the network. You can also roll your own with
DTrace, which allows you to change the return values of any function.
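As a concrete illustration (an untested sketch from memory; zinject ships with
onnv and its options vary by build, and the pool and device names below are just
placeholders):

zinject -d c1t0d0 -e io -f 10 tank     # fail ~10% of I/Os to that vdev with EIO
zinject                                # list active injection handlers
zinject -c all                         # clear them again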

The COMSTAR project has a test suite that could be leveraged, but it does
not appear to be explicitly designed to perform system tests.  I'm reasonably
confident that the driver teams have test code, too, but I would also expect
them to be oriented towards unit testing.  A quick search will turn up many
fault injection software programs geared towards unit testing.

Finally, there are companies that provide system-level test services.
 -- richard



Re: [zfs-discuss] sharemgr

2009-11-25 Thread dick hoogendijk
On Wed, 2009-11-25 at 10:00 -0500, Kyle McDonald wrote:

 To each their own.
[cut the rest of your reply]

In general: I stand corrected. I was rude.




Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread William D. Hathaway
If you are using (3) 3511's, then isn't it possible that your 3 GB workload
will be largely or entirely served out of RAID controller cache?

Also, I had a question about your production backups (millions of small files):
do you have atime=off set for the filesystems?  That might be helpful.
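(For reference, that's a one-liner per filesystem -- "tank/backups" is just a
placeholder name here:

zfs set atime=off tank/backups

and it only changes behavior going forward; existing data is untouched.)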
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] heads-up: dedup=fletcher4,verify was broken

2009-11-25 Thread Bruno Sousa
Maybe 11/30/2009?

According to
http://hub.opensolaris.org/bin/view/Community+Group+on/schedule, we have
onnv_129  11/23/2009  11/30/2009

But... as far as I know, those release dates are on a best-effort basis.

Bruno

Karl Rossing wrote:
 When will SXCE 129 be released since 128 was passed over? There used
 to be a release calendar on opensolaris.org but I can't find it anymore.


 Jeff Bonwick wrote:
 And, for the record, this is my fault.  There is an aspect of endianness
 that I simply hadn't thought of.  When I have a little more time I will
 blog about the whole thing, because there are many useful lessons here.

 Thank you, Matt, for all your help with this.  And my apologies to
 everyone else for the disruption.

 Jeff

 On Mon, Nov 23, 2009 at 09:15:48PM -0800, Matthew Ahrens wrote:
  
 We discovered another, more fundamental problem with
 dedup=fletcher4,verify. I've just putback the fix for:

 6904243 zpool scrub/resilver doesn't work with cross-endian
 dedup=fletcher4,verify blocks

 The same instructions as below apply, but in addition, the
 dedup=fletcher4,verify functionality has been removed.  We will
 investigate whether it's possible to fix these issues and re-enable
 this functionality.

 --matt


 Matthew Ahrens wrote:

 If you did not do "zfs set dedup=fletcher4,verify <fs>" (which is
 available in build 128 and nightly bits since then), you can ignore
 this message.

 We have changed the on-disk format of the pool when using
 dedup=fletcher4,verify with the integration of:

6903705 dedup=fletcher4,verify doesn't byteswap correctly, has
 lots of hash collisions

 This is not the default dedup setting; pools that only used zfs
 set dedup=on (or =sha256, or =verify, or =sha256,verify) are
 unaffected.

 Before installing bits with this fix, you will need to destroy any
 filesystems that have had dedup=fletcher4,verify set on them.  You
 can preserve your existing data by running:

    zfs set dedup=<any other setting> <old fs>
    zfs snapshot -r <old fs>@snap
    zfs create <new fs>
    zfs send -R <old fs>@snap | zfs recv -d <new fs>
    zfs destroy -r <old fs>

 Simply changing the setting from dedup=fletcher4,verify to another
 setting is not sufficient, as this does not modify existing data.

 You can verify that your pool isn't using dedup=fletcher4,verify by
 running
    zdb -D <pool> | grep DDT-fletcher4
 If there are no matches, your pool is not using
 dedup=fletcher4,verify, and it is safe to install bits with this fix.

 Build 128 will be respun to include this fix.

 Sorry for the inconvenience,

 -- team zfs

   









Re: [zfs-discuss] (home NAS) zfs and spinning down of drives

2009-11-25 Thread R.G. Keen
Jim Sez:
 Like many others, I've come close to making a home
 NAS server based on 
 ZFS and OpenSolaris. While this is not an enterprise
 solution with high IOPS 
 expectation, but rather a low-power system for
 storing everything I have,
 I plan on cramming in some 6-10 5400RPM Green
 drives with low wattage 
 and high capacity, and possibly an SSD or two (or
 one-two spinning disks) 
 for Read/Write caching/logging.
Hey! Me  too! I'm up to buying hardware new to make it run.

Having read through the thread, I wonder whether the best solution
might not be to make a minimal NAS-only box with a mirrored
pair (or pairs) of drives for the daily updates, and to spin this off
at intervals via cron jobs or some such to longer-term and
safer storage in a second system that's the main raidz
repository.

Sure it's more elegant to have the momentary cache and safe
repository on the same set of hardware, but for another
$200 one can get a second whole system to work as the cache
and take all the on/off cycles, then power on the main backing
store system when something from deep freeze storage is 
needed, but keeping the recent working set in the cache 
system. 

This lets you schedule (for cheap electricity) the operations
of the deep freeze backing storage, while keeping its disks
mostly off, and minimizing power cycles on the disks down
to as little as 1/day.

Elegance is nice, but there are some places where more
hardware can take its place more quickly.

Can you tell I'm at heart a hardware guy? 8-)
-- 
This message posted from opensolaris.org


[zfs-discuss] Opensolaris with J4400 - Experiences

2009-11-25 Thread Bruno Sousa
Hello !

I'm currently using an X2200 with an LSI HBA connected to a Supermicro
JBOD chassis; however, I want to have more redundancy in the JBOD.
So I have looked into the market, and into the wallet, and I think
that the Sun J4400 suits my goals nicely. However, I have some
concerns, and if anyone can give some suggestions I would truly appreciate it.
And now for my questions :

    * Will I be able to achieve multipath support if I connect the
      J4400 to 2 LSI HBAs in one server, with SATA disks, or is this only
      possible with SAS disks? This server will have OpenSolaris (any
      release, I think).
    * The CAM (StorageTek Common Array Manager): is it only for hardware
      management of the JBOD, leaving
      disk/volumes/zpools/luns/whatever_name management up to the server
      operating system, correct?
    * Can I put some readzillas/writezillas in the J4400 along with SATA
      disks, and if so will I have any benefit, or should I place
      those *zillas directly into the server's disk tray?
    * Does anyone have experience with those JBODs? If so, are they
      generally solid/reliable?
    * The server will probably be a Sun x44xx series, with 32 GB RAM, but
      for the best possible performance, should I invest in more and
      more spindles, or a couple fewer spindles plus some readzillas?
      This system will be mainly used to export some volumes over iSCSI
      to a Windows 2003 fileserver, and to hold some NFS shares.


Thank you for all your time,
Bruno




Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread Mike Gerdts
On Wed, Nov 25, 2009 at 7:54 AM, Paul Kraus pk1...@gmail.com wrote:
 You're peaking at 658 256KB random IOPS for the 3511, or ~66
 IOPS per drive.  Since ZFS will max out at 128KB per I/O, the disks
 see something more than 66 IOPS each.  The IOPS data from
 iostat would be a better metric to observe than bandwidth.  These
 drives are good for about 80 random IOPS each, so you may be
 close to disk saturation.  The iostat data for IOPS and svc_t will
 confirm.

 But ... if I am saturating the 3511 with one thread, then why do I get
 many times that performance with multiple threads ?

I'm having troubles making sense of the iostat data (I can't tell how
many threads at any given point), but I do see lots of times where
asvc_t * reads is in the range 850 ms to 950 ms.  That is, this is as
fast as a single threaded app with a little bit of think time can
issue reads (100 reads * 9 ms svc_t + 100 reads * 1 ms think_time = 1
sec).  The %busy shows that 90+% of the time there is an I/O in flight
(100 reads * 9ms = 900/1000 = 90%).  However, %busy isn't aware of how
many I/O's could be in flight simultaneously.

When you fire up more threads, you are able to have more I/O's in
flight concurrently.  I don't believe that the I/O's per drive is
really a limiting factor at the single threaded case, as the spec
sheet for the 3511 says that it has 1 GB of cache per controller.
Your working set is small enough that it is somewhat likely that many
of those random reads will be served from cache.  A dtrace analysis of
just how random the reads are would be interesting.  I think that
hotspot.d from the DTrace Toolkit would be a good starting place.
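A quick-and-dirty approximation, if you don't want to pull in the whole toolkit
(an untested one-liner; it aggregates the block offsets of physical I/Os per
device):

dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_blkno); }'

A tight distribution means the reads are hitting a narrow region (and likely the
array cache); a flat spread across the whole LBA range means they really are
random.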

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread Richard Elling

more below...

On Nov 25, 2009, at 5:54 AM, Paul Kraus wrote:


Richard,
   First, thank you for the detailed reply ... (comments in line  
below)


On Tue, Nov 24, 2009 at 6:31 PM, Richard Elling
richard.ell...@gmail.com wrote:

more below...

On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote:


On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
richard.ell...@gmail.com wrote:


Try disabling prefetch.


Just tried it... no change in random read (still 17-18 MB/sec for a
single thread), but sequential read performance dropped from about  
200

MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
arcstat.pl shows that the vast majority (95%) of reads are missing
the cache.


hmmm... more testing needed. The question is whether the low
I/O rate is because of zfs itself, or the application? Disabling  
prefetch

will expose the application, because zfs is not creating additional
and perhaps unnecessary read I/O.


The values reported by iozone are in pretty close agreement with what
we are seeing with iostat during the test runs. Compression is off on
zfs (the iozone test data compresses very well and yields bogus
results). I am looking for a good alternative to iozone for random
testing, I did put together a crude script to spawn many dd processes
accessing the block device itself, each with a different seek over the
range of the disk and saw results much greater than the iozone single
threaded random performance.


filebench is usually bundled in /usr/benchmarks or as a pkg.
vdbench is easy to use and very portable, www.vdbench.org
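For what it's worth, a minimal vdbench parameter file for the kind of run being
discussed might look something like this (from memory, untested; the lun path,
transfer size and thread counts are just examples to adapt):

sd=sd1,lun=/dev/rdsk/c7t0d1s0
wd=wd1,sd=sd1,xfersize=256k,rdpct=100,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=60,interval=5,forthreads=(1,4,16)

Run it with something like ./vdbench -f <that file>; the forthreads list makes
it easy to see how the random-read numbers scale as concurrency goes up.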


Your data which shows the sequential write, random write, and
sequential read driving actv to 35 is because prefetching is enabled
for the read.  We expect the writes to drive to 35 with a sustained
write workload of any flavor.


Understood. I tried tuning the queue size to 50 and observed that the
actv went to 50 (with very little difference in performance), so
returned it to the default of 35.


Yep, bottleneck is on the back end (physical HDDs).  For arrays with lots
of HDDs, this queue can be deeper, but the 3500 series is way too
small to see this.  If SSDs are used on the back end, then you can
revisit this.

From the data, it does look like the random read tests are converging
on the media capabilities of the disks in the array.  For the array you
can see the read-modify-write penalty of RAID-5 as well as the
caching and prefetching of reads.

Note: the physical I/Os are 128 KB, regardless of the iozone size
setting.  This is expected, since 128 KB is the default recordsize
limit for ZFS.


The random read (with cache misses)
will stall the application, so it takes a lot of threads (16?) to  
keep

35 concurrent I/Os in the pipeline without prefetching.  The ZFS
prefetching algorithm is intelligent so it actually complicates the
interpretation of the data.


What bothers me is that iostat is showing the 'disk' device as
not being saturated during the random read test. I'll post iostat
output that I captured yesterday to http://www.ilk.org/~ppk/Geek/ You
can clearly see the various test phases (sequential write, rewrite,
sequential read, reread, random read, then random write).


Is this a single thread?  Usually this means that you aren't creating
enough load. ZFS won't be prefetching (as much) for a random
read workload, so iostat will expose client bottlenecks.


You're peaking at 658 256KB random IOPS for the 3511, or ~66
IOPS per drive.  Since ZFS will max out at 128KB per I/O, the disks
see something more than 66 IOPS each.  The IOPS data from
iostat would be a better metric to observe than bandwidth.  These
drives are good for about 80 random IOPS each, so you may be
close to disk saturation.  The iostat data for IOPS and svc_t will
confirm.


But ... if I am saturating the 3511 with one thread, then why do I get
many times that performance with multiple threads ?


The T2000 data (sheet 3) shows pretty consistently around
90 256KB IOPS per drive. Like the 3511 case, this is perhaps 20%
less than I would expect, perhaps due to the measurement.


I ran the T2000 test to see if 10U8 behaved better and to make sure I
wasn't seeing an oddity of the 480 / 3511 case. I wanted to see if the
random read behavior was similar, and it was (in relative terms).


Also, the 3511 RAID-5 configuration will perform random reads at
around 1/2 IOPS capacity if the partition offset is 34.  This was the
default long ago.  The new default is 256.


Our 3511's have been running 421F (latest) for a long time :-) We are
religious about keeping all the 3511 FW current and matched.



The reason is that with
a 34 block offset, you are almost guaranteed that a larger I/O will
stride 2 disks.  You won't notice this as easily with a single  
thread,

but it will be measurable with more threads. Double check the
offset with prtvtoc or format.


How do I check offset ... format - verify from one of the  

Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread Richard Elling

more below...

On Nov 25, 2009, at 7:10 AM, Paul Kraus wrote:


I posted baseline stats at http://www.ilk.org/~ppk/Geek/

baseline test was 1 thread, 3 GiB file, 64KiB to 512 KiB record size

480-3511-baseline.xls is an iozone output file

iostat-baseline.txt is the iostat output for the device in use  
(annotated)


I also noted an odd behavior yesterday and have not had a chance to
better qualify it. I was testing various combinations of vdev
quantities and mirror quantities.

As I changed the number of vdevs (stripes) from 1 through 8 (all
backed by partitions on the same logical disk on the 3511) there was
no real change in sequential write, random write, or random read
performance. Sequential read performance did show a drop from 216
MiB/sec at 1 vdev to 180 MiB/sec. at 8 vdevs. This was about as
expected.

As I changed the number of mirror components, things got interesting.
Keep in mind that I only have one 3511 for testing right now, I had to
use partitions from two other production 3511's to get three mirror
components on different arrays. As expected, as I went from 1 to 2 to
3 mirror components the write performance did not change, but the read
performance was interesting... see below:

read performance
mirrors  sequential  random
1  174 MiB/sec.  23 MiB/sec.
2  229 MiB/sec.  30 MiB/sec.
3  223 MiB/sec.  125 MiB/sec.

What the heck happened here? Going from 1 to 2 mirrors saw a large increase in
sequential read performance, and going from 2 to 3 mirrors showed a HUGE
increase in random read performance. It feels like the behavior of
the zfs code changed between 2 and 3 mirrors for the random read data.


I can't explain this.  It may require a detailed understanding of the
hardware configuration to identify the potential bottleneck.

The ZFS mirroring code doesn't care how many mirrors there are, it
just goes through the list.  If the performance is not symmetrical from
all sides of the mirror, then YMMV.


Now to investigate further, I tried multiple mirror components on the
same array (my test 3511), not that you would do this in production,
but I was curious what would happen. In this case the throughput
degraded across the board as I added mirror components, as one would
expect. In the random read case the array was delivering less overall
performance than it was when it was one part of the earlier test (16
MiB/sec. combined vs. 1/3 of 125 MiB/sec.) See sheet 7 of
http://www.ilk.org/~ppk/Geek/throughput-summary.ods for these test
results. Sheet 8 is the last test I did last night, using the NRAID
logical disk type to try to get the 3511 to pass a disk through to
zfs, but get the advantage of the cache on the 3511. I'm not sure what
to read into those numbers.


I read it as the single array, as configured, with 10+1 RAID-5 can deliver
around 130 random read IOPS @ 128 KB.
 -- richard



[zfs-discuss] zfs: questions on ARC membership based on type/ordering of Reads/Writes

2009-11-25 Thread Andrew . Rutz

I am trying to understand the ARC's behavior based on different
permutations of (a)sync Reads and (a)sync Writes.

thank you, in advance


o does the data for a *sync-write* *ever* go into the ARC?
  eg, my understanding is that the data goes to the ZIL (and
  the SLOG, if present), but how does it get from the ZIL to the ZIO layer?
  eg, does it go to the ARC on its way to the ZIO ?
  o if the sync-write-data *does* go to the ARC, does it go to
the ARC *after* it is written to the ZIL's backing-store,
or does the data go to the ZIL and the ARC in parallel ?
o if a sync-write's data goes to the ARC and ZIL *in parallel*,
  then does zfs prevent an ARC-hit until the data is confirmed
  to be on the ZIL's nonvolatile media (eg, disk-platter or SLOG) ?
  or could a Read get an ARC-hit on a block *before* it's written
  to zil's backing-store?


o is the DMU where the Serialization of transactions occurs?

o if an async-Write for block-X hits the Serializer before a Read
  for block-X hits the Serializer, i am assuming the Read can
  pass the async-Write; eg, the Read is *not* pended behind the
  async-write.  however, if a Read hits the Serializer after a
  *sync*-write, then i'm assuming the Read is pended until
  the sync-write is written to the ZIL's nonvolatile media.
  o if a Read passes an async-write, then i'm assuming the Read
can be satisfied by either the arc, l2arc, or disk.

o it's stated that the L2ARC is for random-reads.  however, there's
  nothing to prevent the L2ARC from containing blocks derived from
  *sequential*-reads, right ?   also, blocks from async-writes can
  also live in l2arc, right?  how about sync-writes ?

o is the l2arc literally simply a *larger* ARC?  eg, does the l2arc
  obey the normal cache property where everything that is in the L1$
  (eg, ARC) is also in the L2$ (eg, l2arc)?  (I have a feeling that
  the set-theoretic intersection of ARC and L2ARC is empty, for some
  reason.)
  o does the l2arc use the ARC algorithm (as the name suggests) ?

thank you,

/andrew
Solaris RPE


Re: [zfs-discuss] proposal partial/relative paths for zfs(1)

2009-11-25 Thread Mike Gerdts
Is there still any interest in this?  I've done a bit of hacking (then
searched for this thread - I picked -P instead of -c)...

$ zfs get -P compression,dedup /var
NAMEPROPERTY VALUE  SOURCE
rpool/ROOT/zfstest  compression  on inherited from rpool/ROOT
rpool/ROOT/zfstest  dedupoffdefault

$ pfexec zfs snapshot -P @now
Creating snapshot rpool/export/h...@now

Of course create/mkdir would make it into the eventual implementation
as well.  For those missing this thread in their mailboxes, the
conversation is archived at
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-July/019762.html.


Mike
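(In the meantime, a rough userland approximation of the idea is possible with a
tiny wrapper, since zfs list accepts a mounted path -- an untested sketch, and
the function name is made up:

zmkdir() {
        parent=$(zfs list -H -o name "$PWD") || return 1
        zfs create "$parent/$1"
}

It obviously handles none of the ambiguous cases Darren raises below.)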


On Thu, Jul 10, 2008 at 4:42 AM, Darren J Moffat darren.mof...@sun.com wrote:
 I regularly create new zfs filesystems or snapshots and I find it
 annoying that I have to type the full dataset name in all of those cases.

 I propose we allow zfs(1) to infer the part of the dataset name upto the
 current working directory.  For example:

 Today:

 $ zfs create cube/builds/darrenm/bugs/6724478

 With this proposal:

 $ pwd
 /cube/builds/darrenm/bugs
 $ zfs create 6724478

 Both of these would result in a new dataset cube/builds/darrenm/6724478

 This will need some careful though about how to deal with cases like this:

 $ pwd
 /cube/builds/
 $ zfs create 6724478/test

 What should that do ? should it create cube/builds/6724478 and
 cube/builds/6724478/test ?  Or should it fail ?  -p already provides
 some capbilities in this area.

 Maybe the easiest way out of the ambiquity is to add a flag to zfs
 create for the partial dataset name eg:

 $ pwd
 /cube/builds/darrenm/bugs
 $ zfs create -c 6724478

 Why -c ?  -c for current directory  -p partial is already taken to
 mean create all non existing parents and -r relative is already used
 consistently as recurse in other zfs(1) commands (as well as lots of
 other places).

 Alternately:

 $ pwd
 /cube/builds/darrenm/bugs
 $ zfs mkdir 6724478

 Which would act like mkdir does (including allowing a -p and -m flag
 with the same meaning as mkdir(1)) but creates datasets instead of
 directories.

 Thoughts ?  Is this useful for anyone else ?  My above examples are some
 of the shorter dataset names I use, ones in my home directory can be
 even deeper.

 --
 Darren J Moffat




-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Best practices for zpools on zfs

2009-11-25 Thread Peter Jeremy
On 2009-Nov-24 14:07:06 -0600, Mike Gerdts mger...@gmail.com wrote:
On Tue, Nov 24, 2009 at 1:39 PM, Richard Elling
richard.ell...@gmail.com wrote:
 Also, the performance of /dev/*random is not very good.  So prestaging
 lots of random data will be particularly challenging.

This depends on the random number generation algorithm used in the
kernel.  I get 50MB/sec out of FreeBSD on 3.2GHz P4 (using Yarrow).
In any case, you don't need crypto-grade random numbers, just data
that is different and uncompressible - there are lots of relatively
simple RNGs that can deliver this with far greater speed.

I was thinking that a bignum library such as libgmp could be handy to
allow easy bit shifting of large amounts of data.  That is, fill a 128
KB buffer with random data then do bitwise rotations for each
successive use of the buffer.  Unless my math is wrong, it should
allow 128 KB of random data to be write 128 GB of data with very
little deduplication or compression.  A much larger data set could be
generated with the use of a 128 KB linear feedback shift register...

This strikes me as much harder to use than just filling the buffer
with 8/32/64-bit random numbers from a linear congruential generator,
lagged fibonacci generator, mersenne twister or even random(3)

http://en.wikipedia.org/wiki/List_of_random_number_generators
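(One cheap trick along these lines, if you'd rather not write any code at all: a
small random key plus a block cipher over /dev/zero gives fast, incompressible,
non-repeating data. An untested sketch using Solaris encrypt(1), with throwaway
example paths and sizes:

dd if=/dev/urandom of=/tmp/aes.key bs=16 count=1
dd if=/dev/zero bs=1024k count=4096 | encrypt -a aes -k /tmp/aes.key -o /pool/fs/testfile

CBC chaining makes every 16-byte block differ, so neither compression nor dedup
should collapse it.)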

-- 
Peter Jeremy




Re: [zfs-discuss] zfs: questions on ARC membership based on type/ordering of Reads/Writes

2009-11-25 Thread Richard Elling

On Nov 25, 2009, at 11:55 AM, andrew.r...@sun.com wrote:


I am trying to understand the ARC's behavior based on different
permutations of (a)sync Reads and (a)sync Writes.

thank you, in advance


o does the data for a *sync-write* *ever* go into the ARC?


always


 eg, my understanding is that the data goes to the ZIL (and
 the SLOG, if present), but how does it get from the ZIL to the ZIO  
layer?


ZIL is effectively write-only.  It is only read when the pool is  
imported.



 eg, does it go to the ARC on its way to the ZIO ?


ARC is the cache for buffering data.


 o if the sync-write-data *does* go to the ARC, does it go to
   the ARC *after* it is written to the ZIL's backing-store,
   or does the data go to the ZIL and the ARC in parallel ?


A sync write returns when the data is written to the ZIL.
An async write returns when the data is in the ARC, and later
the unwritten contents of the ARC are pushed to the pool when
the transaction group is committed.


   o if a sync-write's data goes to the ARC and ZIL *in parallel*,
 then does zfs prevent an ARC-hit until the data is confirmed
 to be on the ZIL's nonvolatile media (eg, disk-platter or SLOG) ?
 or could a Read get an ARC-hit on a block *before* it's written
 to zil's backing-store?


In my mind, the ARC and ZIL are orthogonal.


o is the DMU where the Serialization of transactions occurs?


Serialization?


o if an async-Write for block-X hits the Serializer before a Read
 for block-X hits the Serializer, i am assuming the Read can
 pass the async-Write; eg, the Read is *not* pended behind the
 async-write.  however, if a Read hits the Serializer after a
 *sync*-write, then i'm assuming the Read is pended until
 the sync-write is written to the ZIL's nonvolatile media.
 o if a Read passes an async-write, then i'm assuming the Read
   can be satisfied by either the arc, l2arc, or disk.


I think you are asking if write order is preserved. The answer is yes.


o it's stated that the L2ARC is for random-reads.  however, there's
 nothing to prevent the L2ARC from containing blocks derived from
 *sequential*-reads, right ?   also, blocks from async-writes can
 also live in l2arc, right?  how about sync-writes ?


Blocks which are not yet committed to the pool are locked in the
ARC so they can't be evicted. Once committed, the lock is removed.


o is the l2arc literally simply a *larger* ARC?  eg, does the l2arc
 obey the normal cache property where everything that is in the L1$
 (eg, ARC) is also in the L2$ (eg, l2arc) ?  (I have a feeling that
 the set-theoretic intersection of ARC and L2ARC is empty (for some
 reason).


No. The L2ARC is not in the datapath between the ARC and media.
Further, data is not evicted from the ARC into the L2ARC. Rather,
the L2ARC is filled from data near the eviction ends of the MRU and
MFU lists. The movement of data to the L2ARC is throttled and
grouped in sequence, improving efficiency for devices which like
large writes, such as read-optimized flash.

Think of it this way. Data which is in the ARC is fed into the L2ARC.
If the data is later evicted from the ARC, it can still live in the  
L2ARC.

When the L2ARC has lower read latency than the pool's media,
then it can improve performance because the data can be read from
L2ARC instead of the pool. This fits the general definition of a cache,
but does not work the same way as multilevel CPU caches.
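(If you want to watch this happening on a live system, the arcstats kstat has a
set of l2_* counters -- names from memory, and they vary a bit between builds:

kstat -p zfs:0:arcstats | grep l2_

l2_hits/l2_misses show how often reads are being satisfied from the L2ARC, and
l2_size shows how much has been fed into it.)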


 o does the l2arc use the ARC algorithm (as the name suggests) ?


Yes, but it really isn't separate from the ARC, from a management point of
view. To fully understand it, you need to know about how the metadata
for each buffer in the ARC is managed.  This will introduce the concept
of the ghosts, and the L2ARC is a simple extension.  The comments
in the source are nicely descriptive, and you might consider reading them
through once, even if you don't dive into the code itself:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c

 -- richard



Re: [zfs-discuss] zfs-raidz - simulate disk failure

2009-11-25 Thread Daniel Carosone
 [verify on real hardware and share results]
 Agree 110%.

Good :)

  Yanking disk controller and/or power cables is an
  easy and obvious test.

 The problem is that yanking a disk tests the failure
 mode of yanking a disk.

Yes, but the point is that it's a cheap and easy test, so you might as well do 
it -- just beware of what it does, and most importantly does not, tell you. 
It's a valid scenario to test regardless, you want to be sure that you can yank 
a disk to replace it, without a bus hang or other hotplug problem on your 
hardware.

  Testing scenarios that involve things like
  disk firmware behaviour in response to 
  bad reads is harder -

 If you wish to test the failure modes you  
 are likely to see, then you need a more 
 sophisticated test rig that will emulate  
 a device and inject the sorts of faults
 you expect.

This is one reason I like to keep faulty disks! :)
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] zfs-raidz - simulate disk failure

2009-11-25 Thread Richard Elling

On Nov 25, 2009, at 4:43 PM, Daniel Carosone wrote:


[verify on real hardware and share results]

Agree 110%.


Good :)


Yanking disk controller and/or power cables is an
easy and obvious test.



The problem is that yanking a disk tests the failure
mode of yanking a disk.


Yes, but the point is that it's a cheap and easy test, so you might  
as well do it -- just beware of what it does, and most importantly  
does not, tell you. It's a valid scenario to test regardless, you  
want to be sure that you can yank a disk to replace it, without a  
bus hang or other hotplug problem on your hardware.


The next problem is that although a spec might say that hot-plugging
works, that doesn't mean the implementers support it.  To wit, there are
well known SATA controllers that do not support hot plug.  So what
good is the test if the hardware/firmware is known to not support it?
Speaking practically, do you evaluate your chipset and disks for hotplug
support before you buy?


Testing scenarios that involve things like
disk firmware behaviour in response to
bad reads is harder -



If you wish to test the failure modes you
are likely to see, then you need a more
sophisticated test rig that will emulate
a device and inject the sorts of faults
you expect.


This is one reason I like to keep faulty disks! :)


Me too.  I still have a SATA drive that breaks POST for every mobo
I've come across.  Wanna try hot plug with it? :-)
 -- richard



Re: [zfs-discuss] zfs-raidz - simulate disk failure

2009-11-25 Thread Eric D. Mudama

On Wed, Nov 25 at 16:43, Daniel Carosone wrote:

The problem is that yanking a disk tests the failure
mode of yanking a disk.


Yes, but the point is that it's a cheap and easy test, so you might
as well do it -- just beware of what it does, and most importantly
does not, tell you. It's a valid scenario to test regardless, you
want to be sure that you can yank a disk to replace it, without a
bus hang or other hotplug problem on your hardware.


Agreed.  It's also a very effective way of preventing your drive from
responding to commands, to test how the system behaves when a drive
stops responding.  Some significant percentage of device failures will
look similar.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128

2009-11-25 Thread Daniel Carosone
 So we also need a txg dirty or similar
 property to be exposed from the kernel.

Or not..

if you find this condition, defer, but check again in a minute (really, after a 
full txg_interval has passed) rather than on the next scheduled snapshot.

on that next check, if the txg has advanced again, snapshot.  if not, defer 
until the next scheduled snapshot as usual.  Yes, the txg may now be dirty this 
second time around - but it's after the snapshot was due, so these writes will 
be collected in the next snapshot.
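roughly, in cron-job form (an untested sketch, not the actual auto-snapshot
service; the pool name, state file, and the zdb-based txg probe are placeholders
for whatever interface eventually gets exposed):

  cur=$(zdb -u tank | awk '$1 == "txg" {print $3; exit}')
  last=$(cat /var/tmp/last-auto-snap-txg 2>/dev/null)
  if [ "$cur" != "$last" ]; then
      zfs snapshot -r tank@auto-$(date +%Y-%m-%d-%H%M)
      echo "$cur" > /var/tmp/last-auto-snap-txg
  fi

the two-step deferral described above would, on a match, just re-check once more
after a txg interval before giving up until the next scheduled run.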
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] zfs-raidz - simulate disk failure

2009-11-25 Thread Daniel Carosone
 Speaking practically, do you evaluate your chipset
 and disks for hotplug support before you buy?

Yes, if someone else has shared their test results previously.
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] X45xx storage vs 7xxx Unified storage

2009-11-25 Thread Miles Nordin
 et == Erik Trimble erik.trim...@sun.com writes:

et I'd still get the 7310 hardware.
et Worst case scenario is that you can blow away the AmberRoad

okay but, AIUI he was saying pricing is 6% more for half as much
physical disk.  This is also why it ``uses less energy'' while
supposedly filling the same role: fishworks clustering is based on SAS
multi-initiator, on SAS fan...uh,...fan-in?  switches, while OP's
home-rolled cluster plan was based on copying the data to another
zpool.  remember pricing is based on ``market forces'': it's not dumb,
is the opposite of dumb, but...under ``market forces'' pricing if you
are paying for clever-schemes you can't use, YHL.




Re: [zfs-discuss] X45xx storage vs 7xxx Unified storage

2009-11-25 Thread Erik Trimble

Miles Nordin wrote:

et == Erik Trimble erik.trim...@sun.com writes:



et I'd still get the 7310 hardware.
et Worst case scenario is that you can blow away the AmberRoad

okay but, AIUI he was saying pricing is 6% more for half as much
physical disk.  This is also why it ``uses less energy'' while
supposedly filling the same role: fishworks clustering is based on SAS
multi-initiator, on SAS fan...uh,...fan-in?  switches, while OP's
home-rolled cluster plan was based on copying the data to another
zpool.  remember pricing is based on ``market forces'': it's not dumb,
is the opposite of dumb, but...under ``market forces'' pricing if you
are paying for clever-schemes you can't use, YHL.
  

No, 6% LESS for the 7310 solution, vs the dual x4540 solution.

The key here is Usable disk space.  Yes, the X4540 comes with 2x the 
disk space, but having to cluster them via non-shared storage, you 
effectively eliminate that advantage.  Not to mention that expanding a 
clustered X4540 either means you have to buy 2x the required storage 
(i.e. attach another array to each x4540), or you do the exact same 
thing as with a 7310 (i.e. dual-attach an array to both).



You certainly are paying some premium for the A-R software; however, I 
was stating the worst-case scenario where he finds he can't make use of 
the A-R software. He's still left with a hardware solution that is 
superior to the dual X4540 (in my opinion).   That is, software aside, 
my opinion is that a clustered X4140 with shared J4400 chassis is a 
better idea than redundant X4540 setup.  With or without the AR 
software.  The AR software just makes the configuration of the 7310 
extremely simple, which is no small win in and of itself.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Replacing log with SSD on Sol10 u8

2009-11-25 Thread Jorgen Lundman



Interesting. Unfortunately, I can not zpool offline, nor zpool
detach, nor zpool remove the existing c6t4d0s0 device.



I thought perhaps we could boot something newer than b125 [*1] and I would be 
able to remove the slog device that is too big.


The dev-127.iso does not boot [*2] due to splashimage, so I had to edit the ISO 
to remove that for booting.


After booting with -B console=ttya, I find that it can not add the /dev/dsk 
entries for the 24 HDDs, since / is on a too-small ramdisk. Disk-full messages 
ensue. Yay!


After I have finally imported the pools, without upgrading (since I have to boot 
back to Sol 10 u8 for production), I attempt to remove the slog that is no 
longer needed:



# zpool remove zpool1 c6t4d0s0
cannot remove c6t4d0s0: pool must be upgrade to support log removal


Sigh.


Lund



[*1]
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6574286

[*2]
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6739497




--
Jorgen Lundman   | lund...@lundman.net
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)