Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Jeroen Roodhart
Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD
goes bad, you lose your whole pool. Or at least suffer data corruption.

Hmmm, I thought that in that case ZFS reverts to the regular on-disk ZIL?

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Jeroen Roodhart
The write cache is _not_ being disabled. The write cache is being marked
as non-volatile.

Of course you're right :) Please filter my postings with a sed 's/write 
cache/write cache flush/g' ;)

BTW, why is a Sun/Oracle branded product not properly respecting the NV
bit in the cache flush command? This seems remarkably broken, and leads
to the amazingly bad advice given on the wiki referenced above.

I suspect it has something to do with emulating disk semantics over PCIe.
Anyway, this had us stumped in the beginning; performance wasn't better
than when using an OCZ Vertex Turbo ;)

By the way, the URL to the reference is part of the official F20 product 
documentation (that's how we found it in the first place)...

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to destroy iscsi dataset?

2010-03-31 Thread Tonmaus
Hi,

Even though you didn't say so explicitly below (both the COMSTAR and legacy
target services are inactive), I assume that you have been using COMSTAR, right?
In that case, the questions are:

- is there still a view on the targets? (check stmfadm)
- is there still an LU mapped? (check sbdadm)
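
For example (the GUID is whatever list-lu reports on your system):

 stmfadm list-lu
 stmfadm list-view -l <LU GUID>
 sbdadm list-lu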

cheers,

Tonmaus
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Karsten Weiss
 I stand corrected.  You don't lose your pool.  You don't have corrupted
 filesystem.  But you lose whatever writes were not yet completed, so if
 those writes happen to be things like database transactions, you could have
 corrupted databases or files, or missing files if you were creating them at
 the time, and stuff like that.  AKA, data corruption.
 
 But not pool corruption, and not filesystem corruption.

Yeah, that's a big difference! :)

Of course we could not live with pool or fs corruption. However, we can live
with the fact that NFS-written data is not all on disk in case of a server
crash, even though the NFS client relies on the write guarantee of the NFS
protocol. I.e. we do not use it for DB transactions or anything like that.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Karsten Weiss
Hi Adam,

 Very interesting data. Your test is inherently
 single-threaded so I'm not surprised that the
 benefits aren't more impressive -- the flash modules
 on the F20 card are optimized more for concurrent
 IOPS than single-threaded latency.

Thanks for your reply. I'll probably test the multiple write case, too.

But frankly, at the moment I care most about the single-threaded case,
because if we put e.g. user homes on this server I think the users would be
severely disappointed if they had to wait 2m42s just to extract a rather
small 50 MB tarball. The default 7m40s without an SSD log was unacceptable,
and we were hoping that the F20 would make a big difference and bring the
performance down to acceptable runtimes. But IMHO 2m42s is still too slow,
and disabling the ZIL seems to be the only option.

Knowing that 100s of users could do this in parallel with good performance
is nice, but it does not improve the situation for the single user who only
cares about his own tar run. If there's anything else we can do/try to improve
the single-threaded case, I'm all ears.
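
For reference, the test itself is nothing fancier than timing the extraction
on an NFS client, i.e. something like (paths are just placeholders):

 cd /mnt/nfs-test
 time tar xf /var/tmp/test-50mb.tar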
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Brent Jones
On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
k.we...@science-computing.de wrote:
 Hi Adam,

 Very interesting data. Your test is inherently
 single-threaded so I'm not surprised that the
 benefits aren't more impressive -- the flash modules
 on the F20 card are optimized more for concurrent
 IOPS than single-threaded latency.

 Thanks for your reply. I'll probably test the multiple write case, too.

 But frankly at the moment I care the most about the single-threaded case
 because if we put e.g. user homes on this server I think they would be
 severely disappointed if they would have to wait 2m42s just to extract a 
 rather
 small 50 MB tarball. The default 7m40s without SSD log were unacceptable
 and we were hoping that the F20 would make a big difference and bring the
 performance down to acceptable runtimes. But IMHO 2m42s is still too slow
 and disabling the ZIL seems to be the only option.

 Knowing that 100s of users could do this in parallel with good performance
 is nice but it does not improve the situation for the single user which only
 cares for his own tar run. If there's anything else we can do/try to improve
 the single-threaded case I'm all ears.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Use something other than Open/Solaris with ZFS as an NFS server?  :)

I don't think you'll find the performance you paid for with ZFS and
Solaris at this time. I've been trying for more than a year, and
watching dozens, if not hundreds, of threads.
Getting halfway decent performance from NFS and ZFS is impossible
unless you disable the ZIL.

You'd be better off getting a NetApp

-- 
Brent Jones
br...@servuhome.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Arne Jansen
Brent Jones wrote:
 
 I don't think you'll find the performance you paid for with ZFS and
 Solaris at this time. I've been trying to more than a year, and
 watching dozens, if not hundreds of threads.
 Getting half-ways decent performance from NFS and ZFS is impossible
 unless you disable the ZIL.

A few days ago I posted to nfs-discuss with a proposal to add some
mount/share options to change the semantics of an NFS-mounted filesystem
so that they parallel those of a local filesystem.
The main point is that data gets flushed to stable storage only if the
client explicitly requests it via fsync or O_DSYNC, not implicitly
with every close().
That would give you the performance you are seeking without sacrificing
data integrity for applications that need it.

I get the impression that I'm not the only one who could be interested
in that ;)

-Arne

 
 You'd be better off getting NetApp
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Need advice on handling 192 TB of Storage on hardware raid storage

2010-03-31 Thread Dedhi Sujatmiko

Dear all,

I have a hardware-based storage array with a capacity of 192 TB, sliced
into 64 LUNs of 3 TB each.
What will be the best way to configure ZFS on this? Of course we are
not requiring the self-healing capability of ZFS. We just want the
ability to handle big file systems, and performance.


Currently we are running Solaris 10 May 2009 (Update 7), and have
configured ZFS as follows (a command sketch follows the list):

a. 1 hardware LUN (3TB) will become 1 zpool
b. 1 zpool will become 1 ZFS file system
c. 1 ZFS file system will become 1 mountpoint (obviously).
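
As a sketch, each LUN is handled roughly like this (pool and device names
below are just examples):

 zpool create pool01 c4t600A0B8000267XXXd0
 zfs set mountpoint=/data/pool01 pool01

repeated for all 64 LUNs, so 64 pools with one file system each.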

The problem we have is that when the customer runs I/O in parallel
to the 64 file systems, the kernel usage (%sys) shoots up very high, into
the 90% region, and the IOPS level degrades. It can also be seen that
during that time the storage's own front-end CPU load does not change much,
which means the bottleneck is not at the hardware storage level, but
somewhere inside the Solaris box.


Does anyone have experience with a similar setup to the one I have?
Or can anybody point me to information on what would be the best way
to deal with hardware storage of this size?


Please advise, and thanks in advance,

Dedhi
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] bit-flipping in RAM...

2010-03-31 Thread Erik Trimble

Orvar's post over in opensol-discuss has me thinking:

After reading the paper and looking at design docs, I'm wondering if
there is some facility to allow for comparing data in the ARC to its
corresponding checksum.  That is, if I've got the data I want in the
ARC, how can I be sure it's correct (and free of hardware memory
errors)?  I'd assume the way is to also store absolutely all the
checksums for all blocks/metadata being read/written in the ARC (which,
of course, means that only so much RAM corruption can be compensated
for), and do a validation every time that block is used/written from
the ARC.  You'd likely have to do constant metadata consistency
checking, and likely have to hold multiple copies of metadata in-ARC
to compensate for possible corruption.  I'm assuming that this has at
least been explored, right?


(the researchers used non-ECC RAM, so honestly, I think it's a bit 
unrealistic to expect that your car will win the Indy 500 if you put a 
Yugo engine in it) - normally, this problem is exactly what you have 
hardware ECC and memory scrubbing for at the hardware level.


I'm not saying that ZFS should do this - doing validation of in-memory
data is non-trivially expensive in performance terms, and there's only
so much you can do and still expect your machine to survive.  I mean,
I've used the old NonStop stuff, and yes, you can shoot them with a .45
and they will likely still run, but whacking them with a bazooka is
still guaranteed to make them, well, Non-NonStop.


-Erik





 Original Message 
Subject:Re: [osol-discuss] Any news about 2010.3?
Date:   Wed, 31 Mar 2010 01:06:45 PDT
From:   Orvar Korvar knatte_fnatte_tja...@yahoo.com
To: opensolaris-disc...@opensolaris.org



If you value your data, you should reconsider. But if your data is not 
important, then skip ZFS.

File system data corruption test by researcher:
http://blogs.zdnet.com/storage/?p=169

ZFS data corruption test by researchers:
http://www.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf
--
This message posted from opensolaris.org
___
opensolaris-discuss mailing list
opensolaris-disc...@opensolaris.org


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Karsten Weiss
 Nobody knows any way for me to remove my unmirrored
 log device.  Nobody knows any way for me to add a mirror to it (until

Since snv_125 you can remove log devices. See
http://bugs.opensolaris.org/view_bug.do?bug_id=6574286

I've used this all the time during my testing and was able to remove both
mirrored and unmirrored log devices without any problems (and without
reboot). I'm using snv_134.
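
For example, it boils down to (pool/device names are placeholders from my
test setup):

 zpool remove tank c3t5d0                  # drop an unmirrored log device
 zpool add tank log mirror c3t5d0 c3t6d0   # add it back as a mirrored log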
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bit-flipping in RAM...

2010-03-31 Thread Casper . Dik


I'm not saying that ZFS should consider doing this - doing a validation 
for in-memory data is non-trivially expensive in performance terms, and 
there's only so much you can do and still expect your machine to 
survive.  I mean, I've used the old NonStop stuff, and yes, you can 
shoot them with a .45 and it likely will still run, but wacking them 
with a bazooka still is guarantied to make them, well, Non-NonStop.

If we scrub the memory anyway, why not include the check of the ZFS 
checksums which are already in memory?

OTOH, zfs gets a lot of mileage out of cheap hardware and we know what the 
limitations are when you don't use ECC; the industry must start to require 
that all chipsets support ECC.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Robert Milkowski



On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
   
Use something other than Open/Solaris with ZFS as an NFS server?  :)


I don't think you'll find the performance you paid for with ZFS and
Solaris at this time. I've been trying to more than a year, and
watching dozens, if not hundreds of threads.
Getting half-ways decent performance from NFS and ZFS is impossible
unless you disable the ZIL.

   


Well, for lots of environments disabling the ZIL is perfectly acceptable.
And frankly, the reason you get better performance out of the box with
Linux as an NFS server is that it actually behaves as if the ZIL were
disabled - so disabling the ZIL on ZFS for NFS shares is no worse than
using Linux here, or any other OS which behaves in the same manner.
Actually it makes things better: even with the ZIL disabled, a ZFS
filesystem is always consistent on disk, and you still get all the other
benefits of ZFS.
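
To illustrate the comparison, an async export on a Linux NFS server
(path and network here are just examples) looks like:

 /export/home  192.168.0.0/24(rw,async,no_subtree_check)

With async the server replies to writes and COMMITs before the data is on
stable storage - the same trade-off as running ZFS with the ZIL disabled.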


What would be useful, though, is to be able to easily disable the ZIL per
dataset instead of via an OS-wide switch.
This feature has already been coded and tested and awaits a formal
process to be completed in order to get integrated. It should happen
sooner rather than later.
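
(For reference, the OS-wide switch I mean is the zil_disable tunable,
e.g. in /etc/system:

 set zfs:zil_disable = 1

plus a reboot. It applies to every pool and dataset on the box, which is
exactly why a per-dataset property will be much nicer.)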



You'd be better off getting NetApp
   
Well, spend some extra money on a really fast NVRAM solution for the ZIL
and you will get a much faster ZFS environment than NetApp, and you will
still spend much less money. Not to mention all the extra flexibility
compared to NetApp.


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Robert Milkowski



Just to make sure you know ... if you disable the ZIL altogether, and
you have a power interruption, failed cpu, or kernel halt, then you're
likely to have a corrupt unusable zpool, or at least data corruption.
If that is indeed acceptable to you, go nuts.  ;-)

I believe that the above is wrong information as long as the devices
involved do flush their caches when requested to.  Zfs still writes
data in order (at the TXG level) and advances to the next transaction
group when the devices written to affirm that they have flushed their
cache.  Without the ZIL, data claimed to be synchronously written
since the previous transaction group may be entirely lost.

If the devices don't flush their caches appropriately, the ZIL is
irrelevant to pool corruption.
 

I stand corrected.  You don't lose your pool.  You don't have corrupted
filesystem.  But you lose whatever writes were not yet completed, so if
those writes happen to be things like database transactions, you could have
corrupted databases or files, or missing files if you were creating them at
the time, and stuff like that.  AKA, data corruption.

But not pool corruption, and not filesystem corruption.


   
Which is expected behavior when you break NFS requirements, as Linux
does out of the box.
Disabling the ZIL on an NFS server makes it no worse than the standard
Linux behaviour - you get decent performance at the cost of some data
possibly getting corrupted from an NFS client's point of view. But then
there are environments where that is perfectly acceptable, because you
are not running critical databases there but rather user home directories,
and ZFS currently flushes a transaction group after at most 30s, so a user
won't lose more than the last 30s of work if the NFS server suddenly
loses power.


To clarify - if the ZIL is disabled, it makes no difference at all to
pool/filesystem-level consistency.


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bit-flipping in RAM...

2010-03-31 Thread Erik Trimble

casper@sun.com wrote:
  
I'm not saying that ZFS should consider doing this - doing a validation 
for in-memory data is non-trivially expensive in performance terms, and 
there's only so much you can do and still expect your machine to 
survive.  I mean, I've used the old NonStop stuff, and yes, you can 
shoot them with a .45 and it likely will still run, but wacking them 
with a bazooka still is guarantied to make them, well, Non-NonStop.



If we scrub the memory anyway, why not include the check of the ZFS 
checksums which are already in memory?


OTOH, zfs gets a lot of mileage out of cheap hardware and we know what the 
limitations are when you don't use ECC; the industry must start to require 
that all chipsets support ECC.


Casper

Reading the paper was interesting, as it highlighted all the places
where ZFS skips validation.  There are a lot of places. In many ways,
fixing this would likely make ZFS similar to AppleTalk, whose notorious
performance (relative to Ethernet) was caused by what many called the
"Are You Sure?" design.  Double- and triple-checking absolutely
everything has its costs.


And, yes, we really should just force computer manufacturers to use ECC 
in more places (not just RAM) - as densities and data volumes increase, 
we are more likely to see errors, and without proper hardware checking, 
we're really going out on a limb here to be able to trust what the 
hardware says. And, let's face it - hardware error correction is /so/ 
much faster than doing it in software.






--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Robert Milkowski



standard ZIL:   7m40s  (ZFS default)
1x SSD ZIL:  4m07s  (Flash Accelerator F20)
2x SSD ZIL:  2m42s  (Flash Accelerator F20)
2x SSD mirrored ZIL:   3m59s  (Flash Accelerator F20)
3x SSD ZIL:  2m47s  (Flash Accelerator F20)
4x SSD ZIL:  2m57s  (Flash Accelerator F20)
disabled ZIL:   0m15s
(local extraction0m0.269s)

I was not so much interested in the absolute numbers but rather in the
relative
performance differences between the standard ZIL, the SSD ZIL and the
disabled
ZIL cases.
 

Oh, one more comment.  If you don't mirror your ZIL, and your unmirrored SSD
goes bad, you lose your whole pool.  Or at least suffer data corruption.


   
This is not true. If a ZIL device dies while the pool is imported, then
ZFS will start using a ZIL within the pool and continue to operate.


On the other hand, if your server suddenly loses power, and when you
power it up later on ZFS detects that the ZIL is broken/gone, it will
require sysadmin intervention to force the pool import, and yes, you may
possibly lose some data.


But how is that different from any other solution where your log is put on
a separate device?
Well, it is actually different: with ZFS you can still guarantee that the
pool is consistent on disk, while others generally can't, and often you
will have to run fsck to even mount a fs read/write...


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool split problem?

2010-03-31 Thread Damon Atkins
Why do we still need /etc/zfs/zpool.cache file??? 
(I could understand it was useful when zfs import was slow)

zpool import is now multi-threaded 
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191), hence a 
lot faster,  each disk contains the hostname 
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6282725) , if a 
pool contains the same hostname as the server then import it.

ie This bug should not be a problem any more 
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6737296 with a 
multi-threaded zpool import.

HA Storage should be changed to just do a zpool -h import mypool instead of
using a private zpool.cache file (-h meaning: ignore whether the pool was last
imported by a different host; and maybe a noautoimport property is needed on a
zpool so clustering software can decide to import it by hand, as it does now).

And therefore this zpool split problem would be fixed.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Simultaneous failure recovery

2010-03-31 Thread Robert Milkowski


I have a pool (on an X4540 running S10U8) in which a disk failed, and the
hot spare kicked in. That's perfect. I'm happy.

Then a second disk fails.

Now, I've replaced the first failed disk, and it's resilvered and I have my
hot spare back.

But: why hasn't it used the spare to cover the other failed drive? And
can I hotspare it manually?  I could do a straight replace, but that
isn't quite the same thing.

   


It seems like it is event-driven. Hmmm... perhaps it shouldn't be.

Anyway, you can do zpool replace and it is the same thing - why wouldn't it be?

--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Simultaneous failure recovery

2010-03-31 Thread Peter Tribble
On Tue, Mar 30, 2010 at 10:42 PM, Eric Schrock eric.schr...@oracle.com wrote:

 On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:

 I have a pool (on an X4540 running S10U8) in which a disk failed, and the
 hot spare kicked in. That's perfect. I'm happy.

 Then a second disk fails.

 Now, I've replaced the first failed disk, and it's resilvered and I have my
 hot spare back.

 But: why hasn't it used the spare to cover the other failed drive? And
 can I hotspare it manually?  I could do a straight replace, but that
 isn't quite the same thing.

 Hot spares are only activated in response to a fault received by the 
 zfs-retire FMA agent.  There is no notion that the spares should be 
 re-evaluated when they become available at a later point in time.  Certainly 
 a reasonable RFE, but not something ZFS does today.

Definitely an RFE I would like.

 You can 'zpool attach' the spare like a normal device - that's all that the 
 retire agent is doing under the hood.

So, given:

NAMESTATE READ WRITE CKSUM
images  DEGRADED 0 0 0
  raidz1DEGRADED 0 0 0
c2t0d0  FAULTED  4 0 0  too many errors
c3t0d0  ONLINE   0 0 0
c4t0d0  ONLINE   0 0 0
c5t0d0  ONLINE   0 0 0
c0t1d0  ONLINE   0 0 0
c1t1d0  ONLINE   0 0 0
c2t1d0  ONLINE   0 0 0
c3t1d0  ONLINE   0 0 0
c4t1d0  ONLINE   0 0 0
spares
  c5t7d0AVAIL

then it would be this?

zpool attach images c2t0d0 c5t7d0

which I had considered, but the man page for attach says "The
existing device cannot be part of a raidz configuration."

If I try that it fails, saying:

invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c5t7d0s0 is reserved as a hot spare for ZFS pool images.
Please see zpool(1M).

Thanks!

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Simultaneous failure recovery

2010-03-31 Thread Robert Milkowski


On Tue, Mar 30, 2010 at 10:42 PM, Eric Schrockeric.schr...@oracle.com  wrote:
   

On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:

 

I have a pool (on an X4540 running S10U8) in which a disk failed, and the
hot spare kicked in. That's perfect. I'm happy.

Then a second disk fails.

Now, I've replaced the first failed disk, and it's resilvered and I have my
hot spare back.

But: why hasn't it used the spare to cover the other failed drive? And
can I hotspare it manually?  I could do a straight replace, but that
isn't quite the same thing.
   

Hot spares are only activated in response to a fault received by the zfs-retire 
FMA agent.  There is no notion that the spares should be re-evaluated when they 
become available at a later point in time.  Certainly a reasonable RFE, but not 
something ZFS does today.
 

Definitely an RFE I would like.

   

You can 'zpool attach' the spare like a normal device - that's all that the 
retire agent is doing under the hood.
 

So, given:

 NAMESTATE READ WRITE CKSUM
 images  DEGRADED 0 0 0
   raidz1DEGRADED 0 0 0
 c2t0d0  FAULTED  4 0 0  too many errors
 c3t0d0  ONLINE   0 0 0
 c4t0d0  ONLINE   0 0 0
 c5t0d0  ONLINE   0 0 0
 c0t1d0  ONLINE   0 0 0
 c1t1d0  ONLINE   0 0 0
 c2t1d0  ONLINE   0 0 0
 c3t1d0  ONLINE   0 0 0
 c4t1d0  ONLINE   0 0 0
 spares
   c5t7d0AVAIL

then it would be this?

zpool attach images c2t0d0 c5t7d0

which I had considered, but the man page for attach says The
existing device cannot be part of a raidz configuration.

If I try that it fails, saying:
/invalid vdev specification
use '-f' to override the following errors:
dev/dsk/c5t7d0s0 is reserved as a hot spare for ZFS pool images.
Please see zpool(1M).

Thanks!

   

You need to use zpool replace.
Once you fix the failed drive and it re-synchronizes, the hot spare will
detach automatically (regardless of whether you forced it to kick in via
zpool replace or it did so due to FMA).
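
With the device names from the status output above, that would be something
along these lines (from memory, so double-check against the man page):

 zpool replace images c2t0d0 c5t7d0     # pull the hot spare in by hand

and later, once the bad drive has been physically swapped:

 zpool replace images c2t0d0            # resilver the new disk; the spare
                                        # detaches when it completes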


For more details see http://blogs.sun.com/eschrock/entry/zfs_hot_spares

--
Robert Milkowski
http://milek.blogspot.com



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Simultaneous failure recovery

2010-03-31 Thread Ian Collins

On 03/31/10 10:54 PM, Peter Tribble wrote:

On Tue, Mar 30, 2010 at 10:42 PM, Eric Schrockeric.schr...@oracle.com  wrote:
   

On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:

 

I have a pool (on an X4540 running S10U8) in which a disk failed, and the
hot spare kicked in. That's perfect. I'm happy.

Then a second disk fails.

Now, I've replaced the first failed disk, and it's resilvered and I have my
hot spare back.

But: why hasn't it used the spare to cover the other failed drive? And
can I hotspare it manually?  I could do a straight replace, but that
isn't quite the same thing.
   

Hot spares are only activated in response to a fault received by the zfs-retire 
FMA agent.  There is no notion that the spares should be re-evaluated when they 
become available at a later point in time.  Certainly a reasonable RFE, but not 
something ZFS does today.
 

Definitely an RFE I would like.

   

You can 'zpool attach' the spare like a normal device - that's all that the 
retire agent is doing under the hood.
 

So, given:

 NAMESTATE READ WRITE CKSUM
 images  DEGRADED 0 0 0
   raidz1DEGRADED 0 0 0
 c2t0d0  FAULTED  4 0 0  too many errors
 c3t0d0  ONLINE   0 0 0
 c4t0d0  ONLINE   0 0 0
 c5t0d0  ONLINE   0 0 0
 c0t1d0  ONLINE   0 0 0
 c1t1d0  ONLINE   0 0 0
 c2t1d0  ONLINE   0 0 0
 c3t1d0  ONLINE   0 0 0
 c4t1d0  ONLINE   0 0 0
 spares
   c5t7d0AVAIL

then it would be this?

zpool attach images c2t0d0 c5t7d0

which I had considered, but the man page for attach says The
existing device cannot be part of a raidz configuration.

If I try that it fails, saying:
/invalid vdev specification
use '-f' to override the following errors:
dev/dsk/c5t7d0s0 is reserved as a hot spare for ZFS pool images.
Please see zpool(1M).
   


What happens if you remove it as a spare first?

--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bit-flipping in RAM...

2010-03-31 Thread Zhu Han
ECC-enabled RAM should become very cheap quickly if the industry embraces it
in every computer. :-)

best regards,
hanzhu


On Wed, Mar 31, 2010 at 5:46 PM, Erik Trimble erik.trim...@oracle.comwrote:

 casper@sun.com wrote:



 I'm not saying that ZFS should consider doing this - doing a validation
 for in-memory data is non-trivially expensive in performance terms, and
 there's only so much you can do and still expect your machine to survive.  I
 mean, I've used the old NonStop stuff, and yes, you can shoot them with a
 .45 and it likely will still run, but wacking them with a bazooka still is
 guarantied to make them, well, Non-NonStop.



 If we scrub the memory anyway, why not include the check of the ZFS
 checksums which are already in memory?

 OTOH, zfs gets a lot of mileage out of cheap hardware and we know what the
 limitations are when you don't use ECC; the industry must start to require
 that all chipsets support ECC.

 Caspe

 Reading the paper was interesting, as it highlighted all the places where
 ZFS skips validation.  There's a lot of places. In many ways, fixing this
 would likely make ZFS similar to AppleTalk whose notorious performance
 (relative to Ethernet) was caused by what many called the Are You Sure?
 design.  Double and Triple checking absolutely everything has it's costs.

 And, yes, we really should just force computer manufacturers to use ECC in
 more places (not just RAM) - as densities and data volumes increase, we are
 more likely to see errors, and without proper hardware checking, we're
 really going out on a limb here to be able to trust what the hardware says.
 And, let's face it - hardware error correction is /so/ much faster than
 doing it in software.






 --
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA
 Timezone: US/Pacific (GMT-0800)

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Karsten Weiss
Hi  Jeroen, Adam!

 link. Switched write caching off with the following
 addition to the /kernel/drv/sd.conf file (Karsten: if
 you didn't do this already, you _really_ want to :)

Okay, I'll bite! :) A format inquiry on the F20 FMod devices returns:

# Vendor:   ATA
# Product:  MARVELL SD88SA02

So I put this in /kernel/drv/sd.conf and rebooted:

# KAW, 2010-03-31
# Set F20 FMod devices to non-volatile mode
# See 
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
sd-config-list = "ATA     MARVELL SD88SA02", "nvcache1";
nvcache1=1, 0x4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;

Now the tarball extraction test with active ZIL finishes in ~0m32s!
I've tested with a mirrored SSD log and two separate SSD log devices.
The runtime is nearly the same. Compared to the 2m42s before the
/kernel/drv/sd.conf modification this is a huge improvement. The
performance with active ZIL would be acceptable now.

But is this mode of operation *really* safe?

FWIW zilstat during the test shows this:

   N-Bytes  N-Bytes/s N-Max-Rate   B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
 0  0  0  0  0  0  0  0 
 0  0
   103907210390721039072377241637724163772416610299 
   311  0
   152249615224961522496540262454026245402624874429 
   445  0
   229295222929522292952674611267461126746112931215 
   716  0
   232127223212722321272677478467747846774784931208 
   723  0
   230347223034722303472654950465495046549504897195 
   702  0
   632632632673382467338246733824935226 
   709  0
   219832821983282198328666828866682886668288926224 
   702  0
   217217217637337663733766373376878200 
   678  0
   218541621854162185416635289663528966352896874197 
   677  0
   221804022180402218040651673665167366516736897203 
   694  0
   243698424369842436984654950465495046549504885171 
   714  0

I.e. ~900 ops/s.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
 Use something other than Open/Solaris with ZFS as an NFS server?  :)
 
 I don't think you'll find the performance you paid for with ZFS and
 Solaris at this time. I've been trying to more than a year, and
 watching dozens, if not hundreds of threads.
 Getting half-ways decent performance from NFS and ZFS is impossible
 unless you disable the ZIL.
 
 You'd be better off getting NetApp

Hah hah.  I have a Sun X4275 server exporting NFS.  We have clients on all 4
of the Gb ethers, and the Gb ethers are the bottleneck, not the disks or
filesystem.

I suggest you either enable the WriteBack cache on your HBA, or add SSDs
for the ZIL.  Performance is 5-10x higher this way than using naked disks.
But of course, not as high as it is with a disabled ZIL.
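
For the slog route, adding a mirrored pair of SSDs to an existing pool is a
one-liner (pool and device names are placeholders, of course):

 zpool add tank log mirror c1t4d0 c1t5d0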

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Jeroen Roodhart
Hi Karsten,

 But is this mode of operation *really* safe?

As far as I can tell it is. 

-The F20 uses some form of power backup that should provide power to the 
interface card long enough to get the cache onto solid state in case of power 
failure. 

-Recollecting from earlier threads here: in case the card fails (but not the
host), there should be enough data residing in memory for ZFS to safely switch
to the regular on-disk ZIL.

-According to my contacts at Sun, the F20 is a viable replacement solution for 
the X25-E. 

-Switching write cache flushing off seems to be officially recommended on the
Sun performance wiki (translated to more sane defaults).

If I'm wrong here I'd like to know too, 'cause this is probably the way we're 
taking it in production.
 :)

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
  Nobody knows any way for me to remove my unmirrored
  log device.  Nobody knows any way for me to add a mirror to it (until
 
 Since snv_125 you can remove log devices. See
 http://bugs.opensolaris.org/view_bug.do?bug_id=6574286
 
 I've used this all the time during my testing and was able to remove
 both
 mirrored and unmirrored log devices without any problems (and without
 reboot). I'm using snv_134.

I'm aware.  OpenSolaris can remove log devices.  Solaris cannot.  Yet.  But if
you want your server in production, you can get a support contract for
Solaris.  For OpenSolaris you cannot.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Jeroen Roodhart
Hi Richard,

For this case, what is the average latency to the F20?

I'm not giving the average since I only performed a single run here (still need
to get autopilot set up :) ). However, here is a graph of iostat IOPS/svc_t
sampled in 10-sec intervals during a run of untarring an eclipse tarball 40
times from two hosts. I'm using 1 vmod here.

http://www.science.uva.nl/~jeroen/zil_1slog_e1000_iostat_iops_svc_t_10sec_interval.pdf

Maximum svc_t is around 2.7ms averaged over 10s.
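
(For completeness: the samples were gathered with plain iostat, something
along the lines of

 iostat -xnz 10

with the IOPS and svc_t columns for the log device pulled out of that.)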

Still wondering why this won't scale out though. We don't seem to be CPU bound, 
unless ZFS limits itself to max 30% cputime?

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Simultaneous failure recovery

2010-03-31 Thread Eric Schrock

On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:

 I have a pool (on an X4540 running S10U8) in which a disk failed, and the
 hot spare kicked in. That's perfect. I'm happy.
 
 Then a second disk fails.
 
 Now, I've replaced the first failed disk, and it's resilvered and I have my
 hot spare back.
 
 But: why hasn't it used the spare to cover the other failed drive? And
 can I hotspare it manually?  I could do a straight replace, but that
 isn't quite the same thing.

Hot spares are only activated in response to a fault received by the zfs-retire 
FMA agent.  There is no notion that the spares should be re-evaluated when they 
become available at a later point in time.  Certainly a reasonable RFE, but not 
something ZFS does today.

You can 'zpool attach' the spare like a normal device - that's all that the 
retire agent is doing under the hood.

Hope that helps,

- Eric

--
Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Neil Perrin

On 03/30/10 20:00, Bob Friesenhahn wrote:

On Tue, 30 Mar 2010, Edward Ned Harvey wrote:


But the speedup of disabling the ZIL altogether is
appealing (and would
probably be acceptable in this environment).


Just to make sure you know ... if you disable the ZIL altogether, and 
you
have a power interruption, failed cpu, or kernel halt, then you're 
likely to

have a corrupt unusable zpool, or at least data corruption.  If that is
indeed acceptable to you, go nuts.  ;-)


I believe that the above is wrong information as long as the devices 
involved do flush their caches when requested to.  Zfs still writes 
data in order (at the TXG level) and advances to the next transaction 
group when the devices written to affirm that they have flushed their 
cache.  Without the ZIL, data claimed to be synchronously written 
since the previous transaction group may be entirely lost.


If the devices don't flush their caches appropriately, the ZIL is 
irrelevant to pool corruption.


Bob

Yes Bob is correct - that is exactly how it works.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bit-flipping in RAM...

2010-03-31 Thread Darren J Moffat



On 31/03/2010 10:27, Erik Trimble wrote:

Orvar's post over in opensol-discuss has me thinking:

After reading the paper and looking at design docs, I'm wondering if
there is some facility to allow for comparing data in the ARC to it's
corresponding checksum. That is, if I've got the data I want in the ARC,
how can I be sure it's correct (and free of hardware memory errors)? I'd
assume the way is to also store absolutely all the checksums for all
blocks/metadatas being read/written in the ARC (which, of course, means
that only so much RAM corruption can be compensated for), and do a
validation when that every time that block is used/written from the ARC.
You'd likely have to do constant metadata consistency checking, and
likely have to hold multiple copies of metadata in-ARC to compensate for
possible corruption. I'm assuming that this has at least been explored,
right?


A subset of this is already done. The ARC keeps its own in memory 
checksum (because some buffers in the ARC are not yet on stable storage 
so don't have a block pointer checksum yet).


http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c

arc_buf_freeze()
arc_buf_thaw()
arc_cksum_verify()
arc_cksum_compute()

It isn't done on every access, but it can detect in-memory corruption -
I've seen it happen on several occasions, though always due to errors in my
code, not bad physical memory.


Doing it more frequently could cause a significant performance problem.

--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] can't destroy snapshot

2010-03-31 Thread Charles Hedrick
We're getting the notorious cannot destroy ... dataset already exists. I've 
seen a number of reports of this, but none of the reports seem to get any 
response. Fortunately this is a backup system, so I can recreate the pool, but 
it's going to take me several days to get all the data back. Is there any known 
workaround?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] can't destroy snapshot

2010-03-31 Thread Charles Hedrick
Incidentally, this is on Solaris 10, but I've seen identical reports from 
Opensolaris.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] can't destroy snapshot

2010-03-31 Thread Bruno Sousa
On 31-3-2010 14:52, Charles Hedrick wrote:
 Incidentally, this is on Solaris 10, but I've seen identical reports from 
 Opensolaris.
   
Probably you need to delete any existing view over the LUN you want to
destroy.

Example :

 stmfadm list-lu
LU Name: 600144F0B67340004BB31F060001


stmfadm list-view -l 600144F0B67340004BB323FF0003
View Entry: 0
Host group   : TEST
Target group : All
LUN  : 1

stmfadm remove-view -l 600144F0B67340004BB323FF0003

After this, I think you can zfs destroy the zfs volume.
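
If the destroy still complains that the dataset is busy, the LU itself
probably still exists; deleting it (same GUID as above) should free things up:

 sbdadm delete-lu 600144F0B67340004BB323FF0003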

Bruno




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Cannot replace a failed device

2010-03-31 Thread huangruofei
I had a drive fail and replaced it with a new drive. During the resilvering
process it reported too many errors, and the resilver failed.
Now the pool is online, but it cannot accept any ZFS commands that change the
pool's state. I can list directories, but mv, cp and rm -f don't work.
What can I do? I need those data files.


r...@opensolaris2:~# cat /etc/release 
   OpenSolaris 2008.11 snv_101b_rc2 X86
   Copyright 2008 Sun Microsystems, Inc.  All Rights Reserved.
Use is subject to license terms.
   Assembled 19 November 2008

r...@opensolaris2:~# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
bfpool  4.98T  3.80T  1.18T76%  ONLINE  -
rpool 74G  5.76G  68.2G 7%  ONLINE  -

r...@opensolaris2:~# zpool status -v bfpool
  pool: bfpool
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: resilver in progress for 0h3m, 0.00% done, 1610h45m to go
config:

NAMESTATE READ WRITE CKSUM
bfpool  ONLINE  60 0 0
  c7d0p0ONLINE   0 0 0
  c6d0p0ONLINE   0 0 0
  replacing ONLINE 125 7.19K 0
c5d1p0/old  UNAVAIL  0 7 0  corrupted data
c5d1p0  UNAVAIL  0 7.43K 0  corrupted data
  c4d1p0ONLINE   0 0 0
  c4d0p0ONLINE   0 0 0

errors: Permanent errors have been detected in the following files:

 /bfpool/storm/TDExplorerPlugIn.exe


r...@opensolaris2:~# zpool history
History for 'bfpool':
2009-01-05.15:55:59 zpool create bfpool c7d0p0 c6d0p0 c5d1p0 c4d1p0 c4d0p0
2009-01-05.15:56:32 zfs create bfpool/bofang
2009-01-05.15:56:59 zfs set compression=on bfpool/bofang
2009-01-06.10:24:37 zfs destroy bfpool/bofang
2009-01-06.10:25:41 zfs create bfpool/storm
2009-01-06.10:25:49 zfs create bfpool/temp
2009-01-06.10:27:16 zfs set compression=on bfpool/storm
2009-01-06.10:27:22 zfs set compression=on bfpool/temp
2009-06-26.15:06:06 zfs create bfpool/vc
2009-06-26.15:06:21 zfs create bfpool/hdmedia
2009-06-26.15:06:30 zfs create bfpool/library
2009-06-26.15:06:39 zfs create bfpool/codec
2009-06-26.15:06:46 zfs create bfpool/tools
2009-06-26.15:06:54 zfs create bfpool/software
2009-06-26.15:07:02 zfs create bfpool/opensource
2009-06-26.15:07:11 zfs create bfpool/bbs
2009-06-26.15:07:18 zfs create bfpool/user
2009-06-26.15:07:27 zfs set compression=on bfpool/vc
2009-06-26.15:07:35 zfs set compression=on bfpool/hdmedia
2009-06-26.15:07:43 zfs set compression=on bfpool/library
2009-06-26.15:07:52 zfs set compression=on bfpool/codec
2009-06-26.15:08:01 zfs set compression=on bfpool/tools
2009-06-26.15:08:10 zfs set compression=on bfpool/software
2009-06-26.15:08:18 zfs set compression=on bfpool/opensource
2009-06-26.15:08:26 zfs set compression=on bfpool/bbs
2009-06-26.15:08:33 zfs set compression=on bfpool/user
2010-03-02.15:11:33 zpool replace bfpool c5d1p0

History for 'rpool':
2009-01-05.14:02:50 zpool create -f rpool c5d0s0
2009-01-05.14:02:50 zfs set org.opensolaris.caiman:install=busy rpool
2009-01-05.14:02:50 zfs create -b 4096 -V 1023m rpool/swap
2009-01-05.14:02:50 zfs create -b 131072 -V 1023m rpool/dump
2009-01-05.14:03:19 zfs set mountpoint=/a/export rpool/export
2009-01-05.14:03:19 zfs set mountpoint=/a/export/home rpool/export/home
2009-01-05.14:03:19 zfs set mountpoint=/a/export/home/mike 
rpool/export/home/mike
2009-01-05.14:17:32 zpool set bootfs=rpool/ROOT/opensolaris rpool
2009-01-05.14:18:54 zfs set org.opensolaris.caiman:install=ready rpool
2009-01-05.14:18:55 zfs set mountpoint=/export/home/mike rpool/export/home/mike
2009-01-05.14:18:55 zfs set mountpoint=/export/home rpool/export/home
2009-01-05.14:18:55 zfs set mountpoint=/export rpool/export


r...@opensolaris2:~# zpool get all bfpool
NAMEPROPERTY   VALUE   SOURCE
bfpool  size   4.98T   -
bfpool  used   3.80T   -
bfpool  available  1.18T   -
bfpool  capacity   76% -
bfpool  altroot-   default
bfpool  health ONLINE  -
bfpool  guid   8117798173515948167  -
bfpool  version13  default
bfpool  bootfs -   default
bfpool  delegation on  default
bfpool  autoreplaceoff default
bfpool  cachefile  -   default
bfpool  failmode   waitdefault
bfpool  listsnapshots  off default


 prtdiag -v
System Configuration: MICRO-STAR INTERNATIONAL CO.,LTD MS-7519
BIOS Configuration: American Megatrends Inc. V1.6 09/17/2008

 Processor Sockets 

Version  Location Tag
 --
Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz CPU 1

Re: [zfs-discuss] bit-flipping in RAM...

2010-03-31 Thread Robert Milkowski



On 31/03/2010 10:27, Erik Trimble wrote:

Orvar's post over in opensol-discuss has me thinking:

After reading the paper and looking at design docs, I'm wondering if
there is some facility to allow for comparing data in the ARC to it's
corresponding checksum. That is, if I've got the data I want in the ARC,
how can I be sure it's correct (and free of hardware memory errors)? I'd
assume the way is to also store absolutely all the checksums for all
blocks/metadatas being read/written in the ARC (which, of course, means
that only so much RAM corruption can be compensated for), and do a
validation when that every time that block is used/written from the ARC.
You'd likely have to do constant metadata consistency checking, and
likely have to hold multiple copies of metadata in-ARC to compensate for
possible corruption. I'm assuming that this has at least been explored,
right?


A subset of this is already done. The ARC keeps its own in memory 
checksum (because some buffers in the ARC are not yet on stable 
storage so don't have a block pointer checksum yet).


http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c 



arc_buf_freeze()
arc_buf_thaw()
arc_cksum_verify()
arc_cksum_compute()

It isn't done on every access but it can detect in memory corruption - 
I've seen it happen on several occasions but all due to errors in my 
code not bad physical memory.


Doing in more frequently could cause a significant performance problem.



or there might be an extra zpool-level (or system-wide) property to
enable checking checksums on every access from the ARC - there would be a
significant performance impact, but it might be acceptable for really
paranoid folks, especially with modern hardware.


--
Robert Milkowski
http://milek.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Tim Cook
On Wed, Mar 31, 2010 at 6:31 AM, Edward Ned Harvey
solar...@nedharvey.comwrote:

   Nobody knows any way for me to remove my unmirrored
   log device.  Nobody knows any way for me to add a mirror to it (until
 
  Since snv_125 you can remove log devices. See
  http://bugs.opensolaris.org/view_bug.do?bug_id=6574286
 
  I've used this all the time during my testing and was able to remove
  both
  mirrored and unmirrored log devices without any problems (and without
  reboot). I'm using snv_134.

 Aware.  Opensolaris can remove log devices.  Solaris cannot.  Yet.  But if
 you want your server in production, you can get a support contract for
 solaris.  Opensolaris cannot.



According to who?

http://www.opensolaris.com/learn/features/availability/

Full production level support

Both Standard and Premium support offerings are available for deployment of
Open HA Cluster 2009.06 with OpenSolaris 2009.06 with following
configurations:


--Tim
http://www.opensolaris.com/learn/features/availability/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Bob Friesenhahn

On Wed, 31 Mar 2010, Tim Cook wrote:


http://www.opensolaris.com/learn/features/availability/

  Full production level support

Both Standard and Premium support offerings are available for 
deployment of Open HA Cluster 2009.06 with OpenSolaris 2009.06 with 
following configurations:


This formal OpenSolaris release is too ancient to do him any good.
In fact, zfs-wise, it lags the Solaris 10 releases.


If there is ever another OpenSolaris formal release, then the 
situation will be different.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Bob Friesenhahn

On Wed, 31 Mar 2010, Karsten Weiss wrote:


But frankly at the moment I care the most about the single-threaded case
because if we put e.g. user homes on this server I think they would be
severely disappointed if they would have to wait 2m42s just to extract a rather
small 50 MB tarball. The default 7m40s without SSD log were unacceptable
and we were hoping that the F20 would make a big difference and bring the
performance down to acceptable runtimes. But IMHO 2m42s is still too slow
and disabling the ZIL seems to be the only option.


Is extracting 50 MB tarballs something that your users do quite a lot 
of?  Would your users be concerned if there was a possibility that 
after extracting a 50 MB tarball that files are incomplete, whole 
subdirectories are missing, or file permissions are incorrect?


The Sun Flash Accelerator F20 was not strictly designed as a zfs log 
device.  It was originally designed to be a database accelerator.  It 
was repurposed for zfs slog use because it works.  It is a bit wimpy 
for bulk data.  If you need fast support for bulk writes, perhaps you 
need something like STEC's very expensive ZEUS SSD drive.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread David Magda
On Tue, March 30, 2010 22:40, Edward Ned Harvey wrote:

 Here's a snippet from man zpool.  (Latest version available today in
 solaris)

 zpool remove pool device ...
 Removes the specified device from the pool. This command
 currently  only  supports  removing hot spares and cache
 devices. Devices that are part of a mirrored  configura-
 tion  can  be  removed  using  the zpool detach command.
 Non-redundant and raidz devices cannot be removed from a
 pool.

 So you think it would be ok to shutdown, physically remove the log device,
 and then power back on again, and force import the pool?  So although

A cache device is for the L2ARC, a log device is for ZIL. Log devices
are removable as of snv_125 (mentioned in another e-mail).

If you want log removal in Solaris proper, and you have a support account,
call up and ask that CR 6574286 be fixed:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6574286


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Tim Cook
On Wed, Mar 31, 2010 at 9:47 AM, Bob Friesenhahn 
bfrie...@simple.dallas.tx.us wrote:

 On Wed, 31 Mar 2010, Tim Cook wrote:


 http://www.opensolaris.com/learn/features/availability/

  Full production level support

 Both Standard and Premium support offerings are available for deployment
 of Open HA Cluster 2009.06 with OpenSolaris 2009.06 with following
 configurations:


 This formal OpenSolaris release is too anchient to do him any good. In
 fact, zfs-wise, it lags the Solaris 10 releases.

 If there is ever another OpenSolaris formal release, then the situation
 will be different.

 Bob


Cmon now, have a little faith.  It hasn't even slipped past March yet :)  Of
course it'd be way more fun if someone from Sun threw caution to the wind
and told us what the hold-up is *cough*oracle*cough*.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bit-flipping in RAM...

2010-03-31 Thread Bob Friesenhahn

On Wed, 31 Mar 2010, Robert Milkowski wrote:


or there might be an extra zpool level (or system wide) property to enable 
checking checksums onevery access from ARC - there will be a siginificatn 
performance impact but then it might be acceptable for really paranoid folks 
especially with modern hardware.


How would this checking take place for memory mapped files?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Karl Katzke
 Allow me to clarify a little further, why I care about this so much.  I have
 a solaris file server, with all the company jewels on it.  I had a pair of
 intel X.25 SSD mirrored log devices.  One of them failed.  The replacement
 device came with a newer version of firmware on it.  Now, instead of
 appearing as 29.802 Gb, it appears at 29.801 Gb.  I cannot zpool attach.
 New device is too small.
 
 So apparently I'm the first guy this happened to.  Oracle is caught totally
 off guard.  They're pulling their inventory of X25's from dispatch
 warehouses, and inventorying all the firmware versions, and trying to figure
 it all out.  Meanwhile, I'm still degraded.  Or at least, I think I am.

This isn't the only problem that SnOracle has had with the X25s. We managed to 
reproduce a problem with the SSDs as ZIL on an x4250. An I/O error of some sort 
caused a retryable write error ... which brought throughput to 0 as if a PCI 
bus reset had occurred. 

Here's a sample of our output... you might want to check and see if you're 
getting similar errors. 

Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 365881 kern.info] 
/p...@0,0/pci8086,2...@4/pci111d,8...@0/pci111d,8...@4/pci1000,3...@0 (mpt1):
Jan 10 21:36:52 tips-fs1.tamu.edu   Log info 31126000 received for target 
15.
Jan 10 21:36:52 tips-fs1.tamu.edu   scsi_status=0, ioc_status=804b, 
scsi_state=c
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 365881 kern.info] 
/p...@0,0/pci8086,2...@4/pci111d,8...@0/pci111d,8...@4/pci1000,3...@0 (mpt1):
Jan 10 21:36:52 tips-fs1.tamu.edu   Log info 31126000 received for target 
15.
Jan 10 21:36:52 tips-fs1.tamu.edu   scsi_status=0, ioc_status=804b, 
scsi_state=c
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,2...@4/pci111d,8...@0/pci111d,8...@4/pci1000,3...@0/s...@f,0 
(sd28):
Jan 10 21:36:52 tips-fs1.tamu.edu   Error for Command: write Error Level: 
Retryable
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice] Requested 
Block: 8448  Error Block: 8448
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice] Vendor: ATA 
   Serial Number: CVEM902401BA
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice] Sense Key: Unit 
Attention
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice] ASC: 0x29 
(power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0


We were lucky to catch the problem before we went live. There was an 
exceptionally large number of I/O errors.

Sun has not gotten back to me with a resolution for this problem yet, but they 
were able to reproduce the issue. 

-K 

Karl Katzke
Systems Analyst II
TAMU / DRGS

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool split problem?

2010-03-31 Thread lori.alt

On 03/31/10 03:50 AM, Damon Atkins wrote:

Why do we still need /etc/zfs/zpool.cache file???
(I could understand it was useful when zfs import was slow)

zpool import is now multi-threaded 
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191), hence a 
lot faster,  each disk contains the hostname 
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6282725) , if a 
pool contains the same hostname as the server then import it.

ie This bug should not be a problem any more 
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6737296 with a 
multi-threaded zpool import.

HA Storage should be changed to just do a zpool -h import mypool instead of 
using a private zpool.cache file (-h being ignored if the pool was imported by a 
different host, and maybe a noautoimport property is needed on a zpool so 
clustering software can decide to import it by hand as it was)

And therefore this zpool split problem would be fixed.
   
The problem with splitting a root pool goes beyond the issue of the 
zpool.cache file.  If you look at the comments for 6939334 
http://monaco.sfbay.sun.com/detail.jsf?cr=6939334, you will see other 
files whose content is not correct when a root pool is renamed or split.


I'm not questioning your logic about whether zpool.cache is still 
needed.  I'm only pointing out that eliminating the zpool.cache file 
would not enable root pools to be split.  More work is required for that.


Lori
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
 Would your users be concerned if there was a possibility that
 after extracting a 50 MB tarball that files are incomplete, whole
 subdirectories are missing, or file permissions are incorrect?

Correction:  Would your users be concerned if there was a possibility that
after extracting a 50MB tarball *and having a server crash* then files could
be corrupted as described above.

If you disable the ZIL, the filesystem still stays correct in RAM, and the
only way you lose any data such as you've described, is to have an
ungraceful power down or reboot.

The advice I would give is:  Do zfs autosnapshots frequently (say ... every
5 minutes, keeping the most recent 2 hours of snaps) and then run with no
ZIL.  If you have an ungraceful shutdown or reboot, rollback to the latest
snapshot ... and rollback once more for good measure.  As long as you can
afford to risk 5-10 minutes of the most recent work after a crash, then you
can get a 10x performance boost most of the time, and no risk of the
aforementioned data corruption.
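
As a rough sketch of that snapshot-and-rollback routine (the dataset name,
schedule and retention below are assumptions, and the stock zfs-auto-snapshot
service can handle the snapshot part for you):

# run from cron every 5 minutes; keep roughly the last 2 hours (24 snapshots)
zfs snapshot tank/export@auto-`date +%Y%m%d-%H%M`
zfs list -H -t snapshot -o name -S creation -r tank/export | tail +25 | xargs -n1 zfs destroy

# after an ungraceful shutdown, roll back to a known-good snapshot (name is an example)
zfs rollback -r tank/export@auto-20100331-1150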

Obviously, if you cannot accept 5-10 minutes of data loss, such as credit
card transactions, this would not be acceptable.  You'd need to keep your
ZIL enabled.  Also, if you have a svn server on the ZFS server, and you have
svn clients on other systems ... You should never allow your clients to
advance beyond the current rev of the server.  So again, you'd have to keep
the ZIL enabled on the server.

It all depends on your workload.  For some, the disabled ZIL is worth the
risk.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Bob Friesenhahn

On Wed, 31 Mar 2010, Tim Cook wrote:


If there is ever another OpenSolaris formal release, then the situation will be 
different.

Cmon now, have a little faith.  It hasn't even slipped past March 
yet :)  Of course it'd be way more fun if someone from Sun threw 
caution to the wind and told us what the hold-up is 
*cough*oracle*cough*.


Oracle is a total cold boot for me.  Everything they have put on 
their web site seems carefully designed to cast fear and panic into 
the former Sun customer base and cause substantial doubt, dismay, and 
even terror.  I don't know what I can and can't trust.  Every bit of 
trust that Sun earned with me over the past 19 years is clean-slated.


Regardless, it seems likely that Oracle is taking time to change all 
of the copyrights, documentation, and logos to reflect the new 
ownership.  They are probably re-evaluating which parts should be 
included for free in OpenSolaris.  The name Sun is deeply embedded 
in Solaris.  All of the Solaris 10 packages include SUN in their 
name.


Yesterday I noticed that the Sun Studio 12 compiler (used to build 
OpenSolaris) now costs a minimum of $1,015/year.  The Premium 
service plan costs $200 more.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread David Magda
On Wed, March 31, 2010 12:23, Bob Friesenhahn wrote:

 Yesterday I noticed that the Sun Studio 12 compiler (used to build
 OpenSolaris) now costs a minimum of $1,015/year.  The Premium
 service plan costs $200 more.

I feel a great disturbance in the force. It is as if a great multitude of
developers screamed and then went out and downloaded GCC.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Tim Cook
On Wed, Mar 31, 2010 at 11:23 AM, Bob Friesenhahn 
bfrie...@simple.dallas.tx.us wrote:

 On Wed, 31 Mar 2010, Tim Cook wrote:


 If there is ever another OpenSolaris formal release, then the situation
 will be different.

 Cmon now, have a little faith.  It hasn't even slipped past March yet :)
  Of course it'd be way more fun if someone from Sun threw caution to the
 wind and told us what the hold-up is *cough*oracle*cough*.


 Oracle is a total cold boot for me.  Everything they have put on their
 web site seems carefully designed to cast fear and panic into the former Sun
 customer base and cause substantial doubt, dismay, and even terror.  I don't
 know what I can and can't trust.  Every bit of trust that Sun earned with me
 over the past 19 years is clean-slated.

 Regardless, it seems likely that Oracle is taking time to change all of the
 copyrights, documentation, and logos to reflect the new ownership.  They are
 probably re-evaluating which parts should be included for free in
 OpenSolaris.  The name Sun is deeply embedded in Solaris.  All of the
 Solaris 10 packages include SUN in their name.

 Yesterday I noticed that the Sun Studio 12 compiler (used to build
 OpenSolaris) now costs a minimum of $1,015/year.  The Premium service plan
 costs $200 more.

 Bob
 --
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Where did you see that?  It looks to be free to me:
Sun Studio 12 Update 1 - FREE for SDN members.

SDN members can download a free, full-license copy of Sun Studio 12 Update
1.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Chris Ridd
On 31 Mar 2010, at 17:23, Bob Friesenhahn wrote:

 Yesterday I noticed that the Sun Studio 12 compiler (used to build 
 OpenSolaris) now costs a minimum of $1,015/year.  The Premium service plan 
 costs $200 more.

The download still seems to be a free, full-license copy for SDN members; the 
$1015 you quote is for the standard Sun Software service plan. Is a service 
plan now *required*, a la Solaris 10?

Cheers,

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool split problem?

2010-03-31 Thread Richard Elling
On Mar 31, 2010, at 2:50 AM, Damon Atkins wrote:

 Why do we still need /etc/zfs/zpool.cache file??? 
 (I could understand it was useful when zfs import was slow)

Yes. Imagine the case where your server has access to hundreds of LUs.
If you must probe each one, then booting can take a long time. If you go
back in history you will find many cases where probing all LUs at boot was
determined to be a bad thing.
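
For reference, a sketch of how a pool can be kept out of the default cache so it
is not auto-imported at boot (the pool name and cache path are made up):

zpool create -o cachefile=none mypool c2t0d0 c3t0d0
zpool set cachefile=/var/cluster/zpool.cache mypool    # or use a private cache file
zpool import -o cachefile=none mypool                  # import without updating /etc/zfs/zpool.cache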

 zpool import is now multi-threaded 
 (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191), hence a 
 lot faster,  each disk contains the hostname 
 (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6282725) , if a 
 pool contains the same hostname as the server then import it.
 
 ie This bug should not be a problem any more 
 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6737296 with a 
 multi-threaded zpool import.
 
 HA Storage should be changed to just do a zpool -h import mypool instead of 
 using a private zpool.cache file (-h being ignored if the pool was imported by 
 a different host, and maybe a noautoimport property is needed on a zpool so 
 clustering software can decide to import it by hand as it was)
 
 And therefore this zpool split problem would be fixed.

There is also a use case where the storage array makes a block-level
copy of a LU. It would be a bad thing to discover that on a probe and
attempt import.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool split problem?

2010-03-31 Thread Frank Middleton

On 03/31/10 12:21 PM, lori.alt wrote:


The problem with splitting a root pool goes beyond the issue of the
zpool.cache file. If you look at the comments for 6939334
http://monaco.sfbay.sun.com/detail.jsf?cr=6939334, you will see other
files whose content is not correct when a root pool is renamed or split.


6939334 seems to be inaccessible outside of Sun. Could you
list the comments here?

Thanks
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Tim Cook
On Wed, Mar 31, 2010 at 11:39 AM, Chris Ridd chrisr...@mac.com wrote:

 On 31 Mar 2010, at 17:23, Bob Friesenhahn wrote:

  Yesterday I noticed that the Sun Studio 12 compiler (used to build
 OpenSolaris) now costs a minimum of $1,015/year.  The Premium service plan
 costs $200 more.

 The download still seems to be a free, full-license copy for SDN members;
 the $1015 you quote is for the standard Sun Software service plan. Is a
 service plan now *required*, a la Solaris 10?

 Cheers,

 Chris



It's still available in the opensolaris repo, and I see no license reference
stating you have to have a support contract, so I'm guessing no...

*Several releases of Sun Studio Software are available in the OpenSolaris
repositories. The following list shows you how to download and install each
release, and where you can find the documentation for the release:*

   - *Sun Studio 12 Update 1:** The Sun Studio 12 Update 1 release is the
   latest full production release of Sun Studio software. It has recently been
   added to the OpenSolaris IPS repository.

   To install this release in your OpenSolaris 2009.06 environment using the
   Package Manager:*

*
*
--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VMware client solaris 10, RAW physical disk and zfs snapshots problem - all created snapshots are equal to zero.

2010-03-31 Thread Edward Ned Harvey
 I did those test and here are results:
 
 r...@sl-node01:~# zfs list
 NAMEUSED  AVAIL  REFER  MOUNTPOINT
 mypool01   91.9G   136G23K  /mypool01
 mypool01/storage01 91.9G   136G  91.7G  /mypool01/storage01
 mypool01/storag...@30032010-1  0  -  91.9G  -
 mypool01/storag...@30032010-2  0  -  91.9G  -
 mypool01/storag...@30032010-3  2.15M  -  91.7G  -
 mypool01/storag...@30032010-441K  -  91.7G  -
 mypool01/storag...@30032010-5  1.17M  -  91.7G  -
 mypool01/storag...@30032010-6  0  -  91.7G  -
 mypool02   91.9G   137G24K  /mypool02
 mypool02/copies  23K   137G23K  /mypool02/copies
 mypool02/storage01 91.9G   137G  91.9G  /mypool02/storage01
 mypool02/storag...@30032010-1  0  -  91.9G  -
 mypool02/storag...@30032010-2  0  -  91.9G  -
 
 As you can see I have differences for snapshot 4,5 and 6 as you
 suggested to make a test. But I can see also changes on snapshot no. 3
 - I complain about this snapshot because I could not see differences
 on it last night! Now it shows.

Well, the first thing you should know is this:  Suppose you take a snapshot,
and create some files.  Then the snapshot still occupies no disk space.
Everything is in the current filesystem.  The only time a snapshot occupies
disk space is when the snapshot contains data that is missing from the
current filesystem.  That is - If you rm or overwrite some files in the
current filesystem, then you will see the size of the snapshot growing.
Make sense?
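
If it helps, here is a tiny demonstration of that accounting (pool and dataset
names are made up):

zfs create tank/demo
mkfile 1g /tank/demo/bigfile
zfs snapshot tank/demo@before
zfs list -o name,used,refer tank/demo@before    # USED is ~0; the data lives in the live fs
rm /tank/demo/bigfile
zfs list -o name,used,refer tank/demo@before    # USED grows toward 1G once the txg commits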

That brings up a question though.  If you did the commands as I wrote them,
it would mean you created a 1G file, took a snapshot, and rm'd the file.
Therefore your snapshot should contain at least 1G.  I am confused by the
fact that you only have 1-2M in your snapshot.  Maybe I messed up the
command I told you, or you messed up entering it on the system, and you only
created a 1M file, instead of a 1G file?


 What is still strange: snapshots 1 and 2 are the oldest but they are
 still equal to zero! After changes and snapshots 3,4,5 and 6 I would
 expect that snapshots 1 and 2 are recording also changes on the
 storage01 file system, but not... could it be possible that snapshots
 1 and 2 are somehow broken?

If some file existed during all of the old snapshots, and you destroy your
later snapshots, then the space charged to those later snapshots gets
reassigned to the older snapshots that still reference the data.  Only when
you destroy the oldest snapshot that referenced that data is the data truly
gone from all of the snapshots.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Need advice on handling 192 TB of Storage on hardware raid storage

2010-03-31 Thread Richard Elling
On Mar 31, 2010, at 2:05 AM, Dedhi Sujatmiko wrote:
 Dear all,
 
 I have a hardware based array storage with a capacity of 192TB and being 
 sliced into 64 LUNs of 3TB.
 What will be the best way to configure the ZFS on this? Of course we are not 
 requiring the self healing capability of the ZFS. We just want the capability 
 of handling big size file system and performance.

Answers below based on the assumption that you value performance over space 
over 
dependability.

 Currently we are running using Solaris 10 May 2009 (Update 7), and configure 
 the ZFS where :

First, upgrade or patch to the latest Solaris 10 kernel/zfs bits.

 a. 1 hardware LUN (3TB) will become 1 zpool

The RAID configuration of the LUs will be critical. ZFS can be easily configured
to overrun most RAID arrays using modest server hardware.

 b. 1 zpool will become 1 ZFS file system
 c. 1 ZFS file system will become 1 mountpoint (obviously).

I see no reason to do this.  For best performance, put multiple LUs into the 
pool.
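
For example (a sketch only -- device names are placeholders for the array LUs,
and this assumes redundancy is handled by the array itself):

zpool create bigtank c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0
zfs create bigtank/fs01
zfs create bigtank/fs02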

 The problem we have is that when the customer runs the I/O in parallel to the 
 64 file systems, the kernel usage (%sys) shot up very high to the 90% region 
 and the IOPS level is degrading. It can be seen also that during that time 
 the storage's own front end CPU does not change much, which means the 
 bottleneck is not on the hardware storage level, but somewhere inside the 
 Solaris box.

The cause of the high system time should be investigated.  I have seen huge
amounts of I/O to RAID arrays consume relatively little system time.
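
A few starting points for chasing the %sys time (standard Solaris observability
tools; the DTrace one-liner samples kernel stacks for 30 seconds):

mpstat 5                 # which CPUs burn sys time, cross-calls, mutex spins
prstat -mL 5             # per-thread microstates (SYS, LCK, LAT)
dtrace -n 'profile-1001 /arg0/ { @[stack()] = count(); } tick-30s { trunc(@, 20); exit(0); }'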

 Is there any experience of having the similar setup like the one I have? Or 
 anybody can point me to an information on what will be the best way to deal 
 with the hardware storage on this size?

In general, spread the I/O across all resources to get the best overall 
response time.

 Please advice and thanks in advance

HTH,
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Bob Friesenhahn

On Wed, 31 Mar 2010, Chris Ridd wrote:

Yesterday I noticed that the Sun Studio 12 compiler (used to build 
OpenSolaris) now costs a minimum of $1,015/year.  The Premium 
service plan costs $200 more.


The download still seems to be a free, full-license copy for SDN 
members; the $1015 you quote is for the standard Sun Software 
service plan. Is a service plan now *required*, a la Solaris 10?


There is no telling.  Everything is subject to evaluation by Oracle 
and it is not clear which parts of the web site are confirmed and 
which parts are still subject to change.  In the past it was free to 
join SDN but if one was to put an 'M' in front of that SDN, then there 
would be a substantial yearly charge for membership (up to $10,939 USD 
per year according to Wikipedia).  This is a world that Oracle has 
been commonly exposed to in the past.  Not everyone who uses a 
compiler qualifies as a developer.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Chris Ridd

On 31 Mar 2010, at 17:50, Bob Friesenhahn wrote:

 On Wed, 31 Mar 2010, Chris Ridd wrote:
 
 Yesterday I noticed that the Sun Studio 12 compiler (used to build 
 OpenSolaris) now costs a minimum of $1,015/year.  The Premium service 
 plan costs $200 more.
 
 The download still seems to be a free, full-license copy for SDN members; 
 the $1015 you quote is for the standard Sun Software service plan. Is a 
 service plan now *required*, a la Solaris 10?
 
 There is no telling.  Everything is subject to evaluation by Oracle and it is 
 not clear which parts of the web site are confirmed and which parts are still 
 subject to change.  In the past it was free to join SDN but if one was to put 
 an 'M' in front of that SDN, then there would be a substantial yearly charge 
 for membership (up to $10,939 USD per year according to Wikipedia).  This is 
 a world that Oracle has been commonly exposed to in the past.  Not everyone 
 who uses a compiler qualifies as a developer.

Indeed, but Microsoft still give out free express versions of their tools. If 
memory serves, you're not allowed to distribute binaries built with them but 
otherwise they're not broken in any significant way.

Maybe this will also be the difference between Sun Studio and Sun Studio 
Express.

Perhaps we should take this to tools-compilers.

Cheers,

Chris


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool split problem?

2010-03-31 Thread lori.alt

On 03/31/10 10:42 AM, Frank Middleton wrote:

On 03/31/10 12:21 PM, lori.alt wrote:


The problem with splitting a root pool goes beyond the issue of the
zpool.cache file. If you look at the comments for 6939334
http://monaco.sfbay.sun.com/detail.jsf?cr=6939334, you will see other
files whose content is not correct when a root pool is renamed or split.


6939334 seems to be inaccessible outside of Sun. Could you
list the comments here?

Thanks



Here they are:


Other issues:

* Swap is still pointing to rpool because /etc/vfstab is never updated.

* Likewise, dumpadm still has dump zvols configured with the original pool.

* The /{pool}/boot/menu.lst (on sparc), and /{pool}/boot/grub/menu.lst (on x86) 
still reference the original pool's bootfs.  Note that the 'bootfs' property in 
the pool itself is actually correct, because we store the object number and not 
the name.


While each one of these issues is individually fixable, there's no way to 
prevent new issues coming up in the future, thus breaking zpool split.  It 
might be more advisable to prevent splitting of root pools.

*** (#2 of 3): 2010-03-30 18:48:54 GMT+00:00  mark.musa...@sun.com

yes, these look like the kind of issues that flash archive install had to 
solve:  all the tweaks that need to be made to a root file system to get it to 
adjust to  living on different hardware.  In addition to the ones listed above, 
there are all the device specific files in /etc/path_to_inst, /devices, and so 
on.  This is not a trivial problem.  Cloning root pools by the split mechanism 
is more of a project in its own right.  Is zfs split good for anything related 
to root disks?   I can't think of a use.  If there is a need for a disaster 
recovery disk, it's probably best to just remove one of the mirrors (without 
doing a split operation) and stash it for later use.

*** (#3 of 3): 2010-03-30 20:21:57 GMT+00:00  lori@sun.com


   


Lori


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] *SPAM* Re: zfs send/receive - actual performance

2010-03-31 Thread Kyle McDonald
On 3/27/2010 3:14 AM, Svein Skogen wrote:
 On 26.03.2010 23:55, Ian Collins wrote:
  On 03/27/10 09:39 AM, Richard Elling wrote:
  On Mar 26, 2010, at 2:34 AM, Bruno Sousa wrote:

  Hi,
 
  The jumbo-frames in my case give me a boost of around 2 mb/s, so it's
  not that much.
   
  That is about right.  IIRC, the theoretical max is about 4%
  improvement, for MTU of 8KB.
 

  Now i will play with link aggregation and see how it goes, and of
  course i'm counting that incremental replication will be slower...but
  since the amount of data would be much less probably it will still
  deliver a good performance.
   
  Probably won't help at all because of the brain dead way link
  aggregation has to
  work.  See Ordering of frames at
 
 http://en.wikipedia.org/wiki/Link_Aggregation_Control_Protocol#Link_Aggregation_Control_Protocol
 
 
 
  Arse, thanks for reminding me Richard! A single stream will only use one
  path in a LAG.

 Doesn't (Open)Solaris have the option of setting the aggregate up as a
 FEC or in roundrobin mode?

Solaris does offer what the Wiki describes as  L4 or port number based
hashing.
I'm not sure what FEC is, but when I asked, round-robin isn't available
as preserving packet ordering wouldn't be easy (possible?) that way.

  -Kyle


 //Svein

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] can't destroy snapshot

2010-03-31 Thread Ian Collins

On 04/ 1/10 01:51 AM, Charles Hedrick wrote:

We're getting the notorious cannot destroy ... dataset already exists. I've 
seen a number of reports of this, but none of the reports seem to get any response. 
Fortunately this is a backup system, so I can recreate the pool, but it's going to take 
me several days to get all the data back. Is there any known workaround?
   

Exactly what commands are you running and what errors do you see?

--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Miles Nordin
 rm == Robert Milkowski mi...@task.gda.pl writes:

rm This is not true. If ZIL device would die *while pool is
rm imported* then ZFS would start using z ZIL withing a pool and
rm continue to operate.

What you do not say is that a pool with a dead ZIL cannot be 
'import -f'd.  So, for example, if your rpool and slog are on the same
SSD, and it dies, you have just lost your whole pool.


pgp9E0wFxqcc4.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Miles Nordin
 rm == Robert Milkowski mi...@task.gda.pl writes:

rm the reason you get better performance out of the box on Linux
rm as NFS server is that it actually behaves like with disabled
rm ZIL

careful.

Solaris people have been slinging mud at Linux for things unfsd did in
spite of the fact that knfsd has been around for a decade.  And ``has
options to behave like the ZIL is disabled (sync/async in
/etc/exports)'' != ``always behaves like the ZIL is disabled''.

If you are certain about Linux NFS servers not preserving data for
hard mounts when the server reboots even with the 'sync' option which
is the default, please confirm, but otherwise I do not believe you.

rm Which is an expected behavior when you break NFS requirements
rm as Linux does out of the box.

wrong.  The default is 'sync' in /etc/exports.  The default has
changed, but the default is 'sync', and the whole thing is
well-documented.

rm What would be useful though is to be able to easily disable
rm ZIL per dataset instead of OS wide switch.

yeah, Linux NFS servers have that granularity for their equivalent
option.


pgpg1qLhwVTDs.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Wes Felter

Karsten Weiss wrote:


Knowing that 100s of users could do this in parallel with good performance
is nice but it does not improve the situation for the single user which only
cares for his own tar run. If there's anything else we can do/try to improve
the single-threaded case I'm all ears.


A MegaRAID card with write-back cache? It should also be cheaper than 
the F20.


Wes Felter

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] can't destroy snapshot

2010-03-31 Thread Charles Hedrick
# zfs destroy -r OIRT_BAK/backup_bad
cannot destroy 'OIRT_BAK/backup_...@annex-2010-03-23-07:04:04-bad': dataset 
already exists


No, there are no clones.
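
Two quick checks that sometimes explain this error, in case they help (the
dataset and snapshot names below are placeholders):

zfs get -r origin OIRT_BAK                    # any dataset showing a snapshot here is a clone of it
zfs holds OIRT_BAK/backup_bad@some-snapshot   # user holds (zfs hold/release) also block destroy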
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ2 configuration

2010-03-31 Thread Cindy Swearingen

Hi Ned,

If you look at the examples on the page that you cite, they start
with single-parity RAIDZ examples and then move to double-parity RAIDZ
example with supporting text, here:

http://docs.sun.com/app/docs/doc/819-5461/gcvjg?a=view

Can you restate the problem with this page?

Thanks,

Cindy


On 03/26/10 05:42, Edward Ned Harvey wrote:
Just because most people are probably too lazy to click the link, I’ll 
paste a phrase from that sun.com webpage below:


“Creating a single-parity RAID-Z pool is identical to creating a 
mirrored pool, except that the ‘raidz’ or ‘raidz1’ keyword is used 
instead of ‘mirror’.”


And

“zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0”

 

So … Shame on you, Sun, for doing this to your poor unfortunate 
readers.  It would be nice if the page were a wiki, or somehow able to 
have feedback submitted…


 

 

 

*From:* zfs-discuss-boun...@opensolaris.org 
[mailto:zfs-discuss-boun...@opensolaris.org] *On Behalf Of *Bruno Sousa

*Sent:* Thursday, March 25, 2010 3:28 PM
*To:* Freddie Cash
*Cc:* ZFS filesystem discussion list
*Subject:* Re: [zfs-discuss] RAIDZ2 configuration

 

Hmm...it might be completely wrong , but the idea of raidz2 vdev with 3 
disks came from the reading of 
http://docs.sun.com/app/docs/doc/819-5461/gcvjg?a=view .


This particular page has the following example :

*zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0*

# *zpool status -v tank*

  pool: tank

 state: ONLINE

 scrub: none requested

config:

 


NAME  STATE READ WRITE CKSUM

tank  ONLINE   0 0 0

  raidz2  ONLINE   0 0 0

c1t0d0ONLINE   0 0 0

c2t0d0ONLINE   0 0 0

c3t0d0ONLINE   0 0 0

 

So...what am i missing here? Just a bad example in the sun documentation 
regarding zfs?


Bruno

On 25-3-2010 20:10, Freddie Cash wrote:

On Thu, Mar 25, 2010 at 11:47 AM, Bruno Sousa bso...@epinfante.com 
mailto:bso...@epinfante.com wrote:


What do you mean by Using fewer than 4 disks in a raidz2 defeats the 
purpose of raidz2, as you will always be in a degraded mode? Does it 
mean that having 2 vdevs with 3 disks won't be redundant in the 
event of a drive failure?


 

raidz1 is similar to raid5 in that it is single-parity, and requires a 
minimum of 3 drives (2 data + 1 parity)


raidz2 is similar to raid6 in that it is double-parity, and requires a 
minimum of 4 drives (2 data + 2 parity)


 

IOW, a raidz2 vdev made up of 3 drives will always be running in 
degraded mode (it's missing a drive).


 


--

Freddie Cash
fjwc...@gmail.com mailto:fjwc...@gmail.com

--
This message has been scanned for viruses and
dangerous content by *MailScanner* http://www.mailscanner.info/, and is
believed to be clean.

 

 


___

zfs-discuss mailing list

zfs-discuss@opensolaris.org mailto:zfs-discuss@opensolaris.org

http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

  

 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Stuart Anderson
Edward Ned Harvey solaris2 at nedharvey.com writes:

 
 Allow me to clarify a little further, why I care about this so much.  I have
 a solaris file server, with all the company jewels on it.  I had a pair of
 intel X.25 SSD mirrored log devices.  One of them failed.  The replacement
 device came with a newer version of firmware on it.  Now, instead of
 appearing as 29.802 Gb, it appears at 29.801 Gb.  I cannot zpool attach.
 New device is too small.
 
 So apparently I'm the first guy this happened to.  Oracle is caught totally
 off guard.  They're pulling their inventory of X25's from dispatch
 warehouses, and inventorying all the firmware versions, and trying to figure
 it all out.  Meanwhile, I'm still degraded.  Or at least, I think I am.
 
 Nobody knows any way for me to remove my unmirrored log device.  Nobody
 knows any way for me to add a mirror to it (until they can locate a drive
 with the correct firmware.)  All the support people I have on the phone are
 just as scared as I am.  Well we could upgrade the firmware of your
 existing drive, but that'll reduce it by 0.001 Gb, and that might just
 create a time bomb to destroy your pool at a later date.  So we don't do
 it.
 
 Nobody has suggested that I simply shutdown and remove my unmirrored SSD,
 and power back on.
 

We ran into something similar with these drives in an X4170 that turned out to
be an issue with the preconfigured logical volumes on the drives. Once we made
sure all of our Sun PCI HBAs were running the exact same version of firmware
and recreated the volumes on new drives arriving from Sun, we got back into sync
on the X25-E device sizes.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ2 configuration

2010-03-31 Thread Bruno Sousa
Hi Cindy,

This whole issue started when I asked this list for opinions on how I should
create zpools. It seems that one of my initial ideas, creating a vdev
with 3 disks in a raidz configuration, is considered a nonsense configuration.
Somewhere along the way I defended my initial idea with the fact that
the Sun documentation gives such a configuration as an example, as seen
here :


*zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0*  at
http://docs.sun.com/app/docs/doc/819-5461/gcvjg?a=view

So if, conceptually, the idea of having a vdev with 3 disks within a raidz
configuration is a bad one, the official Sun documentation should not
have such an example. However, if people put such an example in the Sun
documentation, perhaps the whole idea is not that bad after all..

Can you provide anything on this subject?

Thanks,
Bruno




On 31-3-2010 23:49, Cindy Swearingen wrote:
 Hi Ned,

 If you look at the examples on the page that you cite, they start
 with single-parity RAIDZ examples and then move to double-parity RAIDZ
 example with supporting text, here:

 http://docs.sun.com/app/docs/doc/819-5461/gcvjg?a=view

 Can you restate the problem with this page?

 Thanks,

 Cindy


 On 03/26/10 05:42, Edward Ned Harvey wrote:
 Just because most people are probably too lazy to click the link,
 I’ll paste a phrase from that sun.com webpage below:

 “Creating a single-parity RAID-Z pool is identical to creating a
 mirrored pool, except that the ‘raidz’ or ‘raidz1’ keyword is used
 instead of ‘mirror’.”

 And

 “zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0”

  

 So … Shame on you, Sun, for doing this to your poor unfortunate
 readers.  It would be nice if the page were a wiki, or somehow able
 to have feedback submitted…

  

  

  

 *From:* zfs-discuss-boun...@opensolaris.org
 [mailto:zfs-discuss-boun...@opensolaris.org] *On Behalf Of *Bruno Sousa
 *Sent:* Thursday, March 25, 2010 3:28 PM
 *To:* Freddie Cash
 *Cc:* ZFS filesystem discussion list
 *Subject:* Re: [zfs-discuss] RAIDZ2 configuration

  

 Hmm...it might be completely wrong , but the idea of raidz2 vdev with
 3 disks came from the reading of
 http://docs.sun.com/app/docs/doc/819-5461/gcvjg?a=view .

 This particular page has the following example :

 *zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0*

 # *zpool status -v tank*

   pool: tank

  state: ONLINE

  scrub: none requested

 config:

  

 NAME  STATE READ WRITE CKSUM

 tank  ONLINE   0 0 0

   raidz2  ONLINE   0 0 0

 c1t0d0ONLINE   0 0 0

 c2t0d0ONLINE   0 0 0

 c3t0d0ONLINE   0 0 0

  

 So...what am i missing here? Just a bad example in the sun
 documentation regarding zfs?

 Bruno

 On 25-3-2010 20:10, Freddie Cash wrote:

 On Thu, Mar 25, 2010 at 11:47 AM, Bruno Sousa bso...@epinfante.com
 mailto:bso...@epinfante.com wrote:

 What do you mean by Using fewer than 4 disks in a raidz2 defeats the
 purpose of raidz2, as you will always be in a degraded mode? Does
 it mean that having 2 vdevs with 3 disks won't be redundant in
 the event of a drive failure?

  

 raidz1 is similar to raid5 in that it is single-parity, and requires
 a minimum of 3 drives (2 data + 1 parity)

 raidz2 is similar to raid6 in that it is double-parity, and requires
 a minimum of 4 drives (2 data + 2 parity)

  

 IOW, a raidz2 vdev made up of 3 drives will always be running in
 degraded mode (it's missing a drive).

  

 -- 

 Freddie Cash
 fjwc...@gmail.com mailto:fjwc...@gmail.com

 -- 
 This message has been scanned for viruses and
 dangerous content by *MailScanner* http://www.mailscanner.info/,
 and is
 believed to be clean.

  

  

 ___

 zfs-discuss mailing list

 zfs-discuss@opensolaris.org mailto:zfs-discuss@opensolaris.org

 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

  
  


 

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss





smime.p7s
Description: S/MIME Cryptographic Signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Robert Milkowski

On 31/03/2010 17:31, Bob Friesenhahn wrote:

On Wed, 31 Mar 2010, Edward Ned Harvey wrote:


Would your users be concerned if there was a possibility that
after extracting a 50 MB tarball that files are incomplete, whole
subdirectories are missing, or file permissions are incorrect?


Correction:  Would your users be concerned if there was a 
possibility that
after extracting a 50MB tarball *and having a server crash* then 
files could

be corrupted as described above.

If you disable the ZIL, the filesystem still stays correct in RAM, 
and the

only way you lose any data such as you've described, is to have an
ungraceful power down or reboot.


Yes, of course.  Suppose that you are a system administrator.  The 
server spontaneously reboots.  A corporate VP (CFO) comes to you and 
says that he had just saved the critical presentation to be given to 
the board of the company (and all shareholders) later that day, and 
now it is gone due to your spontaneous server reboot.  Due to a 
delayed financial statement, the corporate stock plummets.  What are 
you to do?  Do you expect that your employment will continue?


Reliable NFS synchronous writes are good for the system administrators.


Well, it really depends on your environment.
There is a place for Oracle databases and there is a place for MySQL; you 
don't really need to cluster everything, and then there are 
environments where disabling the ZIL is perfectly acceptable.


One such case is when you need to re-import a database or recover 
lots of files over NFS - your service is down and disabling the ZIL makes the 
recovery MUCH faster. Then there are cases when leaving the ZIL disabled 
is acceptable as well.
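
For completeness, the switch today is system-wide (the per-dataset control is
not integrated yet); a sketch of how it is usually thrown -- treat the tunable
as an assumption and check it against your release:

set zfs:zil_disable = 1            # in /etc/system, takes effect after reboot
echo zil_disable/W0t1 | mdb -kw    # or live, then remount the affected filesystems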


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Robert Milkowski

On 31/03/2010 17:22, Edward Ned Harvey wrote:


The advice I would give is:  Do zfs autosnapshots frequently (say ... every
5 minutes, keeping the most recent 2 hours of snaps) and then run with no
ZIL.  If you have an ungraceful shutdown or reboot, rollback to the latest
snapshot ... and rollback once more for good measure.  As long as you can
afford to risk 5-10 minutes of the most recent work after a crash, then you
can get a 10x performance boost most of the time, and no risk of the
aforementioned data corruption.
   


I don't really get it - rolling back to the last snapshot doesn't really 
improve things here; it actually makes them worse, as now you are going to 
lose even more data. Keep in mind that currently the maximum time after 
which ZFS commits a transaction group is 30s - ZIL or not. So with the ZIL 
disabled, in the worst-case scenario you should lose no more than the last 30-60s. 
You can tune that down if you want. Rolling back to a snapshot will only 
make it worse. Also keep in mind that this is a worst-case scenario 
- it may well be that there were no outstanding transactions at all. Basically it 
all comes down to a risk assessment, an impact assessment, and a cost.


Unless you are talking about doing regular snapshots and making sure 
that the application is consistent while doing so - for example putting all 
Oracle tablespaces into hot backup mode before taking a snapshot... 
otherwise it doesn't really make sense.
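
As an aside, the 30s commit interval mentioned above is controlled by the
zfs_txg_timeout tunable; a sketch of turning it down (the tunable name may vary
between builds, so verify it against your release):

set zfs:zfs_txg_timeout = 5           # in /etc/system
echo zfs_txg_timeout/W0t5 | mdb -kw   # or on a live system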


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bit-flipping in RAM...

2010-03-31 Thread Robert Milkowski

On 31/03/2010 16:44, Bob Friesenhahn wrote:

On Wed, 31 Mar 2010, Robert Milkowski wrote:


or there might be an extra zpool level (or system wide) property to 
enable checking checksums onevery access from ARC - there will be a 
siginificatn performance impact but then it might be acceptable for 
really paranoid folks especially with modern hardware.


How would this checking take place for memory mapped files?



Well, and it wouldn't help if data were corrupted in an application 
internal buffer after read() succeeded, or just before an application 
does a write().


So I wasn't saying that it would work, or that it would work in all 
circumstances, but rather I was trying to say that it probably shouldn't 
be dismissed on the performance argument alone, as for some use cases with 
modern hardware it might well be that the performance will still be acceptable 
while providing better protection and a stronger data-correctness guarantee.


But even then, while the mmap() issue is probably solvable, the read() and 
write() cases are probably not.


--
Robert Milkowski
http://milek.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Robert Milkowski

On 31/03/2010 21:38, Miles Nordin wrote:

 rm  Which is an expected behavior when you break NFS requirements
 rm  as Linux does out of the box.

wrong.  The default is 'sync' in /etc/exports.  The default has
changed, but the default is 'sync', and the whole thing is
well-documented.
   


I double checked the documentation and you're right - the default has 
changed to sync.
I haven't found in which RH version it happened but it doesn't really 
matter.


So yes, I was wrong - the current default it seems to be sync on Linux 
as well.


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] benefits of zfs root over ufs root

2010-03-31 Thread Brett
Hi Folks,

I'm in a shop that's very resistant to change. The management here is looking 
for major justification of a move away from UFS to ZFS for root file systems. 
Does anyone know if there are any whitepapers/blogs/discussions extolling the 
benefits of ZFS root over UFS root?

Regards in advance
Rep
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bit-flipping in RAM...

2010-03-31 Thread Xin LI
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 2010/03/31 05:13, Darren J Moffat wrote:
 On 31/03/2010 10:27, Erik Trimble wrote:
 Orvar's post over in opensol-discuss has me thinking:

 After reading the paper and looking at design docs, I'm wondering if
 there is some facility to allow for comparing data in the ARC to it's
 corresponding checksum. That is, if I've got the data I want in the ARC,
 how can I be sure it's correct (and free of hardware memory errors)? I'd
 assume the way is to also store absolutely all the checksums for all
 blocks/metadatas being read/written in the ARC (which, of course, means
 that only so much RAM corruption can be compensated for), and do a
 validation when that every time that block is used/written from the ARC.
 You'd likely have to do constant metadata consistency checking, and
 likely have to hold multiple copies of metadata in-ARC to compensate for
 possible corruption. I'm assuming that this has at least been explored,
 right?
 
 A subset of this is already done. The ARC keeps its own in memory
 checksum (because some buffers in the ARC are not yet on stable storage
 so don't have a block pointer checksum yet).
 
 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c
 
 
 arc_buf_freeze()
 arc_buf_thaw()
 arc_cksum_verify()
 arc_cksum_compute()
 
 It isn't done on every access but it can detect in memory corruption -
 I've seen it happen on several occasions but all due to errors in my
 code not bad physical memory.
 
 Doing in more frequently could cause a significant performance problem.

Agreed.

I think it's probably not a very good idea to check it everywhere.  It
would be great if we could do some checks occasionally, especially for
critical data structures, but if it's the memory we cannot trust, how
can we trust the checksum checker to behave correctly?

I had some questions about the FAST paper mentioned by Erik which were
not answered during the conference, which makes me feel that the paper,
while it pointed out some interesting issues, failed to prove that this
is a real-world problem:

 - How probable is a bit flip on a non-ECC system?  Say, how many bits
would be flipped per terabyte processed, or per transaction, or some
similar measure?
 - Among these flipped bits, how many would land in a file system
buffer?  What happens when, say, the application's memory hits a flipped
bit while the file system itself has no problem with its buffer?
 - How much of a performance penalty would there be if we checked the
checksums every time the data is accessed?  How good would the check be
compared to ECC in terms of correctness?

Cheers,
- -- 
Xin LI delp...@delphij.nethttp://www.delphij.net/
FreeBSD - The Power to Serve!  Live free or die
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.14 (FreeBSD)

iQEcBAEBAgAGBQJLs+UZAAoJEATO+BI/yjfBfE0H/0+iG/pgrs/JNId814g5JMki
eZ2tJx2Lf7+DIlrHczvcwyWAtAke7ojUMeNEw6HIqMfTQHVcgMk2XNdxWZn0sJsy
PUPj9Qcg+nkHcewAoWvG0VUZN0fSBX1OtJcVG78Kt5drWmT+g5jiMH+BFCEAiISJ
Kcfswp9r0JbYmI010fwqugc74bAZnMhUXMCvvplJZUE3iaDCq499TanKIVmKu4vq
JsDNYXZT9Nqbb20DB4TKluauP1QVUJnBAeqfQCYZ/+CqK5+phnUgzyaBTiMKBHd0
Q0l1bvGEvjLRarlGk7/702Udu7HC4UKs09pKtBIb+cw8CmyYaZ8Vuth0Ri0drzM=
=S5WS
-END PGP SIGNATURE-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool split problem?

2010-03-31 Thread Damon Atkins
I assume the swap, dumpadm, and grub issues are because the pool has a different name now, 
but is it still a problem if you take it to a *different system*, boot off a CD, and change 
the name back to rpool? (Which is most likely unsupported, i.e. no help to get 
it working.)

Over 10 years ago (way before flash archive existed) I developed a script, 
used after splitting a mirror, which would remove most of the device tree and 
clean up path_to_inst etc. so it looked like the OS had just been installed and was about 
to do the reboot without the install CD. (Everything was still in there except 
for hardware-specific stuff. I no longer have the script and most likely would 
not do it again because it's not a supported install method.)

I still had to boot from CD on the new system and create the dev tree before 
booting off the disk for the first time, and then fix vfstab (but the vfstab 
fix should no longer be needed with a zfs rpool).

It would be nice for Oracle/Sun to produce a separate script which resets 
system/devices back to an install-like beginning, so you can move an OS disk with 
the current password file and software from one system to another and have it 
rebuild the device tree on the new system.

From memory (updated for zfs), something like this (see the sketch after this list):
zpool split rpool newrpool
mount newrpool
remove all non-package content (i.e. dynamically created content) from newrpool/dev and newrpool/devices
clean up newrpool/etc/path_to_inst
create /newrpool/reconfigure
remove all previous snapshots in newrpool
update beadm info inside newrpool
ensure grub is installed on the disk
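
A very rough sketch of those steps -- untested, device/dataset names are
assumptions, and (as this thread shows) splitting a root pool is not actually
supported:

zpool split rpool newrpool
# ... import newrpool and mount its root BE under /a, then:
rm -rf /a/dev/* /a/devices/*          # drop dynamically created device nodes
touch /a/reconfigure                  # force a reconfiguration boot
zfs list -H -t snapshot -o name -r newrpool | xargs -n1 zfs destroy
installgrub /a/boot/grub/stage1 /a/boot/grub/stage2 /dev/rdsk/c1t1d0s0
# path_to_inst cleanup and menu.lst/vfstab fixes deliberately omitted -- see the
# caveats quoted from 6939334 earlier in this thread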
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread David Magda

On Mar 31, 2010, at 19:41, Robert Milkowski wrote:

I double checked the documentation and you're right - the default  
has changed to sync.
I haven't found in which RH version it happened but it doesn't  
really matter.


From the SourceForge site:

Since version 1.0.1 of the NFS utilities tarball has changed the  
server export default to sync, then, if no behavior is specified  
in the export list (thus assuming the default behavior), a warning  
will be generated at export time.


http://nfs.sourceforge.net/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] benefits of zfs root over ufs root

2010-03-31 Thread Erik Trimble

Brett wrote:

Hi Folks,

I'm in a shop that's very resistant to change. The management here is looking 
for major justification of a move away from UFS to ZFS for root file systems. 
Does anyone know if there are any whitepapers/blogs/discussions extolling the 
benefits of ZFS root over UFS root?

Regards in advance
Rep
  

I can't give you any links, but here's a short list of advantages:

(1) all the standard ZFS advantages over UFS
(2) LiveUpgrade/beadm related improvements
  (a)  much faster on ZFS
  (b)  don't need dedicated slice per OS instance, so it's far 
simpler to have N different OS installs
  (c)  very easy to keep track of which OS instance is installed 
where WITHOUT having to mount each one

  (d)  huge space savings (snapshots save lots of space on upgrades)
(3) much more flexible swap space allocation (no hard-boundary slices)
(4) simpler layout of filesystem partitions, and more flexible in 
changing directory size limits (e.g. /var )
(5) mirroring a boot disk is simple under ZFS - much more complex under 
SVM/UFS

(6) root-pool snapshots make backups trivially easy
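
A few of those points expressed as commands, for illustration (sizes, snapshot 
names and paths are only examples):

zfs set volsize=8G rpool/swap              # (3) resize swap; take it out of use with swap -d first
beadm list                                 # (2c) every installed OS instance, no mounting needed
zfs snapshot -r rpool@backup-20100331      # (6) snapshot the whole root pool
zfs send -R rpool@backup-20100331 | gzip > /net/backuphost/dumps/rpool.zfs.gz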



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] can't destroy snapshot

2010-03-31 Thread Charles Hedrick
So we tried recreating the pool and sending the data again.

1) compression wasn't set on the copy, even though I did send -R, which is 
supposed to send all properties
2) I tried killing the send | receive pipe. Receive couldn't be killed. It hung.
3) This is Solaris Cluster. We tried forcing a failover. The pool mounted on 
the other server without dismounting on the first. zpool list showed it mounted 
on both machines. zpool iostat showed I/O actually occurring on both systems.

Altogether this does not give me a good feeling about ZFS. I'm hoping the 
problem is just with receive and Cluster, and that it works properly on a single 
system, because I'm running a critical database on ZFS on another system.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bit-flipping in RAM...

2010-03-31 Thread Daniel Carosone
On Thu, Apr 01, 2010 at 12:38:29AM +0100, Robert Milkowski wrote:
 So I wasn't saying that it can work or that it can work in all  
 circumstances but rather I was trying to say that it probably shouldn't  
 be dismissed on a performance argument alone as for some use cases 

It would be of great utility even if considered only as a diagnostic
measure - ie, for qualifying tests or when something else raises
suspicion and you want to eliminate/confirm sources of problems. 

With a suitable pointer in a FAQ/troubleshooting guide, it could
reduce the number / improve the quality of problem reports related to
bad h/w. 

--
Dan.


pgp2jYRc6bDBB.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Ross Walker

On Mar 31, 2010, at 5:39 AM, Robert Milkowski mi...@task.gda.pl wrote:




On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
  Use something other than Open/Solaris with ZFS as an NFS  
server?  :)


I don't think you'll find the performance you paid for with ZFS and
Solaris at this time. I've been trying to more than a year, and
watching dozens, if not hundreds of threads.
Getting half-ways decent performance from NFS and ZFS is impossible
unless you disable the ZIL.




Well, for lots of environments disabling ZIL is perfectly acceptable.
And frankly the reason you get better performance out of the box on  
Linux as NFS server is that it actually behaves like with disabled  
ZIL - so disabling ZIL on ZFS for NFS shares is no worse than using  
Linux here or any other OS which behaves in the same manner.  
Actually it makes it better as even if ZIL is disabled ZFS  
filesystem is always consisten on a disk and you still get all the  
other benefits from ZFS.


What would be useful though is to be able to easily disable ZIL per  
dataset instead of OS wide switch.
This feature has already been coded and tested and awaits a formal  
process to be completed in order to get integrated. Should be rather  
sooner than later.


Well, to be fair to Linux, the default for NFS exports is to export them  
'sync' now, which syncs to disk on close or fsync. It has been many  
years since they exported 'async' by default. Now if Linux admins set  
their shares 'async' and lose important data then it's operator error  
and not Linux's fault.


If apps don't care about their data consistency and don't sync their  
data I don't see why the file server has to care for them. I mean if  
it were a local file system and the machine rebooted the data would be  
lost too. Should we care more about data written remotely than locally?
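
For reference, the distinction in Linux /etc/exports terms (the paths and the  
no_subtree_check option are just examples):

/export/home     *(rw,sync,no_subtree_check)    # today's default behaviour, spelled out
/export/scratch  *(rw,async,no_subtree_check)   # fast, but unsynced data dies with the server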


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Richard Elling

On Mar 31, 2010, at 7:11 PM, Ross Walker wrote:

 On Mar 31, 2010, at 5:39 AM, Robert Milkowski mi...@task.gda.pl wrote:
 
 
 On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
  Use something other than Open/Solaris with ZFS as an NFS server?  :)
 
 I don't think you'll find the performance you paid for with ZFS and
 Solaris at this time. I've been trying to more than a year, and
 watching dozens, if not hundreds of threads.
 Getting half-ways decent performance from NFS and ZFS is impossible
 unless you disable the ZIL.
 
 
 
 Well, for lots of environments disabling ZIL is perfectly acceptable.
 And frankly the reason you get better performance out of the box on Linux as 
 NFS server is that it actually behaves like with disabled ZIL - so disabling 
 ZIL on ZFS for NFS shares is no worse than using Linux here or any other OS 
 which behaves in the same manner. Actually it makes it better as even if ZIL 
 is disabled ZFS filesystem is always consisten on a disk and you still get 
 all the other benefits from ZFS.
 
 What would be useful though is to be able to easily disable ZIL per dataset 
 instead of OS wide switch.
 This feature has already been coded and tested and awaits a formal process 
 to be completed in order to get integrated. Should be rather sooner than 
 later.
 
 Well being fair to Linux the default for NFS exports is to export them 'sync' 
 now which syncs to disk on close or fsync. It has been many years since they 
 exported 'async' by default. Now if Linux admins set their shares 'async' and 
 lose important data then it's operator error and not Linux's fault.
 
 If apps don't care about their data consistency and don't sync their data I 
 don't see why the file server has to care for them. I mean if it were a local 
 file system and the machine rebooted the data would be lost too. Should we 
 care more for data written remotely than locally?

This is not true for sync data written locally, unless you disable the ZIL 
locally.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] can't destroy snapshot

2010-03-31 Thread Ian Collins

On 04/ 1/10 02:01 PM, Charles Hedrick wrote:

So we tried recreating the pool and sending the data again.

1) compression wasn't set on the copy, even though I did send -R, which is 
supposed to send all properties
2) I tried killing to send | receive pipe. Receive couldn't be killed. It hung.
   


How long did you wait and how much data had been sent?

Killing a receive can take a (long!) while if it has to free all data 
already written.


--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Ross Walker
On Mar 31, 2010, at 10:25 PM, Richard Elling  
richard.ell...@gmail.com wrote:




On Mar 31, 2010, at 7:11 PM, Ross Walker wrote:

On Mar 31, 2010, at 5:39 AM, Robert Milkowski mi...@task.gda.pl  
wrote:





On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
Use something other than Open/Solaris with ZFS as an NFS  
server?  :)


I don't think you'll find the performance you paid for with ZFS and
Solaris at this time. I've been trying for more than a year, and
watching dozens, if not hundreds of threads.
Getting halfway decent performance from NFS and ZFS is impossible
unless you disable the ZIL.




Well, for lots of environments disabling ZIL is perfectly  
acceptable.
And frankly the reason you get better performance out of the box
on Linux as an NFS server is that it effectively behaves as if the
ZIL were disabled - so disabling the ZIL on ZFS for NFS shares is
no worse than using Linux here or any other OS which behaves in
the same manner. Actually it makes it better, as even if the ZIL
is disabled the ZFS filesystem is always consistent on disk and
you still get all the other benefits of ZFS.


What would be useful though is to be able to easily disable the
ZIL per dataset instead of an OS-wide switch.
This feature has already been coded and tested and awaits a formal
process to be completed in order to get integrated. It should
happen sooner rather than later.


Well, being fair to Linux, the default for NFS exports is to export
them 'sync' now, which syncs to disk on close or fsync. It has been
many years since they exported 'async' by default. Now if Linux
admins set their shares 'async' and lose important data then it's
operator error and not Linux's fault.


If apps don't care about their data consistency and don't sync
their data, I don't see why the file server has to care for them. I
mean, if it were a local file system and the machine rebooted, the
data would be lost too. Should we care more for data written
remotely than locally?


This is not true for sync data written locally, unless you disable  
the ZIL locally.


No, of course, if it's written sync with the ZIL. It just seems that
over Solaris NFS all writes are delayed, not just sync writes.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] can't destroy snapshot

2010-03-31 Thread Charles Hedrick
Ah, I hadn't thought about that. That may be what was happening. Thanks.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] can't destroy snapshot

2010-03-31 Thread Charles Hedrick
So that eliminates one of my concerns. However, the other one is still an issue. 
Presumably Solaris Cluster shouldn't import a pool that's still active on the 
other system. We'll be looking more carefully into that.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] benefits of zfs root over ufs root

2010-03-31 Thread Jason King
On Wed, Mar 31, 2010 at 7:53 PM, Erik Trimble erik.trim...@oracle.com wrote:
 Brett wrote:

 Hi Folks,

 I'm in a shop that's very resistant to change. The management here are
 looking for major justification of a move away from ufs to zfs for root file
 systems. Does anyone know if there are any whitepapers/blogs/discussions
 extolling the benefits of zfsroot over ufsroot?

 Regards in advance
 Rep


 I can't give you any links, but here's a short list of advantages:

 (1) all the standard ZFS advantages over UFS
 (2) LiveUpgrade/beadm related improvements
      (a)  much faster on ZFS
      (b)  don't need dedicated slice per OS instance, so it's far simpler to
 have N different OS installs
      (c)  very easy to keep track of which OS instance is installed where
 WITHOUT having to mount each one
      (d)  huge space savings (snapshots save lots of space on upgrades)
 (3) much more flexible swap space allocation (no hard-boundary slices)
 (4) simpler layout of filesystem partitions, and more flexible in changing
 directory size limits (e.g. /var )
 (5) mirroring a boot disk is simple under ZFS - much more complex under
 SVM/UFS
 (6) root-pool snapshots make backups trivially easy



 --
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA
 Timezone: US/Pacific (GMT-0800)

I don't think 2b is given enough emphasis.  The ability to quickly
clone your root filesystem, apply whatever change you need to (patch,
config change), reboot into the new environment, and be able to
provably back out to the prior state with ease is a lifesaver (yes,
you could do this with ufs, but it assumes you have enough free slices
on your direct attached disks, and it takes _far_ longer simply
because you must copy the entire boot environment first -- adding
probably a few hours, versus the ~1s to snapshot + clone).
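
For anyone who hasn't tried it, the whole cycle on a ZFS root is roughly the
following with beadm (BE names are just examples; on Solaris 10 the
lucreate/luactivate pair does the same job):

   beadm create patch-20100401         # snapshot + clone, about a second
   beadm mount patch-20100401 /mnt
   # (apply the patch or config change against /mnt here)
   beadm umount patch-20100401
   beadm activate patch-20100401       # make it the default boot environment
   init 6
   # if the change misbehaves, activate the previous BE and destroy the clone
   beadm activate previous-be-name
   beadm destroy patch-20100401

And on Erik's point (5), attaching the second half of a root mirror is roughly
this (device names are placeholders; SPARC uses installboot rather than
installgrub):

   zpool attach rpool c0t0d0s0 c0t1d0s0
   # wait for 'zpool status rpool' to report the resilver complete, then put
   # the boot loader on the new half of the mirror (x86):
   installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0

Compare that with the metadb/metainit/metattach/metaroot dance that SVM/UFS
root mirroring requires.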
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] can't destroy snapshot

2010-03-31 Thread Ian Collins

On 04/ 1/10 02:01 PM, Charles Hedrick wrote:

So we tried recreating the pool and sending the data again.

1) compression wasn't set on the copy, even though I did send -R, which is 
supposed to send all properties
   


Was compression explicitly set on the root filesystem of your set?

I don't think compression will be on if the root of a sent filesystem 
tree inherits the property from its parent.  I normally set compression 
on the pool, then explicitly off on any filesystems where it 
isn't appropriate.
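
For example (pool and filesystem names are only illustrative):

   zfs set compression=on tank                # children inherit 'on'
   zfs set compression=off tank/crashdumps    # explicit 'off' where it hurts
   zfs get -r -o name,value,source compression tank

As I understand it, a property whose source shows as 'local' is what send -R
carries across; a value that is merely inherited at the top of the sent tree
ends up following whatever the parent on the receiving side has set.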


--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
 I see the source for some confusion.  On the ZFS Best Practices page:
 http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
 
 It says:
 Failure of the log device may cause the storage pool to be inaccessible
 if
 you are running the Solaris Nevada release prior to build 96 and a
 release
 prior to the Solaris 10 10/09 release.
 
 It also says:
 If a separate log device is not mirrored and the device that contains
 the
 log fails, storing log blocks reverts to the storage pool.

I have some more concrete data on this now.  Running Solaris 10u8 (which is
10/09), fully updated last weekend.  We want to explore the consequences of
adding or failing a non-mirrored log device.  We created a pool with a
non-mirrored ZIL log device.  And experimented with it:

(a)  Simply yank out the non-mirrored log device while the system is live.
The result was:  Any zfs or zpool command would hang permanently.  Even zfs
list hangs permanently.  The system cannot shut down, cannot reboot, cannot
zfs send or zfs snapshot or anything ... It's a bad state.  You're
basically hosed.  Power cycle is the only option.

(b)  After power cycling, the system won't boot.  It gets part way through
the boot process, and eventually just hangs there, infinitely cycling error
messages about services that couldn't start.  Random services, such as
inetd, which seem unrelated to some random data pool that failed.  So we
power cycle again, and go into failsafe mode, to clean up and destroy the
old messed up pool ... Boot up totally clean again, and create a new totally
clean pool with a non-mirrored log device.  Just to ensure we really are
clean, we simply zpool export and zpool import with no trouble, and
reboot once for good measure.  zfs list and everything are all working
great...

(c)  Do a zpool export.  Obviously, the ZIL log device is clean and
flushed at this point, not being used.  We simply yank out the log device,
and do zpool import.  Well ... Without that log device, I forget the
exact terminology, it said something like missing disk.  Plain and simple, you
*can* *not* import the pool without the log device.  It does not suggest
forcing with -f, and even if you specify -f, it still just throws the same
error message, missing disk or whatever.  Won't import.  Period.

...  So, to anybody who said the failed log device will simply fail over to
blocks within the main pool:  Sorry.  That may be true in some later
version, but it is not the slightest bit true in the absolute latest solaris
(proper) available today.

I'm going to venture a guess that this is no longer a problem after zpool
version 19, which is when ZFS log device removal was introduced.

Unfortunately, the latest version of solaris only goes up to zpool version
15.
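
For reference, the test in (c) was essentially this sequence (device names are
placeholders):

   zpool create testpool mirror c1t0d0 c1t1d0 log c1t2d0
   zpool export testpool
   # physically pull the log device (c1t2d0) here
   zpool import testpool      # refuses: the log device is missing
   zpool import -f testpool   # same refusal

If I remember correctly, the builds that picked up pool version 19 also gained
a zpool import -m option for importing with a missing log device, but nothing
like that is available on Solaris 10 10/09.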

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
 A MegaRAID card with write-back cache? It should also be cheaper than
 the F20.

I haven't posted results yet, but I just finished a few weeks of extensive
benchmarking various configurations.  I can say this:

WriteBack cache is much faster than naked disks, but if you can buy an SSD
or two for a ZIL log device, the dedicated ZIL is yet again much faster than
WriteBack.

It doesn't have to be the F20.  You could use the Intel X25, for example.  If
you're running Solaris proper, you had better mirror your ZIL log device.  If
you're running OpenSolaris ... I don't know if that's important.  I'll
probably test it, just to be sure, but I might never get around to it
because I don't have a justifiable business reason to build an OpenSolaris
machine just for this one little test.

Seriously: with all disks configured WriteThrough (spindle and SSD disks
alike), using the dedicated ZIL SSD device is very noticeably faster than
enabling the WriteBack.
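
For what it's worth, adding a mirrored log to an existing pool is a one-liner
(device names are placeholders):

   zpool add tank log mirror c2t0d0 c2t1d0
   zpool status tank     # the log shows up as its own mirrored vdev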

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
 We ran into something similar with these drives in an X4170 that turned
 out to
 be  an issue of the preconfigured logical volumes on the drives. Once
 we made
 sure all of our Sun PCI HBAs where running the exact same version of
 firmware
 and recreated the volumes on new drives arriving from Sun we got back
 into sync
 on the X25-E devices sizes.

Can you elaborate?  Just today, we got the replacement drive that has
precisely the right version of firmware and everything.  Still, when we
plugged in that drive and created a simple volume in the StorageTek RAID
utility, the new drive is 0.001 GB smaller than the old drive.  I'm still
hosed.

Are you saying I might benefit by sticking the SSD into some laptop and
zeroing the disk?  And then attaching it to the Sun server?

Are you saying I might benefit by finding some other way to make the drive
available, instead of using the storagetek raid utility?

Thanks for the suggestions...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Stuart Anderson

On Mar 31, 2010, at 8:58 PM, Edward Ned Harvey wrote:

 We ran into something similar with these drives in an X4170 that turned
 out to
 be  an issue of the preconfigured logical volumes on the drives. Once
 we made
 sure all of our Sun PCI HBAs where running the exact same version of
 firmware
 and recreated the volumes on new drives arriving from Sun we got back
 into sync
 on the X25-E devices sizes.
 
 Can you elaborate?  Just today, we got the replacement drive that has
 precisely the right version of firmware and everything.  Still, when we
 plugged in that drive and created a simple volume in the StorageTek RAID
 utility, the new drive is 0.001 GB smaller than the old drive.  I'm still
 hosed.
 
 Are you saying I might benefit by sticking the SSD into some laptop and
 zeroing the disk?  And then attaching it to the Sun server?
 
 Are you saying I might benefit by finding some other way to make the drive
 available, instead of using the storagetek raid utility?

Assuming you are also using a PCI LSI HBA from Sun that is managed with
a utility called /opt/StorMan/arcconf and reports itself as the amazingly
informative model number Sun STK RAID INT, what worked for me was to run:
arcconf delete (to delete the pre-configured volume shipped on the drive)
arcconf create (to create a new volume)
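
(For the archives, the invocations were along these lines; the controller
number, logical drive number, and the channel/device arguments for create are
just what they happened to be on our box and may vary with your arcconf and
firmware versions, so check arcconf getconfig 1 pd first:)

   /opt/StorMan/arcconf delete 1 logicaldrive 0
   /opt/StorMan/arcconf create 1 logicaldrive max volume 0 0
   /opt/StorMan/arcconf getconfig 1 ld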

What I observed was that
arcconf getconfig 1
would show the same physical device size for our existing drives and new
ones from Sun, but they reported a slightly different logical volume size.
I am fairly sure that was due to the Sun factory creating the initial volume
with a different version of the HBA controller firmware than we were using
to create our own volumes.

If I remember the sign of the difference correctly, the newer firmware creates
larger logical volumes, and you really want to upgrade the firmware if you are
going to be running multiple X25-E drives from the same controller.

I hope that helps.


--
Stuart Anderson  ander...@ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Richard Elling
On Mar 31, 2010, at 9:22 AM, Edward Ned Harvey wrote:

 Would your users be concerned if there was a possibility that
 after extracting a 50 MB tarball that files are incomplete, whole
 subdirectories are missing, or file permissions are incorrect?
 
 Correction:  Would your users be concerned if there was a possibility that,
 after extracting a 50MB tarball *and having a server crash*, files could
 be corrupted as described above.
 
 If you disable the ZIL, the filesystem still stays correct in RAM, and the
 only way you lose any data such as you've described, is to have an
 ungraceful power down or reboot.
 
 The advice I would give is:  Do zfs autosnapshots frequently (say ... every
 5 minutes, keeping the most recent 2 hours of snaps) and then run with no
 ZIL.  If you have an ungraceful shutdown or reboot, rollback to the latest
 snapshot ... and rollback once more for good measure.  As long as you can
 afford to risk 5-10 minutes of the most recent work after a crash, then you
 can get a 10x performance boost most of the time, and no risk of the
 aforementioned data corruption.

This approach does not solve the problem.  When you do a snapshot, 
the txg is committed.  If you wish to reduce the exposure to loss of
sync data and run with ZIL disabled, then you can change the txg commit 
interval -- however changing the txg commit interval will not eliminate the 
possibility of data loss.
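
If you do choose to run that way, the knobs involved are set in /etc/system;
the names below are correct for current builds as far as I recall, but verify
against your release before depending on them:

   * commit a txg at least every 5 seconds instead of the default interval
   set zfs:zfs_txg_timeout = 5
   * run with the ZIL disabled (gives up the sync-write guarantee entirely)
   set zfs:zil_disable = 1

Neither tunable changes the fundamental trade-off described above.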

 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss