Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-11 Thread Lars-Gunnar Persson

I would like to go back to my question for a second:

I checked with my Nexsan supplier and they confirmed that access to  
every single disk in SATABeast is not possible. The smallest entities  
I can create on the SATABeast are RAID 0 or 1 arrays. With RAID 1 I'll
lose too much disk space, and I believe that leaves me with RAID 0 as
the only reasonable option. But with this insecure RAID format I'll
need higher redundancy in the ZFS configuration. I think I'll go with
the following configuration:


On the Nexsan SATABeast:
* 14 disks configured in 7 RAID arrays with RAID level 0 (each disk is  
1 TB which gives me a total of 14 TB raw disk space).

* Each RAID 0 array configured as one volume.

On the Sun Fire X4100 M2 with Solaris 10:
* Add all 7 volumes to one zpool configured as one raidz2 vdev (gives me
approx. 8.8 TB of available disk space; rough command sketch below)
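
Something like this is what I have in mind for the pool creation (just a
sketch; the pool name and the c7t...d0 device names are placeholders for
the seven real LUN WWNs):

zpool create tank raidz2 \
    c7t<LUN1>d0 c7t<LUN2>d0 c7t<LUN3>d0 c7t<LUN4>d0 \
    c7t<LUN5>d0 c7t<LUN6>d0 c7t<LUN7>d0
zpool status tank

zpool status should then show a single raidz2 vdev containing all seven LUNs.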


Any comments or suggestions?

Best regards, Lars-Gunnar Persson

On 11. mars. 2009, at 02.39, Bob Friesenhahn wrote:


On Tue, 10 Mar 2009, A Darren Dunham wrote:


What part isn't true?  ZFS has an independent checksum for the data
block.  But if the data block is spread over multiple disks, then each
of the disks has to be read to verify the checksum.


I interpreted what you said to imply that RAID6 type algorithms were  
being used to validate the data, rather than to correct wrong data.   
I agree that it is necessary to read a full ZFS block in order to  
use the ZFS block checksum.  I also agree that a raidz2 vdev has  
IOPS behavior which is similar to a single disk.



From what I understand, a raidz2 with a very large number of disks
won't use all of the disks to store one ZFS block.  There is a  
maximum number of disks in a stripe which can be supported by the  
ZFS block size.


--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-11 Thread Moore, Joe
Lars-Gunnar Persson wrote:
 I would like to go back to my question for a second:
 
 I checked with my Nexsan supplier and they confirmed that access to
 every single disk in SATABeast is not possible. The smallest entities
 I can create on the SATABeast are RAID 0 or 1 arrays. With RAID 1 I'll
 lose too much disk space, and I believe that leaves me with RAID 0 as
 the only reasonable option. But with this insecure RAID format I'll
 need higher redundancy in the ZFS configuration. I think I'll go with
 the following configuration:
 
 On the Nexsan SATABeast:
 * 14 disks configured in 7 RAID arrays with RAID level 0 (each disk is
 1 TB which gives me a total of 14 TB raw disk space).
 * Each RAID 0 array configured as one volume.

So what the front end will see is 7 disks of 2TB each.

 
 On the Sun Fire X4100 M2 with Solaris 10:
 * Add all 7 volumes to one zpool configured as one raidz2 (gives me
 approx. 8.8 TB of available disk space)

You'll get 5 LUNs worth of space in this config, or 10TB of usable space.
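
Working the numbers (decimal TB, so they won't match df exactly):

  7 LUNs x 2TB        = 14TB raw
  raidz2 parity       =  2 LUNs' worth = 4TB
  usable              =  5 x 2TB = 10TB  (roughly 9.1 TiB)

The ~8.8T you measured with df is presumably that same figure in binary
units, less ZFS metadata and allocation overhead.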

 
 Any comments or suggestions?

Given the hardware constraints (no single-disk volumes allowed) this is a good 
configuration for most purposes.

The advantages/disadvantages are:
. 10TB of usable disk space, out of 14TB purchased.
. At least three hard disk failures are required to lose the ZFS pool.
. Random non-cached read performance will be about 300 IO/sec.
. Sequential reads and writes of the whole ZFS blocksize will be fast (up to 
2000 IO/sec).
. One hard drive failure will cause the used blocks of the 2TB LUN (raid0 pair) 
to be resilvered, even though the other half of the pair is not damaged.  The 
other half of the pair is more likely to fail during the ZFS resilvering 
operation because of increased load.

You'll want to pay special attention to the cache settings on the Nexsan.  You 
earlier showed that the write cache is enabled, but IIRC the array doesn't have 
a nonvolatile (battery-backed) cache.  If that's the case, MAKE SURE it's 
hooked up to a UPS that can support it for the 30 second cache flush timeout on 
the array.  And make sure you don't power it down hard.  I think you want to 
uncheck the ignore FUA setting, so that FUA requests are respected.  My guess 
is that this will cause the array to properly handle the cache_flush requests 
that ZFS uses to ensure data consistency.
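
One way to sanity-check that the flush commands actually reach the disk
driver is a DTrace one-liner (a sketch only; I'm assuming the sd/ssd
routine has SYNCHRONIZE_CACHE somewhere in its name, so adjust the
wildcard if nothing matches on your build):

dtrace -n 'fbt::*SYNCHRONIZE_CACHE*:entry { @[probefunc] = count(); }'

Run some synchronous writes while that is tracing; if nothing is counted
when you stop it, the flushes are being dropped before they ever get to
the array.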

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Lars-Gunnar Persson

How about this configuration?

On the Nexsan SATABeast, add all disks to one RAID 5 or 6 group. Then
on the Nexsan define several smaller volumes and then add those
volumes to a raidz2/raidz zpool?


Could that be a useful configuration? Maybe I'll lose too much space
with a double RAID 5 or 6 configuration? What about performance?


Regards,

Lars-Gunnar Persson

On 10. mars. 2009, at 00.26, Kees Nuyt wrote:


On Mon, 9 Mar 2009 12:06:40 +0100, Lars-Gunnar Persson
lars-gunnar.pers...@nersc.no wrote:


1. On the external disk array, I'm not able to configure JBOD or RAID 0
or 1 with just one disk.


In some arrays it seems to be possible to configure separate
disks by offering the array just one disk in one slot at a
time, and, very important, leaving all other slots empty(!).

Repeat for as many disks as you have, seating each disk in
its own slot, and all other slots empty.

(ok, it's just hear-say, but it might be worth a try with
the first 4 disks or so).
--
 (  Kees Nuyt
 )
c[_]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Lars-Gunnar Persson
I realized that I'll lose too much disk space with the double RAID
configuration suggested below. Agree?


I've done some performance testing with raidz/raidz1 vs raidz2:

bash-3.00# zpool status -v raid5
  pool: raid5
 state: ONLINE
 scrub: none requested
config:

        NAME                               STATE     READ WRITE CKSUM
        raid5                              ONLINE       0     0     0
          raidz1                           ONLINE       0     0     0
            c7t6000402001FC442C609DC5A3d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCA4Ad0  ONLINE       0     0     0
            c7t6000402001FC442C609DCA22d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCABFd0  ONLINE       0     0     0
            c7t6000402001FC442C609DCADBd0  ONLINE       0     0     0
            c7t6000402001FC442C609DCAF8d0  ONLINE       0     0     0
            c7t6000402001FC442C609F0291d0  ONLINE       0     0     0


errors: No known data errors

bash-3.00# zpool list
NAME    SIZE   USED   AVAIL   CAP  HEALTH  ALTROOT
raid5  12.6T   141K   12.6T    0%  ONLINE  -

bash-3.00# df -h /raid5
Filesystem             size   used  avail capacity  Mounted on
raid5                   11T    41K    11T     1%    /raid5

bash-3.00# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:   0
bash-3.00# ./filesync-1 /raid5 1
Time in seconds to create and unlink 1 files with O_DSYNC: 9.871197

bash-3.00# echo zfs_nocacheflush/W1 | mdb -kw
zfs_nocacheflush:   0   =   0x1
bash-3.00# ./filesync-1 /raid5 1
Time in seconds to create and unlink 1 files with O_DSYNC: 7.363303

Then I destroyed the raid5 pool and created a raid6 pool:

bash-3.00# zpool status -v raid6
  pool: raid6
 state: ONLINE
 scrub: none requested

config:

        NAME                               STATE     READ WRITE CKSUM
        raid6                              ONLINE       0     0     0
          raidz2                           ONLINE       0     0     0
            c7t6000402001FC442C609DC5A3d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCA4Ad0  ONLINE       0     0     0
            c7t6000402001FC442C609DCA22d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCABFd0  ONLINE       0     0     0
            c7t6000402001FC442C609DCADBd0  ONLINE       0     0     0
            c7t6000402001FC442C609DCAF8d0  ONLINE       0     0     0
            c7t6000402001FC442C609F0291d0  ONLINE       0     0     0


errors: No known data errors

bash-3.00# zpool list
NAME    SIZE   USED   AVAIL   CAP  HEALTH  ALTROOT
raid6  12.6T   195K   12.6T    0%  ONLINE  -

bash-3.00# df -h /raid6
Filesystem             size   used  avail capacity  Mounted on
raid6                  8.8T    52K   8.8T     1%    /raid6

bash-3.00# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:   0
bash-3.00# ./filesync-1 /raid6 1
Time in seconds to create and unlink 1 files with O_DSYNC: 9.879219

bash-3.00# echo zfs_nocacheflush/W1 | mdb -kw
zfs_nocacheflush:   0   =   0x1
bash-3.00# ./filesync-1 /raid6 1
Time in seconds to create and unlink 1 files with O_DSYNC: 7.560435

My conclusion on raidz1 vs raidz2 would be: no difference in
performance, but a big difference in available disk space.



On 10. mars. 2009, at 09.13, Lars-Gunnar Persson wrote:


How about this configuration?

On the Nexsan SATABeast, add all disks to one RAID 5 or 6 group. Then
on the Nexsan define several smaller volumes and then add those
volumes to a raidz2/raidz zpool?


Could that be a useful configuration? Maybe I'll lose too much
space with a double RAID 5 or 6 configuration? What about performance?


Regards,

Lars-Gunnar Persson

On 10. mars. 2009, at 00.26, Kees Nuyt wrote:


On Mon, 9 Mar 2009 12:06:40 +0100, Lars-Gunnar Persson
lars-gunnar.pers...@nersc.no wrote:

1. On the external disk array, I'm not able to configure JBOD or RAID 0
or 1 with just one disk.


In some arrays it seems to be possible to configure separate
disks by offering the array just one disk in one slot at a
time, and, very important, leaving all other slots empty(!).

Repeat for as many disks as you have, seating each disk in
its own slot, and all other slots empty.

(ok, it's just hear-say, but it might be worth a try with
the first 4 disks or so).
--
(  Kees Nuyt
)
c[_]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org

Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Lars-Gunnar Persson

Test 1:
bash-3.00# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:   0
bash-3.00# ./filesync-1 /raid6 1
Time in seconds to create and unlink 1 files with O_DSYNC:  
292.223081


bash-3.00# echo zfs_nocacheflush/W1 | mdb -kw
zfs_nocacheflush:   0   =   0x1
bash-3.00# ./filesync-1 /raid6 1
Time in seconds to create and unlink 1 files with O_DSYNC:  
288.099066


Test 2:
bash-3.00# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:   0
bash-3.00# ./filesync-1 /raid6 1
Time in seconds to create and unlink 1 files with O_DSYNC: 13.092332

bash-3.00# echo zfs_nocacheflush/W1 | mdb -kw
zfs_nocacheflush:   0   =   0x1
bash-3.00# ./filesync-1 /raid6 1
Time in seconds to create and unlink 1 files with O_DSYNC: 9.591622


Test 3:
bash-3.00# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:   0
bash-3.00# ./filesync-1 /raid6 1
Time in seconds to create and unlink 1 files with O_DSYNC: 9.879219

bash-3.00# echo zfs_nocacheflush/W1 | mdb -kw
zfs_nocacheflush:   0   =   0x1
bash-3.00# ./filesync-1 /raid6 1
Time in seconds to create and unlink 1 files with O_DSYNC: 7.560435


Thank you for your reply. If I make a raidz or a raidz2 on the Solaris
box, will I get enough redundancy?


The Nexsan can't do block-level snapshots.

On 9. mars. 2009, at 18.27, Miles Nordin wrote:

lp == Lars-Gunnar Persson lars-gunnar.pers...@nersc.no  
writes:


   lp Ignore force unit access (FUA) bit: [X]

   lp Any thoughts about this?

run three tests

(1) write cache disabled

(2) write cache enabled, ignore FUA off

(3) write cache enabled, ignore FUA [X]

if all three are the same, either the test is broken or something is  
wrong.


If (2) and (3) are the same, then ZFS is working as expected.  run
with either (2) or (3).

If (1) and (2) are the same, then ZFS doesn't seem to have the
cache-flush changes implemented, or else they aren't sufficient for
your array.  You could look into it more, file a bug, something like
that.

If you're not interested in investigating more and filing a bug in the
last case, then you could just set (3), and do no testing at all.
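
A sketch of what each run might look like, reusing the mdb toggles and
the ./filesync-1 tool from the transcripts above (repeat once per array
cache setting and compare the timings):

echo zfs_nocacheflush/W0 | mdb -kw    # ZFS cache flushes on (default)
./filesync-1 /raid6 1
echo zfs_nocacheflush/W1 | mdb -kw    # flushes suppressed, for comparison
./filesync-1 /raid6 1
echo zfs_nocacheflush/W0 | mdb -kw    # restore the default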

It might be good to do some of the testing just to catch wtf cases
like, ``oh sorry, we didn't sell you a cache.'' or ``the battery is
dead?  how'd that happen?'' but there are so many places for wtf cases
maybe this isn't the one to worry about.

but I'm not sure why you are switching to one big single-disk vdev
with FC if you are having ZFS corruption problems.  I think everyone's
been saying you are more likely to have problems by using FC instead
of direct attached, and also by using FC without vdev redundancy (the
redundancy seems to work around some bugs).  At least the people
reporting lots of lost pools are the ones using FC and iSCSI, who lose
pools during target reboots or SAN outages.

I suppose if the NexSAN can do block-level snapshots, you could
snapshot exported copies of your pool from time to time, and roll back
to a snapshot if ZFS refuses to import your ``corrupt'' pool.  In that
way it could help you?



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Tim
On Tue, Mar 10, 2009 at 3:13 AM, Lars-Gunnar Persson 
lars-gunnar.pers...@nersc.no wrote:

 How about this configuration?

 On the Nexsan SATABeast, add all disks to one RAID 5 or 6 group. Then on the
 Nexsan define several smaller volumes and then add those volumes to a
 raidz2/raidz zpool?

 Could that be a useful configuration? Maybe I'll lose too much space with
 a double RAID 5 or 6 configuration? What about performance?



Bad idea.  The probability of a double disk failure in this scenario is
relatively high (48 disks, one or two parity drives), and when it happens,
you'll lose ALL data.

I don't quite follow the reasoning behind putting software raid on top of
your hardware raid either.  Either you trust you have a solid raid solution
on the back end or you don't.  If you don't, sell it and buy something else.
If you do, put the zfs filesystem directly on top without software raid and
be done with it.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Moore, Joe
Bob Friesenhahn wrote:
 Your idea to stripe two disks per LUN should work.  Make sure to use
 raidz2 rather than plain raidz for the extra reliability.  This
 solution is optimized for high data throughput from one user.

Striping two disks per LUN (RAID0 on 2 disks) and then adding a ZFS form of 
redundancy (either mirror or raidz[2]) would be an efficient use of space.  
There would be no additional space overhead caused by running that way.

Note, however, that if you do this, ZFS must resilver the larger LUN in the 
event of a single disk failure on the backend.  This means a longer time to 
rebuild, and a lot of extra work on the other (non-failed) half of the RAID0 
stripe.

 
 An alternative is to create individual RAID 0 LUNs which actually
 only contain a single disk.  

This is certainly preferable, since the unit of failure at the hardware level 
corresponds to the unit of resilvering at the ZFS level.  And at least on my 
Nexsan SATAboy(2f) this configuration is possible.

 Then implement the pool as two raidz2s
 with six LUNs each, and two hot spares.  That would be my own
 preference.  Due to ZFS's load share this should provide better
 performance (perhaps 2X) for multi-user loads.  Some testing may be
 required to make sure that your hardware is happy with this.

I disagree with this suggestion.  With this config, you only get 8 disks worth 
of storage, out of the 14, which is a ~42% overhead.  In order to lose data in 
this scenario, 3 disks would have to fail out of a single 6-disk group before 
zfs is able to resilver any of them to the hot spares.  That seems (to me) a 
lot more redundancy than is needed.

As far as workload, any time you use RAIDZ[2], ZFS must read the entire stripe 
(across all of the disks) in order to verify the checksum for that data block.  
This means that a 128k read (the default zfs blocksize) requires a 32kb read 
from each of 6 disks, which may include a relatively slow seek to the relevant 
part of the spinning rust.  So for random I/O, even though the data is striped 
across all the disks, you will see only a single disk's worth of throughput.
For sequential I/O, you'll see the full RAID set's worth of throughput.
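
Following the same arithmetic for the 7-LUN raidz2 proposed earlier in
this thread (5 data + 2 parity LUNs), each 128KB block works out to
roughly 128KB / 5 = ~26KB per data LUN, and the block checksum can't be
verified until the slowest of those reads has completed.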

If you are expecting a non-sequential workload, you would be better off taking 
the 50% storage overhead to do ZFS mirroring.

 
 Avoid RAID5 if you can because it is not as reliable with today's
 large disks and the resulting huge LUN size can take a long time to
 resilver if the RAID5 should fail (or be considered to have failed).

Here's a place that ZFS shines: it doesn't resilver the whole disk, just the 
data blocks.  So it doesn't have to read the full array to rebuild a failed 
disk, so it's less likely to cause a subsequent failure during parity rebuild.

My $.02.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Bob Friesenhahn

On Tue, 10 Mar 2009, Lars-Gunnar Persson wrote:


My conclusion on raidz1 vs raidz2 would be: no difference in performance, but
a big difference in available disk space.


I am not so sure about the big difference in disk space available. 
Disk capacity is cheap, but failure is not.


If you need to make up the difference in disk space, then use raidz2 
and don't allocate any spare disks.  Just make sure that you have a 
spare disk drive handy, or will be able to purchase one in a 
reasonable amount of time.
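
When a drive does fail, the replacement itself is a one-liner (a sketch;
the pool name is the one from the earlier transcripts and the device
names are placeholders for the failed LUN and its replacement):

zpool replace raid6 c7t<failedLUN>d0 c7t<newLUN>d0
zpool status raid6      # watch the resilver progress

If you later decide you want a hot spare after all, 'zpool add raid6
spare c7t<spareLUN>d0' will attach one.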


With raidz1 or RAID5 you are left feeling naked and exposed as soon as 
one disk fails, and you realize that the remaining disk drives will 
need to work perfectly in order to preserve your data.  With raidz2, 
losing a single disk drive does not leave you naked and exposed.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Richard Elling

[amplification of Joe's point below...]

Moore, Joe wrote:

Bob Friesenhahn wrote:
  

Your idea to stripe two disks per LUN should work.  Make sure to use
raidz2 rather than plain raidz for the extra reliability.  This
solution is optimized for high data throughput from one user.



Striping two disks per LUN (RAID0 on 2 disks) and then adding a ZFS form of 
redundancy (either mirror or raidz[2]) would be an efficient use of space.  
There would be no additional space overhead caused by running that way.
  


It would also reduce your per-vdev MTBF by half.  In general, better
reliability at the vdev level is a good thing.  For example, consider the
case where we have 6 same-sized disks.  We can configure them in two
different ways using 2+1 RAID-5 sets:

   configuration    MTTDL[1]
   -------------------------
   RAID-5+0          188,297
   RAID-0+5           94,149

The MTTDL[1] model does consider MTTR, which is a combination of
the logistical response time and reconstruction time. Unless you
have zero response time and reconstruction time, RAID-0+5 is not
as good as RAID-5+0.
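
For reference, the textbook single-parity approximation behind this kind
of number (my recollection of the MTTDL[1] model, not necessarily the
exact formula used for the figures above) is

   MTTDL[1] ~= MTBF^2 / (N * (N-1) * MTTR)

for an N-disk set, so halving MTTR roughly doubles MTTDL, which is why
the logistics and reconstruction times matter so much.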

Note, however, that if you do this, ZFS must resilver the larger LUN in the event of a 
single disk failure on the backend.  This means a longer time to rebuild, and a lot of 
extra work on the other (non-failed) half of the RAID0 stripe.
  


ZFS resilver tends to be I/O bound in one of two ways: bandwidth on the
resilvering vdev and iops on the surviving vdevs.  You might consider
this when you use hardware RAID vdevs.
-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Bob Friesenhahn

On Tue, 10 Mar 2009, Moore, Joe wrote:

As far as workload, any time you use RAIDZ[2], ZFS must read the 
entire stripe (across all of the disks) in order to verify the 
checksum for that data block.  This means that a 128k read (the 
default zfs blocksize) requires a 32kb read from each of 6 disks, 
which may include a relatively slow seek to the relevant part of the 
spinning rust.  So for random I/O, even though the data is striped


This is not quite true.  Raidz2 is not the same as RAID6.  ZFS has an 
independent checksum for its data blocks.  The traditional RAID type 
technology is used to repair in case data corruption is detected.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Mattias Pantzare
On Tue, Mar 10, 2009 at 23:57, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Tue, 10 Mar 2009, Moore, Joe wrote:

 As far as workload, any time you use RAIDZ[2], ZFS must read the entire
 stripe (across all of the disks) in order to verify the checksum for that
 data block.  This means that a 128k read (the default zfs blocksize)
 requires a 32kb read from each of 6 disks, which may include a relatively
 slow seek to the relevant part of the spinning rust.  So for random I/O,
 even though the data is striped

 This is not quite true.  Raidz2 is not the same as RAID6.  ZFS has an
 independent checksum for its data blocks.  The traditional RAID type
 technology is used to repair in case data corruption is detected.

What he is saying is true. RAIDZ will spread a block over all the disks, and
therefore requires full-stripe reads to read the block. The good thing
is that it will always do full-stripe writes, so writes are fast.

RAID6 has no such blocks, so you can read any sector by reading from one disk;
you only have to read from the other disks in the stripe in case of a
fault.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread A Darren Dunham
On Tue, Mar 10, 2009 at 05:57:16PM -0500, Bob Friesenhahn wrote:
 On Tue, 10 Mar 2009, Moore, Joe wrote:
 
 As far as workload, any time you use RAIDZ[2], ZFS must read the 
 entire stripe (across all of the disks) in order to verify the 
 checksum for that data block.  This means that a 128k read (the 
 default zfs blocksize) requires a 32kb read from each of 6 disks, 
 which may include a relatively slow seek to the relevant part of the 
 spinning rust.  So for random I/O, even though the data is striped
 
 This is not quite true.  Raidz2 is not the same as RAID6.  ZFS has an 
 independent checksum for its data blocks.  The traditional RAID type 
 technology is used to repair in case data corruption is detected.

What part isn't true?  ZFS has an independent checksum for the data
block.  But if the data block is spread over multiple disks, then each
of the disks has to be read to verify the checksum.

-- 
Darren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Bob Friesenhahn

On Tue, 10 Mar 2009, A Darren Dunham wrote:


What part isn't true?  ZFS has an independent checksum for the data
block.  But if the data block is spread over multiple disks, then each
of the disks has to be read to verify the checksum.


I interpreted what you said to imply that RAID6 type algorithms were 
being used to validate the data, rather than to correct wrong data.  I 
agree that it is necessary to read a full ZFS block in order to use 
the ZFS block checksum.  I also agree that a raidz2 vdev has IOPS 
behavior which is similar to a single disk.


From what I understand, a raidz2 with a very large number of disks 
won't use all of the disks to store one ZFS block.  There is a maximum 
number of disks in a stripe which can be supported by the ZFS block 
size.


--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Nexsan SATABeast and ZFS

2009-03-09 Thread Lars-Gunnar Persson
I'm trying to implement a Nexsan SATABeast (an external disk array,  
read more: http://www.nexsan.com/satabeast.php, 14 disks available)  
with a Sun Fire X4100 M2 server running Solaris 10 u6 (connected via  
fiber) and have a couple of questions:


(My motivation for this is the corrupted ZFS volume discussion I had  
earlier with no result, and this time I'm trying to make a more robust  
implementation)


1. On the external disk array, I'm not able to configure JBOD or RAID 0
or 1 with just one disk. I can't find any options for my Solaris
server to access the disks directly, so I have to configure some RAIDs
on the SATABeast. I was thinking of striping two disks in each RAID
and then adding all 7 RAIDs to one zpool as a raidz. The problem with
this is that if one disk breaks down, I'll lose one RAID 0 volume, but
maybe ZFS can handle this? Should I rather implement RAID5 volumes on
the SATABeast and then export them to the Solaris machine? 14 disks
would give me 4 RAID5 volumes and 2 spare disks? I'll lose a lot of disk
space. What about creating larger RAID volumes on the SATABeast? Like 3
RAID volumes, with 5 disks in two of the RAIDs and 4 disks in one? I'm
really not sure what to choose ... At the moment I've striped two
disks in one RAID volume.


2. After reading about cache flushes in the ZFS Evil Tuning Guide
(http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes)
I checked the cache configuration on the SATABeast and I can change these settings:


System Admin
Configure Cache

Cache Configuration
Current write cache state: Enabled, FUA ignored - 495 MB
Manually override current write cache status: [ ] Force write cache to  
Disabled

Desired write cache state: [X] Enabled [ ] Disabled
Allow attached host to override write cache configuration: [ ]
Ignore force unit access (FUA) bit: [X]
Write cache streaming mode: [ ]
Cache optimization setting:
 [ ] Random access
 [X] Mixed sequential/random
 [ ] Sequential access

And from the help section:

Write cache will normally speed up host writes; data is buffered in
the RAID controller's memory when the installed disk drives are not
ready to accept the write data. The RAID controller's write cache
memory is battery backed; this allows any unwritten array data to be
kept intact during a power failure situation. When power is restored
this battery-backed data will be flushed out to the RAID array.


Current write cache state - This is the current state of the write  
cache that the RAID system is using.


Manually override current write cache status - This allows the write  
caching to be forced on or off by the user, this change will take  
effect immediately.


Desired write cache state - This is the state of the write cache the  
user wishes to have after boot up.


Allow attached host to override write cache configuration - This  
allows the host system software to issue commands to the RAID system  
via the host interface that will either turn off or on the write  
caching.


Ignore force unit access (FUA) bit - When the force unit access
(FUA) bit is set by a host system on a per-command basis, data is
written/read directly to/from the disks without using the
onboard cache. This will incur a time overhead, but guarantees the
data is on the media. Set this option to force the controller to
ignore the FUA bit so that command execution times are more
consistent.


Write cache streaming mode - When the write cache is configured in  
streaming mode (check box ticked), the system continuously flushes  
the cache (it runs empty). This provides maximum cache buffering to  
protect against raid system delays adversely affecting command  
response times to the host.
When the write cache operates in non-streaming mode (check box not  
ticked) the system runs with a full write cache to maximise cache  
hits and maximise random IO performance.


Cache optimization setting - The cache optimization setting adjusts
the cache behaviour to maximize performance for the expected host
I/O pattern.


Note that the write cache will be flushed 5 seconds after the last  
host write. It is recommended that all host activity is stopped 30  
seconds before powering the system off.


Any thoughts about this?

Regards,

Lars-Gunnar Persson
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-09 Thread Bob Friesenhahn

On Mon, 9 Mar 2009, Lars-Gunnar Persson wrote:


1. On the external disk array, I'm not able to configure JBOD or RAID 0 or 1
with just one disk. I can't find any options for my Solaris server to access
the disks directly, so I have to configure some RAIDs on the SATABeast. I was
thinking of striping two disks in each RAID and then adding all 7 RAIDs to one
zpool as a raidz. The problem with this is that if one disk breaks down, I'll
lose one RAID 0 volume, but maybe ZFS can handle this? Should I rather
implement RAID5 volumes on the SATABeast and then export them to the Solaris
machine? 14 disks would give me 4 RAID5 volumes and 2 spare disks? I'll lose
a lot of disk space. What about creating larger RAID volumes on the SATABeast?
Like 3 RAID volumes, with 5 disks in two of the RAIDs and 4 disks in one? I'm
really not sure what to choose ... At the moment I've striped two disks in
one RAID volume.


Your idea to stripe two disks per LUN should work.  Make sure to use 
raidz2 rather than plain raidz for the extra reliability.  This 
solution is optimized for high data throughput from one user.


An alternative is to create individual RAID 0 LUNs which actually 
only contain a single disk.  Then implement the pool as two raidz2s 
with six LUNs each, and two hot spares.  That would be my own 
preference.  Due to ZFS's load share this should provide better 
performance (perhaps 2X) for multi-user loads.  Some testing may be 
required to make sure that your hardware is happy with this.
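
A rough sketch of that layout, assuming the array exports 14 single-disk
LUNs (the pool name and c7t...d0 names are placeholders):

zpool create tank \
    raidz2 c7t<d1>d0 c7t<d2>d0 c7t<d3>d0 c7t<d4>d0 c7t<d5>d0 c7t<d6>d0 \
    raidz2 c7t<d7>d0 c7t<d8>d0 c7t<d9>d0 c7t<d10>d0 c7t<d11>d0 c7t<d12>d0 \
    spare c7t<d13>d0 c7t<d14>d0

ZFS then stripes writes across the two raidz2 vdevs, which is where the
multi-user gain comes from.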


Avoid RAID5 if you can because it is not as reliable with today's 
large disks and the resulting huge LUN size can take a long time to 
resilver if the RAID5 should fail (or be considered to have failed). 
There is also the issue that a RAID array bug might cause transient 
wrong data to be returned and this could cause confusion for ZFS's own 
diagnostics/repair and result in useless repairs.  If ZFS reports a 
problem but the RAID array says that the data is fine, then there is 
confusion, finger-pointing, and likely a post to this list.  If you 
are already using ZFS, then you might as well use ZFS for most of the 
error detection/correction as well.
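
With ZFS doing the checking, a periodic scrub is the usual way to
exercise it (pool name is just an example):

zpool scrub tank
zpool status tank     # reports scrub progress and any checksum errors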


These are my own opinions and others will surely differ.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-09 Thread Frank Cusack
On March 9, 2009 12:06:40 PM +0100 Lars-Gunnar Persson 
lars-gunnar.pers...@nersc.no wrote:

I'm trying to implement a Nexsan SATABeast

...

1. On the external disk array, I'm not able to configure JBOD or RAID 0 or
1 with just one disk.


exactly why I didn't buy this product.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-09 Thread Kees Nuyt
On Mon, 9 Mar 2009 12:06:40 +0100, Lars-Gunnar Persson
lars-gunnar.pers...@nersc.no wrote:

1. On the external disk array, I'm not able to configure JBOD or RAID 0
or 1 with just one disk.

In some arrays it seems to be possible to configure separate
disks by offering the array just one disk in one slot at a
time, and, very important, leaving all other slots empty(!).

Repeat for as many disks as you have, seating each disk in
its own slot, and all other slots empty.

(ok, it's just hear-say, but it might be worth a try with
the first 4 disks or so).
-- 
  (  Kees Nuyt
  )
c[_]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss