Re: [zfs-discuss] zfs-discuss Digest, Vol 59, Issue 13

2010-09-09 Thread Dr. Martin Mundschenk

Am 09.09.2010 um 07:00 schrieb zfs-discuss-requ...@opensolaris.org:

 What's the write workload like?  You could try disabling the ZIL to see
 if that makes a difference.  If it does, the addition of an SSD-based
 ZIL / slog device would most certainly help.
 
 Maybe you could describe the makeup of your zpool as well?
 
 Ray


The zpool is a mirrored root pool (two 250GB SATA devices). The box is a Dell PE 
T710. When I copy via NFS, zpool iostat reports 4MB/sec throughout the copy. 
When I copy via scp, I get a network throughput of about 50 MB/sec, and zpool 
iostat reports 105 MB/sec for a short interval about 5 seconds after the scp 
completes.

As far as I can tell, the problem is the NFS commit, which forces the 
filesystem to write data directly to disk instead of caching the data stream, 
as is done in the scp case.
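A rough way to confirm that synchronous NFS commits are the culprit is to
temporarily take the ZIL out of the picture and rerun the copy. This is only a
test sketch under assumptions (the dataset name is a placeholder, the per-dataset
sync property only exists on recent builds), and the ZIL should never stay
disabled on data you care about:

  # Older releases: global tunable, takes effect after a reboot
  echo "set zfs:zil_disable = 1" >> /etc/system
  # Newer builds: per-dataset property, takes effect immediately
  zfs set sync=disabled rpool/export/home   # hypothetical dataset
  # ... rerun the NFS copy and watch zpool iostat, then restore:
  zfs set sync=standard rpool/export/home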

NFS existed long before SSD-based drives did. I cannot imagine that NFS 
performance was ever limited to no more than 1/3 of the speed of a 10BaseT 
connection before...

Martin
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Erik Trimble

 On 9/8/2010 10:08 PM, Freddie Cash wrote:

On Wed, Sep 8, 2010 at 6:27 AM, Edward Ned Harvey sh...@nedharvey.com wrote:

Both of the above situations resilver in equal time, unless there is a bus
bottleneck.  21 disks in a single raidz3 will resilver just as fast as 7
disks in a raidz1, as long as you are avoiding the bus bottleneck.  But 21
disks in a single raidz3 provides better redundancy than 3 vdev's each
containing a 7 disk raidz1.

No, it (21-disk raidz3 vdev) most certainly will not resilver in the
same amount of time.  In fact, I highly doubt it would resilver at
all.

My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB
Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE
multilane controllers.  Nice 10 TB storage pool.  Worked beautifully as
we filled it with data.  Had less than 50% usage when a disk died.

No problem, it's ZFS, it's meant to be easy to replace a drive, just
offline, swap, replace, wait for it to resilver.

Well, 3 days later, it was still under 10%, and every disk light was
still solid green.  SNMP showed over 100 MB/s of disk I/O continuously,
and the box was basically unusable (5 minutes to get the password prompt
to appear on the console).

Tried rebooting a few times, stopped all disk I/O to the machine (it
was our backups box, running rsync every night for - at the time - 50+
remote servers), let it do its thing.

After 3 weeks of trying to get the resilver to complete (or even reach
50%), we pulled the plug and destroyed the pool, rebuilding it using
3x 8-drive raidz2 vdevs.  Things have been a lot smoother ever since.
Have replaced 8 of the drives (1 vdev) with 1.5 TB drives.  Have
replaced multiple dead drives.  Resilvers, while running outgoing
rsync all day and incoming rsync all night, take 3 days for a 1.5 TB
drive (with SNMP showing 300 MB/s disk I/O).

You most definitely do not want to use a single super-wide raidz vdev.
  It just won't work.


Instead of the Best Practices Guide saying "Don't put more than ___ disks
into a single vdev", the BPG should say "Avoid the bus bandwidth bottleneck
by constructing your vdev's using physical disks which are distributed
across multiple buses, as necessary per the speed of your disks and buses."

Yeah, I still don't buy it.  Even spreading disks out such that you
have 4 SATA drives per PCI-X/PCIe bus, I don't think you'd be able to
get a 500 GB SATA disk to resilver in a 24-disk raidz vdev (even a
raidz1) in a 50% full pool.  Especially if you are using the pool for
anything at the same time.




The thing that folks tend to forget is that RaidZ is IOPS limited.  For 
the most part, if I want to reconstruct a single slab (stripe) of data, 
I have to issue a read to EACH disk in the vdev, and wait for every one 
of them to return its portion, before I can write the reconstructed value 
out to the disk under reconstruction.


This is *regardless* of the amount of data being reconstructed.

So, the bottleneck tends to be the IOPS capacity of the single disk being 
reconstructed.  Thus, having fewer disks in a vdev means each I/O rebuilds 
a larger chunk of the failed disk, which leads to fewer IOPS being 
required to finish the resilver.



Example (for ease of calculation, let's do the disk-drive mfg's cheat of 
1k = 1000 bytes):


Scenario 1:  I have 5 1TB disks in a raidz1, and I assume I have 128k 
slab sizes.  Thus, I have 32k of data for each slab written to each 
disk (4x32k data + 32k parity for a 128k slab size).  So, each IOPS 
gets to reconstruct 32k of data on the failed drive.  It thus takes 
about 1TB/32k = 31e6 IOPS to reconstruct the full 1TB drive.


Scenario 2:  I have 10 1TB drives in a raidz1, with the same 128k slab 
sizes.  In this case, there's only about 14k of data on each drive for a 
slab.  This means each IOPS to the failed drive only writes 14k.  So, it 
takes 1TB/14k = 71e6 IOPS to complete.



From this, it is pretty easy to see that the number of required 
IOPS to the resilvered disk goes up linearly with the number of data 
drives in a vdev.  Since you're always going to be IOPS-bound by the 
single disk being resilvered, you have a fixed limit.
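For anyone who wants to redo this arithmetic with their own numbers, here is a
minimal shell/awk sketch of the estimate (the function name is made up; a
ballpark 100 IOPS for a 7200RPM SATA drive and the 1k = 1000 cheat are the
assumptions being carried along):

  resilver_estimate() {
      # $1 = data disks per vdev, $2 = slab size in bytes,
      # $3 = drive size in TB, $4 = sustained IOPS of the resilvering drive
      awk -v n="$1" -v slab="$2" -v tb="$3" -v iops="$4" 'BEGIN {
          per_io = slab / n               # bytes rebuilt on the new disk per I/O
          ops    = (tb * 1e12) / per_io   # I/Os needed to rewrite the whole disk
          printf "%.1fe6 IOPS, roughly %.0f hours at %s IOPS\n",
                 ops / 1e6, ops / iops / 3600, iops
      }'
  }
  resilver_estimate 4 128000 1 100   # Scenario 1: 5-disk raidz1
  resilver_estimate 9 128000 1 100   # Scenario 2: 10-disk raidz1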


In addition, remember that having more disks means you have to wait 
longer for each IOPS to complete.  That is, it takes longer 
(fractionally, but in the aggregate, a measurable amount) for 9 drives to 
each return 14k of info than it does for 4 drives to return 32k of 
data.  This is due to rotational and seek access delays.  So, not only 
are you having to do more total IOPS in Scenario 2, but each IOPS takes 
longer to complete (the read cycle taking longer, the write/reconstruct 
cycle taking the same amount of time).




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Erik Trimble

 On 9/9/2010 2:15 AM, taemun wrote:
Erik: does that mean that keeping the number of data drives in a 
raidz(n) to a power of two is better? In the example you gave, you 
mentioned 14kb being written to each drive. That doesn't sound very 
efficient to me.


(when I say the above, I mean a five disk raidz or a ten disk raidz2, etc)

Cheers,



Well, since the size of a slab can vary (from 512 bytes to 128k), it's 
hard to say; the length (size) of the slab is probably the better 
determinant.  Remember, each block on a hard drive is 512 bytes (for 
now).  So, it's really not any more efficient to write 16k than 14k (or 
vice versa); both are integer multiples of 512 bytes.


IIRC, there was something about using a power-of-two number of data 
drives in a RAIDZ, but I can't remember what that was. It may just be a 
phantom memory.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] performance leakage when copy huge data

2010-09-09 Thread Tomas Ögren
On 08 September, 2010 - Fei Xu sent me these 5,9K bytes:

 I dig deeper into it and might find some useful information.
 I attached an X25 SSD for ZIL to see if it helps.  but no luck.
 I ran iostat -xnz for more details and got an interesting result, as 
 below (maybe too long).
 Some explanation:
 1. c2d0 is the SSD for the ZIL
 2. c0t3d0, c0t20d0, c0t21d0, c0t22d0 are the source pool.
...
                     extended device statistics
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
     0.3    0.0    1.2    0.0  0.0  0.0    0.0    0.1   0   0 c2d0
     0.1   17.7    0.1   51.7  0.0  0.1    0.2    4.1   0   7 c3d0
     0.1    2.1    0.0   79.8  0.0  0.0    0.1    4.0   0   0 c0t2d0
     0.2    0.0    7.1    0.0  0.1  2.3  278.5 11365.1   1  46 c0t3d0

Service time here is crap. 11 seconds to reply.

     0.1    2.2    0.0   79.9  0.0  0.0    0.1    3.7   0   0 c0t5d0
     0.1    2.3    0.0   80.0  0.0  0.0    0.1    9.2   0   0 c0t6d0
     0.1    2.5    0.0   80.1  0.0  0.0    0.1    3.8   0   0 c0t10d0
     0.1    2.4    0.0   80.0  0.0  0.0    0.1    9.5   0   0 c0t11d0
     1.9    0.0  133.0    0.0  0.1  2.8   60.2 1520.6   2  51 c0t20d0

1.5 seconds to reply. crap.

                     extended device statistics
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
...
     0.7    0.0   39.1    0.0  0.0  0.6   64.0  884.1   1  10 c0t3d0
...
     2.1    0.0  135.8    0.0  0.1  5.2   67.8 2498.1   3  88 c0t21d0
...
                     extended device statistics
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
...
     3.5    0.0  246.8    0.0  0.0  0.8    6.3  229.8   1  20 c0t3d0
...
     0.7    0.0   29.2    0.0  0.0  0.6    0.0  911.0   0  12 c0t21d0
     1.9    0.0  138.7    0.0  0.1  4.7   73.0 2428.6   2  66 c0t22d0
...

Service times here are crap. Disks are malfunctioning in some way. If
your source disks can take seconds (or 10+ seconds) to reply, then of
course your copy will be slow. Disk is probably having a hard time
reading the data or something.
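A quick, hedged way to keep watching for such outliers while the copy runs
(asvc_t is the 8th column of iostat -xnz output; the 1000 ms threshold is an
arbitrary choice):

  iostat -xnz 5 | awk '$8+0 > 1000'   # print any device averaging over 1 second per I/O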

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] performance leakage when copy huge data

2010-09-09 Thread Fei Xu
 
 Service times here are crap. Disks are malfunctioning in some way. If
 your source disks can take seconds (or 10+ seconds) to reply, then of
 course your copy will be slow. Disk is probably having a hard time
 reading the data or something.
 


Yeah, that should not go over 15ms.  I just cannot understand why it starts out 
fine, with hundreds of GB of files transferred, and then suddenly falls asleep.
By the way, the WDIDLE timer, which might otherwise cause some issues, is already 
disabled.  I've changed to another system to test ZFS send between an 8*1TB pool 
and a 4*1TB pool.  Hope everything's OK in this case.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread hatish
Very interesting...

Well, let's see if we can do the numbers for my setup.

From a previous post of mine:

[i]This is my exact breakdown (cheap disks on a cheap bus :P) :

PCI-E 8X 4-port ESata Raid Controller.
4 x ESata to 5Sata Port multipliers (each connected to an ESata port on the 
controller).
20 x Samsung 1TB HDD's (each connected to a Port Multiplier).

The PCIE 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata 
port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can 
give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, 
max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and 
you get 10Gbps, which is 333% of the PM's capability. So the drives aren't 
likely to hit max read speed for long lengths of time, especially during 
rebuild time.

So the bus is going to be quite a bottleneck. Let's assume that the drives are 
80% full. That's 800GB that needs to be read on each drive, which is (800x9) 
7.2TB.
Best case scenario, we can read 7.2TB at 3Gbps
= 57.6 Tb at 3Gbps
= 57600 Gb at 3Gbps
= 19200 seconds
= 320 minutes
= 5 hours 20 minutes.

Even if it takes twice that amount of time, I'm happy.

Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe 
split it as wide as I can ([2 data disks per PM] x 2, [2 data disks + 1 parity 
disk per PM] x 2) for each vdev. It'll give the best possible speed, but still 
won't max out the HDD's.

I've never actually sat and done the math before. Hope it's decently accurate 
:)[/i]

My scenario, as from Erik's post:
Scenario: I have 10 1TB disks in a raidz2, and I have 128k
slab sizes. Thus, I have 16k of data for each slab written to each
disk (8x16k data + 32k parity for a 128k slab size). So, each IOPS
gets to reconstruct 16k of data on the failed drive. It thus takes
about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.

Let's assume the drives are at 95% capacity, which is a pretty bad scenario. So 
that's 7600GB, which is 60800Gb. There will be no other IO while a rebuild is 
going.
Best case: I'll read at 12Gbps and write at 3Gbps (4:1). I read 128K for every 
16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb 
@ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read 
of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. That is a 
more realistic time to read 7.6TB.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Freddie Cash
 
 No, it (21-disk raidz3 vdev) most certainly will not resilver in the
 same amount of time.  In fact, I highly doubt it would resilver at
 all.
 
 My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB
 Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE
 multilane controllers.  Nice 10 TB storage pool.  Worked beautifully as
 we filled it with data.  Had less than 50% usage when a disk died.
 
 No problem, it's ZFS, it's meant to be easy to replace a drive, just
 offline, swap, replace, wait for it to resilver.
 
 Well, 3 days later, it was still under 10%, and every disk light was
 still solid green.  SNMP showed over 100 MB/s of disk I/O continuously,

I don't believe your situation is typical.  I think you either encountered a 
bug, or you had something happening that you weren't aware of (scrub, 
autosnapshots, etc) ... because the only time I've ever seen anything remotely 
similar to the behavior you described was the bug I've mentioned in other 
emails, which occurs when the disk is 100% full and a scrub is taking place.

I know it's not the same bug for you, because you said your pool was only 50% 
full.  But I don't believe that what you saw was normal or typical.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Erik Trimble

 On 9/9/2010 5:49 AM, hatish wrote:

Very interesting...

Well, let's see if we can do the numbers for my setup.

 From a previous post of mine:

[i]This is my exact breakdown (cheap disks on a cheap bus :P) :

PCI-E 8X 4-port ESata Raid Controller.
4 x ESata to 5Sata Port multipliers (each connected to an ESata port on the 
controller).
20 x Samsung 1TB HDD's (each connected to a Port Multiplier).

The PCIE 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata 
port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can 
give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, 
max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and 
you get 10Gbps, which is 333% of the PM's capability. So the drives aren't 
likely to hit max read speed for long lengths of time, especially during 
rebuild time.

So the bus is going to be quite a bottleneck. Let's assume that the drives are 
80% full. That's 800GB that needs to be read on each drive, which is (800x9) 
7.2TB.
Best case scenario, we can read 7.2TB at 3Gbps
= 57.6 Tb at 3Gbps
= 57600 Gb at 3Gbps
= 19200 seconds
= 320 minutes
= 5 hours 20 minutes.

Even if it takes twice that amount of time, I'm happy.

Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe 
split it as wide as I can ([2 data disks per PM] x 2, [2 data disks + 1 parity 
disk per PM] x 2) for each vdev. It'll give the best possible speed, but still 
won't max out the HDD's.

I've never actually sat and done the math before. Hope it's decently accurate 
:)[/i]

My scenario, as from Erik's post:
Scenario: I have 10 1TB disks in a raidz2, and I have 128k
slab sizes. Thus, I have 16k of data for each slab written to each
disk (8x16k data + 32k parity for a 128k slab size). So, each IOPS
gets to reconstruct 16k of data on the failed drive. It thus takes
about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.

Let's assume the drives are at 95% capacity, which is a pretty bad scenario. So 
that's 7600GB, which is 60800Gb. There will be no other IO while a rebuild is 
going.
Best case: I'll read at 12Gbps and write at 3Gbps (4:1). I read 128K for every 
16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb 
@ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read 
of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. That is a 
more realistic time to read 7.6TB.



Actually, your biggest bottleneck will be the IOPS limits of the 
drives.  A 7200RPM SATA drive tops out at 100 IOPS.  Yup. That's it.


So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100 
IOPS, that means you will finish (best case) in 62.5e4 seconds.  Which 
is over 173 hours. Or, about 7.25 WEEKS.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Will Murnane
On Thu, Sep 9, 2010 at 09:03, Erik Trimble erik.trim...@oracle.com wrote:
 Actually, your biggest bottleneck will be the IOPS limits of the drives.  A
 7200RPM SATA drive tops out at 100 IOPS.  Yup. That's it.

 So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100
 IOPS, that means you will finish (best case) in 62.5e4 seconds.  Which is
 over 173 hours. Or, about 7.25 WEEKS.
No argument on IOPS, but 173 hours is 7 days, or a little over one week.

Will
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Erik Trimble
 
 the thing that folks tend to forget is that RaidZ is IOPS limited.  For
 the most part, if I want to reconstruct a single slab (stripe) of data,
 I have to issue a read to EACH disk in the vdev, and wait for that disk
 to return the value, before I can write the computed parity value out
 to
 the disk under reconstruction.

If I'm trying to interpret your whole message, Erik, and condense it, I
think I get the following.  Please tell me if and where I'm wrong.

In any given zpool, some number of slabs are used in the whole pool.  In
raidzN, a portion of each slab is written on each disk.  Therefore, during
resilver, if there are a total of 1 million slabs used in the zpool, it means
each good disk will need to read 1 million partial slabs, and the replaced
disk will need to write 1 million partial slabs.  Each good disk receives a
read request in parallel, and all of them must complete before a write is
given to the new disk.  Each read/write cycle is completed before the next
cycle begins.  (It seems this could be accelerated by allowing all the good
disks to continue reading in parallel instead of waiting, right?)

The conclusion I would reach is:

Given no bus bottleneck:

It is true that resilvering a raidz will be slower with many disks in the
vdev, because the average latency for the worst of N disks will increase as
N increases.  But that effect is only marginal, and bounded between the
average latency of a single disk, and the worst case latency of a single
disk.

The characteristic that *really* makes a big difference is the number of
slabs in the pool.  i.e. if your filesystem is composed of mostly small
files or fragments, versus mostly large unfragmented files.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Edward Ned Harvey
 From: Hatish Narotam [mailto:hat...@gmail.com]
 
 PCI-E 8X 4-port ESata Raid Controller.
 4 x ESata to 5Sata Port multipliers (each connected to a ESata port on
 the controller).
 20 x Samsung 1TB HDD's. (each connected to a Port Multiplier).

Assuming your disks can all sustain 500Mbit/sec, which I find to be typical
for 7200rpm sata disks, and you have groups of 5 that all have a 3Gbit
upstream bottleneck, it means each of your groups of 5 should be fine in a
raidz1 configuration.

You think that your sata card can do 32Gbit because it's on a PCIe x8 bus.
I highly doubt it unless you paid a grand or two for your sata controller,
but please prove me wrong.  ;-)  I think the backplane of the sata
controller is more likely either 3G or 6G.  

If it's 3G, then you should use 4 groups of raidz1.
If it's 6G, then you can use 2 groups of raidz2 (because 10 drives of
500Mbit can only sustain 5Gbit)
If it's 12G or higher, then you can make all of your drives one big vdev of
raidz3.
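
For concreteness, a sketch of the first of those layouts (4 groups of raidz1);
the controller and target numbers are invented, so substitute the names that
format reports on your box:

  zpool create tank \
      raidz1 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
      raidz1 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
      raidz1 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 \
      raidz1 c2t5d0 c2t6d0 c2t7d0 c2t8d0 c2t9d0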


 According to Samsungs site, max read speed is 250MBps, which
 translates to 2Gbps. Multiply by 5 drives gives you 10Gbps.

I guarantee you this is not a sustainable speed for 7.2krpm sata disks.  You
can get a decent measure of sustainable speed by doing something like:
(write 1 GB)
time dd if=/dev/zero of=/some/file bs=1024k count=1024
(beware: you might get an inaccurate speed measurement here
due to RAM buffering.  See below.)

(reboot to ensure nothing is in cache)
(read the 1 GB back)
time dd if=/some/file of=/dev/null bs=1024k
(Now you're certain you have a good measurement.
If it matches the measurement you had before,
that means your original measurement was also
accurate.  ;-) )
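
One way to skip the reboot step is to write a file comfortably larger than RAM,
so the read pass cannot be served from cache; a sketch, assuming roughly 8 GB
of RAM (adjust count to a couple of times your own RAM size):

  time dd if=/dev/zero of=/some/file bs=1024k count=16384   # 16 GB write
  time dd if=/some/file of=/dev/null bs=1024k               # read it back, mostly uncached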

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 The characteristic that *really* makes a big difference is the number
 of
 slabs in the pool.  i.e. if your filesystem is composed of mostly small
 files or fragments, versus mostly large unfragmented files.

Oh, if at least some of my reasoning was correct, there is one valuable
take-away point for hatish:

Given some number X total slabs used in the whole pool.  If you use a single
vdev for the whole pool, you will have X partial slabs written on each disk.
If you have 2 vdev's, you'll have approx X/2 partial slabs written on each
disk.  3 vdevs ~ X/3 partial slabs on each disk.  Therefore, the resilver
time approximately divides by the number of separate vdev's you are using in
your pool.

So the largest factor affecting resilver time of a single large vdev versus
many smaller vdev's is NOT the quantity of data written on each disk, but
just the fact that fewer slabs are used on each disk when using smaller
vdev's.

If you want to choose between (a) a 21-disk raidz3 versus (b) 3 vdevs of
7-disk raidz1 each, then:  the raidz3 provides better redundancy, but has the
disadvantage that every slab must be partially written on every disk.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Marty Scholes
Erik wrote:
 Actually, your biggest bottleneck will be the IOPS
 limits of the 
 drives.  A 7200RPM SATA drive tops out at 100 IOPS.
  Yup. That's it.
 So, if you need to do 62.5e6 IOPS, and the rebuild
 drive can do just 100 
 IOPS, that means you will finish (best case) in
 62.5e4 seconds.  Which 
 is over 173 hours. Or, about 7.25 WEEKS.

My OCD is coming out and I will split that hair with you.  173 hours is just 
over a week.

This is a fascinating and timely discussion.  My personal (biased and 
unhindered by facts) preference is wide-stripe RAIDZ3.  Ned is right: I kept 
reading that RAIDZx should not exceed _ devices and couldn't find real 
numbers behind those conclusions.

Discussions in this thread have opened my eyes a little, and I am in the middle 
of deploying a second 22-disk fibre array on my home server, so I have been 
struggling with the best way to allocate pools.  Up until reading this thread, 
the biggest downside to wide stripes that I was aware of has been low IOPS.  
And let's be clear: while on paper the IOPS of a wide stripe is the same as a 
single disk, it is actually worse.  In truth, the service time for any request 
on a wide stripe is the service time of the SLOWEST disk for that request.  The 
slowest disk may vary from request to request, but will always delay the entire 
stripe operation.

Since all of the 44 spindles are 15K disks, I am about to convince myself to go 
with two pools of wide stripes and keep several spindles for L2ARC and SLOG.  
The thinking is that other background operations (scrub and resilver) can take 
place with little impact on application performance, since application I/O will 
be served largely from the L2ARC and SLOG.

Of course, I could be wrong on any of the above.

Cheers,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] performance leakage when copy huge data

2010-09-09 Thread Ross Walker
On Sep 9, 2010, at 8:27 AM, Fei Xu twinse...@hotmail.com wrote:

 
 Service times here are crap. Disks are malfunctioning in some way. If
 your source disks can take seconds (or 10+ seconds) to reply, then of
 course your copy will be slow. Disk is probably having a hard time
 reading the data or something.
 
 
 
 Yeah, that should not go over 15ms.  I just cannot understand why it starts out 
 fine, with hundreds of GB of files transferred, and then suddenly falls asleep.
 By the way, the WDIDLE timer, which might otherwise cause some issues, is already 
 disabled.  I've changed to another system to test ZFS send between an 8*1TB pool 
 and a 4*1TB pool.  Hope everything's OK in this case.

This might be the dreaded WD TLER issue. Basically the drive keeps retrying a 
read operation over and over after a bit error, trying to recover from the read 
error by itself. With ZFS one really needs to disable this behavior and have the 
drives fail immediately.

Check your drives to see if they have this feature; if so, think about replacing 
the drives in the source pool that have long service times, and make sure this 
feature is disabled on the destination pool drives.
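
For drives that expose it, the knob can be inspected and set with smartmontools
(ATA "SCT Error Recovery Control"); the device path is a placeholder, and on
some desktop models the setting does not survive a power cycle:

  smartctl -l scterc /dev/rdsk/c0t3d0          # show current read/write recovery limits
  smartctl -l scterc,70,70 /dev/rdsk/c0t3d0    # limit recovery to 7.0 seconds (value in tenths of a second)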

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] performance leakage when copy huge data

2010-09-09 Thread Markus Kovero
On Sep 9, 2010, at 8:27 AM, Fei Xu twinse...@hotmail.com wrote:


 This might be the dreaded WD TLER issue. Basically the drive keeps retrying a 
 read operation over and over after a bit error trying to recover from a  
 read error themselves. With ZFS one really needs to disable this and have the 
 drives fail immediately.

 Check your drives to see if they have this feature, if so think about 
 replacing the drives in the source pool that have long service times and make 
 sure this feature is disabled on the destination pool drives.

 -Ross


It might be due to TLER issues, but I'd try to pin the Greens down to SATA1 mode 
(use the jumper, or force it via the controller). It might help a bit with these 
disks, although they are not really suitable for use in any RAID configuration 
due to the TLER issue, which cannot be disabled in later firmware versions.

Yours
Markus Kovero

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool create using whole disk - do I add p0? E.g. c4t2d0 or c4t2d0p0

2010-09-09 Thread Cindy Swearingen

Hi--

It might help to review the disk component terminology description:

c#t#d#p# = represents the fdisk partition on x86 systems, where
you can have up to 4 fdisk partitions, such as one for the Solaris
OS or a Windows OS. An fdisk partition is the larger container of the
disk or disk slices.

c#t#d# = represents the whole disk.

c#t#d#s# = represents the disk slice, used for the root pool because
of the current boot limitation that says we must boot from a slice.

The issue is that if you don't understand that the c#t#d#p# device
contains the c#t#d# or c#t#d#s# devices, you might create a pool
that contains p#, d#, and s# components, in an overlapping kind of
way (we've seen it). A bug exists to prevent pool creation with p#
devices.

You are probably okay if you use c0t0d0p0 and c0t1d0p0 and never
overlap the fdisk components, but we don't test this configuration
and it's not supported.
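
For example, the usual whole-disk form looks like this (pool and device names
are purely illustrative; given whole disks, ZFS labels them itself):

  zpool create tank mirror c4t2d0 c4t3d0
  zpool status tank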

Thanks,

Cindy

On 09/08/10 23:07, R.G. Keen wrote:

Hi Craig,
Don't use the p* devices for your storage pools. They
represent the larger fdisk partition.

Use the d* devices instead, like this example below:


Good advice, something I wondered about too. 


However, aside from my having guessed right once (I think...) I have no clue 
why this should be. Can you expound a bit on the reasoning behind this advice?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Arne Jansen
Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: It seems that every nfs write (rfs3_write) blocks
until the txg is flushed. This means a write takes up to 30 seconds. During
this time, the nfs calls block, occupying all NFS server threads. With all
server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait
until the writes finish, bringing the performance of the server effectively
down to zero.
It may be that the trigger for this behavior is around 95%. I managed to bring
the pool down to 95%, and now the writes get served continuously, as they should.

What is the explanation for this behaviour? Is it intentional and can the
threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] performance leakage when copy huge data

2010-09-09 Thread Mark Little
On Thu, 9 Sep 2010 14:05:51 +, Markus Kovero
markus.kov...@nebula.fi wrote:
 On Sep 9, 2010, at 8:27 AM, Fei Xu twinse...@hotmail.com wrote:
 
 
 This might be the dreaded WD TLER issue. Basically the drive keeps retrying 
 a read operation over and over after a bit error trying to recover from a  
 read error themselves. With ZFS one really needs to disable this and have 
 the drives fail immediately.
 
 Check your drives to see if they have this feature, if so think about 
 replacing the drives in the source pool that have long service times and 
 make sure this feature is disabled on the destination pool drives.
 
 -Ross
 
 
 It might be due tler-issues, but I'd try to pin greens down to
 SATA1-mode (use jumper, or force via controller). It might help a bit
 with these disks, although these are not really suitable disks for any
 use in any raid configurations due tler issue, which cannot be
 disabled in later firmware versions.
 
 Yours
 Markus Kovero
 

Just to clarify - do you mean TLER should be off or on?  TLER = Time
Limited Error Recovery so the drive only takes a max time (eg: 7
seconds) to retrieve data or returns an error.  So you say 'cannot be
disabled' but I think you mean 'cannot be ENABLED' ?

I've been doing a lot of research for a new storage box at work, and
from reading a lot of the info available in the Storage forum on
hardforum.com, the experts there seem to recommend NOT having TLER
enabled when using ZFS as ZFS can be configured for its timeouts, etc,
and the main reason to use TLER is when using those drives with hardware
RAID cards which will kick a drive out of the array if it takes longer
than 10 seconds.

Can anyone else here comment if they have had experience with the WD
drives and ZFS and if they have TLER enabled or disabled?

Cheers,
Mark
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Neil Perrin

Arne,

NFS often demands its transactions are stable before returning.
This forces ZFS to do the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to 
achieve this.

If ever it fails to allocate a block (of the size requested) it is forced
to close the txg containing the system call. Yes, this can be extremely
slow, but there is no other option for the ZIL. I'm surprised the wait is 
30 seconds; I would expect much less, but finding room for the rest of the 
txg data and metadata would also be a challenge.

Most (maybe all?) file systems perform badly when out of space. I 
believe we give a recommended free size, and I thought it was 90%.

Neil.

On 09/09/10 09:00, Arne Jansen wrote:

Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: It seems that every nfs write (rfs3_write) blocks
until the txg is flushed. This means a write takes up to 30 seconds. During
this time, the nfs calls block, occupying all NFS server threads. With all
server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait
until the writes finish, bringing the performance of the server effectively
down to zero.
It may be that the trigger for this behavior is around 95%. I managed to bring
the pool down to 95%, now the writes get served continuously as it should be.

What is the explanation for this behaviour? Is it intentional and can the
threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] NetApp/Oracle-Sun lawsuit done

2010-09-09 Thread David Magda
Seems that things have been cleared up:

 NetApp (NASDAQ: NTAP) today announced that both parties have agreed to
 dismiss their pending patent litigation, which began in 2007 between Sun
 Microsystems and NetApp. Oracle and NetApp seek to have the lawsuits
 dismissed without prejudice. The terms of the agreement are confidential.

http://tinyurl.com/39qkzgz
http://www.netapp.com/us/company/news/news-rel-20100909-oracle-settlement.html

A recap of the history at:

http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Neil Perrin

I should also have mentioned that if the pool has a separate log device
then this shouldn't happen. Assuming the slog is big enough, it
should have enough blocks to not be forced into using main pool 
device blocks.
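
For completeness, attaching a mirrored slog is a one-line operation of this
general shape (pool and device names are invented for illustration):

  zpool add tank log mirror c5t0d0 c5t1d0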


Neil.

On 09/09/10 10:36, Neil Perrin wrote:

Arne,

NFS often demands its transactions are stable before returning.
This forces ZFS to do the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to 
achieve this.

If ever it fails to allocate a block (of the size requested) it is forced
to close the txg containing the system call. Yes, this can be extremely
slow, but there is no other option for the ZIL. I'm surprised the wait 
is 30 seconds; I would expect much less, but finding room for the rest 
of the txg data and metadata would also be a challenge.

Most (maybe all?) file systems perform badly when out of space. I 
believe we give a recommended free size, and I thought it was 90%.

Neil.

On 09/09/10 09:00, Arne Jansen wrote:

Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: It seems that every nfs write (rfs3_write) 
blocks
until the txg is flushed. This means a write takes up to 30 seconds. 
During
this time, the nfs calls block, occupying all NFS server threads. 
With all
server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to 
wait
until the writes finish, bringing the performance of the server 
effectively

down to zero.
It may be that the trigger for this behavior is around 95%. I managed 
to bring
the pool down to 95%, now the writes get served continuously as it 
should be.


What is the explanation for this behaviour? Is it intentional and can 
the

threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Arne Jansen

Hi Neil,

Neil Perrin wrote:

NFS often demands its transactions are stable before returning.
This forces ZFS to do the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to 
achieve this.

If ever it fails to allocate a block (of the size requested) it is forced
to close the txg containing the system call. Yes, this can be extremely
slow, but there is no other option for the ZIL. I'm surprised the wait is 
30 seconds; I would expect much less, but finding room for the rest of the 
txg data and metadata would also be a challenge.


I think this is not what we saw, for two reasons:
 a) we have a mirrored slog device. According to zpool iostat -v only 16MB
out of 4GB were in use.
 b) it didn't seem like the txg would have been closed early. Rather, it kept
approximately to the 30-second intervals.

Internally we came up with a different explanation, without any backing that
it might be correct: when the pool reaches 96%, ZFS goes into a 'self defense'
mode. Instead of allocating blocks from the ZIL, every write turns synchronous and
has to wait for the txg to finish naturally. The reasoning behind this might
be that even if the ZIL is available, there might not be enough space left to commit
the ZIL to the pool. To prevent this, ZFS doesn't use the ZIL when the pool is above
96%. While this might be proper for small pools, on large pools 4% is still
some TB of free space, so there should be an upper limit of maybe 10GB on this
hidden reserve.
Also, this sudden switch of behavior is completely unexpected and at least
under-documented.



Most (maybe all?) file systems perform badly when out of space. I 
believe we give a recommended

free size and I thought it was 90%.


In this situation, not only did writes suffer; as a side effect, reads also
came to a nearly complete halt.

--
Arne




Neil.

On 09/09/10 09:00, Arne Jansen wrote:

Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: It seems that every nfs write (rfs3_write) 
blocks
until the txg is flushed. This means a write takes up to 30 seconds. 
During
this time, the nfs calls block, occupying all NFS server threads. With 
all

server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait
until the writes finish, bringing the performance of the server 
effectively

down to zero.
It may be that the trigger for this behavior is around 95%. I managed 
to bring
the pool down to 95%, now the writes get served continuously as it 
should be.


What is the explanation for this behaviour? Is it intentional and can the
threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done

2010-09-09 Thread Richard Elling
This is welcome news.
 -- richard

On Sep 9, 2010, at 9:38 AM, David Magda wrote:

 Seems that things have been cleared up:
 
 NetApp (NASDAQ: NTAP) today announced that both parties have agreed to
 dismiss their pending patent litigation, which began in 2007 between Sun
 Microsystems and NetApp. Oracle and NetApp seek to have the lawsuits
 dismissed without prejudice. The terms of the agreement are confidential.
 
 http://tinyurl.com/39qkzgz
 http://www.netapp.com/us/company/news/news-rel-20100909-oracle-settlement.html
 
 A recap of the history at:
 
 http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com












___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done

2010-09-09 Thread Erik Trimble

 On 9/9/2010 10:25 AM, Richard Elling wrote:

This is welcome news.
  -- richard

On Sep 9, 2010, at 9:38 AM, David Magda wrote:


Seems that things have been cleared up:


NetApp (NASDAQ: NTAP) today announced that both parties have agreed to
dismiss their pending patent litigation, which began in 2007 between Sun
Microsystems and NetApp. Oracle and NetApp seek to have the lawsuits
dismissed without prejudice. The terms of the agreement are confidential.

http://tinyurl.com/39qkzgz
http://www.netapp.com/us/company/news/news-rel-20100909-oracle-settlement.html

A recap of the history at:

http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss

Yes, it's welcome to get it over with.

I do get to bitch about one aspect here of the US civil legal system, 
though.  If you've gone so far as to burn our (the public's) time and 
money to file a lawsuit, you shouldn't be able to seal up the court 
transcript, or have a non-public settlement.  Call it the price you pay 
for wasting our time (i.e. the court system's time).


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Richard Elling
On Sep 9, 2010, at 10:09 AM, Arne Jansen wrote:

 Hi Neil,
 
 Neil Perrin wrote:
 NFS often demands it's transactions are stable before returning.
 This forces ZFS to do the system call synchronously. Usually the
 ZIL (code) allocates and writes a new block in the intent log chain to 
 achieve this.
 If ever it fails to allocate a block (of the size requested) it it forced
 to close the txg containing the system call. Yes this can be extremely
 slow but there is no other option for the ZIL. I'm surprised the wait is 30 
 seconds.
 I would expect mush less, but finding room for the rest of the txg data and 
 metadata
 would also be a challenge.
 
 I think this is not what we saw, for two reasons:
 a) we have a mirrored slog device. According to zpool iostat -v only 16MB
out of 4GB were in use.
 b) it didn't seem like the txg would have been closed early. Rather it kept
approximately the 30 second intervals.
 
 Internally we came up with a different explanation, without any backing that
 it might be correct: When the pool reaches 96%, zfs goes into a 'self defense'
 mode. Instead of allocating block from ZIL, every write turns synchronous and
 has to wait for the txg to finish naturally. The reasoning behind this might
 be that even if ZIL is available, there might not be enough space left to 
 commit
 the ZIL to the pool. To prevent this, zfs doesn't use the ZIL when the pool is 
 above
 96%. While this might be proper for small pools, on large pools 4% are still
 some TB of free space, so there should be an upper limit of maybe 10GB on this
 hidden reserve.

I do not believe this is correct.  At 96% the first-fit algorithm changes to 
best-fit and ganging can be expected. This has nothing to do with the ZIL.  
There is already a reserve set aside for metadata and the ZIL so that you can 
remove files when the file system is 100% full.  This reserve is 32 MB or 1/64 
of the pool size.

 Also this sudden switch of behavior is completely unexpected and at least
 under-documented.

Methinks you are just seeing the change in performance from the allocation
algorithm change.

 
 Most (maybe all?) file systems perform badly when out of space. I believe we 
 give a recommended
 free size and I thought it was 90%.
 
 In this situation, not only writes suffered, but as a side effect reads also
 came to a nearly complete halt.

If you have atime=on, then reads create writes.
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com












___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Arne Jansen

Richard Elling wrote:

On Sep 9, 2010, at 10:09 AM, Arne Jansen wrote:


Hi Neil,

Neil Perrin wrote:

NFS often demands its transactions are stable before returning.
This forces ZFS to do the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to achieve 
this.
If ever it fails to allocate a block (of the size requested) it is forced
to close the txg containing the system call. Yes, this can be extremely
slow, but there is no other option for the ZIL. I'm surprised the wait is 30 
seconds; I would expect much less, but finding room for the rest of the txg 
data and metadata would also be a challenge.

I think this is not what we saw, for two reasons:
a) we have a mirrored slog device. According to zpool iostat -v only 16MB
   out of 4GB were in use.
b) it didn't seem like the txg would have been closed early. Rather it kept
   approximately the 30 second intervals.

Internally we came up with a different explanation, without any backing that
it might be correct: When the pool reaches 96%, zfs goes into a 'self defense'
mode. Instead of allocating block from ZIL, every write turns synchronous and
has to wait for the txg to finish naturally. The reasoning behind this might
be that even if ZIL is available, there might not be enough space left to commit
the ZIL to the pool. To prevent this, zfs doesn't use the ZIL when the pool is above
96%. While this might be proper for small pools, on large pools 4% are still
some TB of free space, so there should be an upper limit of maybe 10GB on this
hidden reserve.


I do not believe this is correct.  At 96% the first-fit algorithm changes to 
best-fit and ganging can be expected. This has nothing to do with the ZIL.  
There is already a reserve set aside for metadata and the ZIL so that you can 
remove files when the file system is 100% full.  This reserve is 32 MB or 1/64 
of the pool size.


Maybe it is some side-effect of this change of allocation scheme. But I'm very
sure about what I saw. The change was drastic and abrupt. I had a dtrace script
running that measured the time for rfs3_write to complete. With the pool at 96%
I saw a burst of writes every 30 seconds, with completion times of up to 30s.
With the pool below 96%, I saw a continuous stream of writes with completion 
times of mostly a few microseconds.
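
For reference, the measurement was of this general shape (a reconstruction, not
the exact script; the nfssrv module/probe names are what I would expect on
Solaris 10 and should be treated as an assumption):

  dtrace -n '
    fbt:nfssrv:rfs3_write:entry  { self->ts = timestamp; }
    fbt:nfssrv:rfs3_write:return /self->ts/ {
      @lat["rfs3_write (ns)"] = quantize(timestamp - self->ts);
      self->ts = 0;
    }'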




In this situation, not only writes suffered, but as a side effect reads also
came to a nearly complete halt.


If you have atime=on, then reads create writes.


atime is off. The impact on reads/lookups/getattrs came, IMHO, because all server
threads had been occupied by blocking writes for a prolonged time.

I'll try to reproduce this on a test machine.

--
Arne

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done

2010-09-09 Thread Bob Friesenhahn

On Thu, 9 Sep 2010, Erik Trimble wrote:

Yes, it's welcome to get it over with.

I do get to bitch about one aspect here of the US civil legal system, though. 
If you've gone so far as to burn our (the public's) time and money to file a 
lawsuit, you shouldn't be able to seal up the court transcript, or have a 
non-public settlement.  Call it the price you pay for wasting our time (i.e. 
the court system's time).


Unfortunately, this may just be a case of Oracle's patents vs NetApp's 
patents.  Oracle obviously holds a lot of patents and could 
counter-sue using one of its own patents.  Oracle's handshake 
agreement with NetApp does not in any way shield other zfs commercial 
users from a patent lawsuit from NetApp.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] performance leakage when copy huge data

2010-09-09 Thread Miles Nordin
 ml == Mark Little marklit...@koallo.com writes:

ml Just to clarify - do you mean TLER should be off or on?

It should be set to ``do not have asvc_t > 11 seconds and < 1 io/s''.

...which is not one of the settings of the TLER knob.

This isn't a problem with the TLER *setting*.  TLER does not even
apply unless the drive has a latent sector error.

TLER does not even apply unless the drive has a latent sector error.

TLER does not even apply unless the drive has a latent sector error.

GOT IT?  so if the drive is not defective, but is erratically having
huge latency when not busy, this isn't a TLER problem.  It's a
drive-is-unpredictable-piece-of-junk problem.  Will the problem go
away if you change the TLER setting to the opposite of whatever it is?
Who knows?!  It shouldn't based on the claimed purpose of TLER, but in
reality, maybe, maybe not, because the drive shouldn't (``shouldn't'',
haha) act like that to begin with.  It will be more likely to go away
if you replace the drive with a different model, though.

ml Storage forum on hardforum.com, the experts there seem to
ml recommend NOT having TLER enabled when using ZFS as ZFS can be
ml configured for its timeouts, etc, 

I don't believe there are any configurable timeouts in ZFS.  The ZFS
developers take the position that timeouts are not our problem and
push all that work down the stack to the controller driver and the
disk driver, which cooperate (this is two drivers, now.  plus a third
``SCSI mid-layer'' perhaps, for some controllers but not others.) to
implement a variety of inconsistent, silly, undocumented cargo-cult
flailing timeout regimes that we all have to put up with.  However
they are always quite long.  The ATA max timeout is 30sec, and AIUI
they are all much longer than that.

My new favorite thing, though, is the reference counting.  OS: ``This
disk/iSCSIdisk is `busy' so you can't detach it''.  me: ``bullshit.
YOINK, detached, now deal with it.''  IMO this area is in need of some
serious bar-raising.

ml and the main reason to use TLER is when using those drives
ml with hardware RAID cards which will kick a drive out of the
ml array if it takes longer than 10 seconds.

yup.

which is something the drive will not do unless it encounters an
ERROR.  that is the E in TLER.  In other words, the feature as
described prevents you from noticing and invoking warranty replacement
on your about-to-fail drive.  For this you pay double.  Have I got
that right?

In any case the obvious proper place to fix this is in the
RAID-on-a-card firmware, not the disk firmware, if it does even need
fixing which is unclear to me.  unless the disk manufacturers are
going to offer a feature ``do not spend more than 1 second out of
every 2 seconds `trying harder' to read marginal data, just return
errors'' which would actually have real value, the only reason TLER is
proper is that it can convince all you gamers to pay twice as much for
a drive because they've flipped a single bit in the firmware and then
shovelled a big pile of bullshit into your heads.

ml Can anyone else here comment if they have had experience with
ml the WD drives and ZFS and if they have TLER enabled or
ml disabled?

I do not have any problems with drives dropping out of ZFS using the
normal TLER setting.

I do have problems with slowly-failing drives fucking up the whole
system.  ZFS doesn't deal with them gracefully, and I have to find the
bad drive and remove it by hand.  All this stuff about cold spares
automatically replacing and users never noticing is largely a fantasy.

Neither observation leads me to want TLER.
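
When a slowly-failing drive does drag the pool down, the
find-it-and-remove-it-by-hand routine looks roughly like this (a sketch
only; the pool and device names are hypothetical):

  zpool status -x                    # which pool, if any, ZFS thinks is unhealthy
  iostat -xn 5                       # find the device with the pathological asvc_t
  zpool offline tank c3t3d0          # stop issuing I/O to the suspect drive
  zpool replace tank c3t3d0 c3t5d0   # resilver onto a replacement once one is in hand
  zpool status tank                  # follow the resilver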

However, observations like this ``why did my disks suddenly slow
down?'' lead me to avoid WD drives period, for ZFS or not ZFS or
anything at all.  Whipping up all this marketing silliness around TLER
also leads me to avoid them because I know they will shovel bullshit
and FUD to justify jacked prices.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done

2010-09-09 Thread Miles Nordin
 dm == David Magda dma...@ee.ryerson.ca writes:

dm http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/

http://www.groklaw.net/articlebasic.php?story=20050121014650517

says when the MPL was modified to become the CDDL, clauses were
removed which would have required Oracle to disclose any patent
licenses it might have negotiated with NetApp covering CDDL code.  The
disclosure would have to be added to hg, freeze or no: ``If
Contributor obtains such knowledge after the Modification is made
available as described in Section 3.2, Contributor shall promptly
modify the LEGAL file in all copies Contributor makes available
thereafter and shall take other steps (such as notifying appropriate
mailing lists or newsgroups) reasonably calculated to inform those who
received the Covered Code that new knowledge has been obtained.''
This is in MPL but removed from CDDL.

The groklaw poster's concern is that this is a mechanism through which
Oracle could manoeuvre to make the CDDL worthless as a guarantee of zfs
users' software freedom.  CDDL does implicitly grant rights to Oracle's
patents, but it conveys nothing of whatever shield Oracle may have
negotiated against NetApp's.

AIUI GPLv3 is different and does not have this problem, though I don't
understand it well so I could be wrong.  With MPL at least we would
know about the negotiations: the settlement was ``secret'' which is
exactly the disaster scenario the groklaw poster warned of.

I'm sorry you cannot be uninterested in licenses and ``just want to
get work done.''

To me it looks like the patent situation is mostly an obstacle to
getting ZFS development funded.  If you used ZFS secretly in some kind
of cloud service, and never told anyone about it, you could be pretty
certain of getting away with it without any patent claims throughout
the entire decade or so that ZFS remains relevant, but if you want to
participate in a horizontally-divided market like Coraid, or otherwise
share source changes, you might get sued.  This regime has to be a
huge drag on the industry, and it makes things really unpredictable,
which has to discourage investment, and it strongly favours large
companies.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done

2010-09-09 Thread Erik Trimble

 On 9/9/2010 11:11 AM, Garrett D'Amore wrote:

On Thu, 2010-09-09 at 12:58 -0500, Bob Friesenhahn wrote:

On Thu, 9 Sep 2010, Erik Trimble wrote:

Yes, it's welcome to get it over with.

I do get to bitch about one aspect here of the US civil legal system, though.
If you've gone so far as to burn our (the public's) time and money to file a
lawsuit, you shouldn't be able to seal up the court transcript, or have a
non-public settlement.  Call it the price you pay for wasting our time (i.e.
the court system's time).

Unfortunately, this may just be a case of Oracle's patents vs NetApp's
patents.  Oracle obviously holds a lot of patents and could
counter-sue using one of its own patents.  Oracle's handshake
agreement with NetApp does not in any way shield other zfs commercial
users from a patent lawsuit from NetApp.

True.  But, I wonder if the settlement sets a precedent?

Certainly the lack of a successful lawsuit has *failed* to set any
precedent conclusively indicating that NetApp has enforceable patents
where ZFS is concerned.

IANAL, but it seems like if Oracle and NetApp were to reach some kind of
licensing arrangement, then it might be construed to be anticompetitive
if NetApp were to fail to offer similar licensing arrangements to
downstream consumers?  Does anyone know if there is any basis for such a
theory, or are these just my idle imaginings?

As far as I know, Nexenta has not been approached by NetApp.  I'd like
to see what happens with Coraid ... but ultimately those conversations
are between Coraid and NetApp.

- Garrett


This is *exactly* the reason I advocate forced public settlement 
agreements.   If you've availed yourself of the court system, you should 
be obligated to put into the public record any agreement reached, just 
as if you had gotten a verdict.  It would help prevent a lot of the 
cross-licensing discrimination due to secrecy.


Oh well.

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done

2010-09-09 Thread Bob Friesenhahn

On Thu, 9 Sep 2010, Garrett D'Amore wrote:


True.  But, I wonder if the settlement sets a precedent?


No precedent has been set.


Certainly the lack of a successful lawsuit has *failed* to set any
precedent conclusively indicating that NetApp has enforceable patents
where ZFS is concerned.


Right.


IANAL, but it seems like if Oracle and NetApp were to reach some kind of
licensing arrangement, then it might be construed to be anticompetitive
if NetApp were to fail to offer similar licensing arrangements to
downstream consumers?  Does anyone know if there is any basis for such a
theory, or are these just my idle imaginings?


Idle imaginings.  A patent holder is not compelled to license use of 
the patent to anyone else, and can be selective regarding who gets a 
license.



As far as I know, Nexenta has not been approached by NetApp.  I'd like
to see what happens with Coraid ... but ultimately those conversations
are between Coraid and NetApp.


There should be little doubt that NetApp's goal was to make money by 
suing Sun.  Nexenta does not have enough income/assets to make a risky 
lawsuit worthwhile.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] resilver = defrag?

2010-09-09 Thread Orvar Korvar
A) Resilver = Defrag. True/false?

B) If I buy larger drives and resilver, does defrag happen?

C) Does zfs send | zfs receive mean it will defrag?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] How to migrate to 4KB sector drives?

2010-09-09 Thread Orvar Korvar
ZFS does not handle 4K-sector drives well; you need to create a new zpool with 
the 4K property (ashift) set. 
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html

Are there plans to allow resilver to handle 4K sector drives?
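
For reference, the alignment an existing pool was created with can be
checked after the fact; a minimal sketch, assuming a hypothetical pool
named tank:

  # ashift=9 means 512-byte alignment, ashift=12 means 4K alignment.
  zdb -C tank | grep ashift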
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Freddie Cash
On Thu, Sep 9, 2010 at 1:04 PM, Orvar Korvar
knatte_fnatte_tja...@yahoo.com wrote:
 A) Resilver = Defrag. True/false?

False.  Resilver just rebuilds a drive in a vdev based on the
redundant data stored on the other drives in the vdev.  Similar to how
replacing a dead drive works in a hardware RAID array.

 B) If I buy larger drives and resilver, does defrag happen?

No.
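
The larger-drive path is just a series of replaces, and the block layout is
copied as-is.  A hedged sketch -- device names are hypothetical, and the
autoexpand property assumes a reasonably recent pool version:

  zpool set autoexpand=on tank        # let the pool grow once every disk is bigger
  zpool replace tank c1t2d0 c1t6d0    # repeat for each disk, waiting for each resilver
  zpool status tank                   # watch resilver progress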

 C) Does zfs send | zfs receive mean it will defrag?

No.

ZFS doesn't currently have a defragmenter.  That will come when the
legendary block pointer rewrite feature is committed.


-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done

2010-09-09 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
 
 There should be little doubt that NetApp's goal was to make money by
 suing Sun.  Nexenta does not have enough income/assets to make a risky
 lawsuit worthwhile.

But in all likelihood, Apple still won't touch ZFS.  Apple would be worth
suing.  A big fat juicy...

One interesting take-away point, however:  Oracle is now in a solid position
to negotiate with Apple.  If Apple wants to pay for ZFS and indemnification
against a NetApp lawsuit, Oracle can grant it.  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Marty Scholes
I am speaking from my own observations, not from anything scientific such as 
reading the code or designing the process.

 A) Resilver = Defrag. True/false?

False

 B) If I buy larger drives and resilver, does defrag
 happen?

No.  The first X sectors of the bigger drive are identical to the smaller 
drive, fragments and all.

 C) Does zfs send | zfs receive mean it will defrag?

Yes.  The data is laid out on the receiving side in a sane manner, until it 
later becomes fragmented.
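
A minimal sketch of that path, with hypothetical pool and dataset names:

  zfs snapshot tank/data@migrate
  zfs send tank/data@migrate | zfs receive newpool/data

The receive writes every block fresh on the destination, which is why the
layout comes out sane until normal churn fragments it again.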
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Freddie Cash
On Thu, Sep 9, 2010 at 1:26 PM, Freddie Cash fjwc...@gmail.com wrote:
 On Thu, Sep 9, 2010 at 1:04 PM, Orvar Korvar
 knatte_fnatte_tja...@yahoo.com wrote:
 A) Resilver = Defrag. True/false?

 False.  Resilver just rebuilds a drive in a vdev based on the
 redundant data stored on the other drives in the vdev.  Similar to how
 replacing a dead drive works in a hardware RAID array.

 B) If I buy larger drives and resilver, does defrag happen?

 No.

Actually, thinking about it ... since the resilver is writing new data
to an empty drive, in essence, the drive is defragmented.

 C) Does zfs send | zfs receive mean it will defrag?

 No.

Same here, but only if the receiving pool has never had any snapshots
deleted or files deleted, so that there are no holes in the pool.
Then the newly written data will be contiguous (not fragmented).

 ZFS doesn't currently have a defragmenter.  That will come when the
 legendary block pointer rewrite feature is committed.


 --
 Freddie Cash
 fjwc...@gmail.com




-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done

2010-09-09 Thread Tim Cook
On Thu, Sep 9, 2010 at 2:49 PM, Bob Friesenhahn 
bfrie...@simple.dallas.tx.us wrote:

 On Thu, 9 Sep 2010, Garrett D'Amore wrote:


 True.  But, I wonder if the settlement sets a precedent?


 No precedent has been set.


  Certainly the lack of a successful lawsuit has *failed* to set any
 precedent conclusively indicating that NetApp has enforceable patents
 where ZFS is concerned.


 Right.


  IANAL, but it seems like if Oracle and NetApp were to reach some kind of
 licensing arrangement, then it might be construed to be anticompetitive
 if NetApp were to fail to offer similar licensing arrangements to
 downstream consumers?  Does anyone know if there is any basis for such a
 theory, or are these just my idle imaginings?


 Idle imaginings.  A patent holder is not compelled to license use of the
 patent to anyone else, and can be selective regarding who gets a license.


  As far as I know, Nexenta has not been approached by NetApp.  I'd like
 to see what happens with Coraid ... but ultimately those conversations
 are between Coraid and NetApp.


 There should be little doubt that NetApp's goal was to make money by suing
 Sun.  Nexenta does not have enough income/assets to make a risky lawsuit
 worthwhile.



There should be little doubt it's a complete waste of money for NetApp to go to
court with a second party when the outcome of their primary lawsuit will
determine the outcome of the second.  They had absolutely nothing to gain by
suing Nexenta if they still had a pending lawsuit with Sun.  Furthermore,
unless you work as legal counsel for Nexenta, I'd say you have absolutely no
clue whether or not they received a cease and desist from NetApp.

I *STRONGLY* doubt the goal was money for NetApp.  They've got that coming
out of their ears.  It was either cross licensing issues (almost assuredly
this), or a hope to stop/slow down ZFS.

If Nexenta had received one, I strongly doubt it's something they would want
to publicize.  It wouldn't exactly give potential customers the warm and fuzzies.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Haudy Kazemi

Comment at end...

Mattias Pantzare wrote:

On Wed, Sep 8, 2010 at 15:27, Edward Ned Harvey sh...@nedharvey.com wrote:
  

From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of
Mattias Pantzare

It
is about 1 vdev with 12 disk or  2 vdev with 6 disks. If you have 2
vdev you have to read half the data compared to 1 vdev to resilver a
disk.
  

Let's suppose you have 1T of data.  You have 12-disk raidz2.  So you have
approx 100G on each disk, and you replace one disk.  Then 11 disks will each
read 100G, and the new disk will write 100G.

Let's suppose you have 1T of data.  You have 2 vdev's that are each 6-disk
raidz1.  Then we'll estimate 500G is on each vdev, so each disk has approx
100G.  You replace a disk.  Then 5 disks will each read 100G, and 1 disk
will write 100G.

Both of the above situations resilver in equal time, unless there is a bus
bottleneck.  21 disks in a single raidz3 will resilver just as fast as 7
disks in a raidz1, as long as you are avoiding the bus bottleneck.  But 21
disks in a single raidz3 provides better redundancy than 3 vdev's each
containing a 7 disk raidz1.

In my personal experience, approx 5 disks can max out approx 1 bus.  (It
actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks
on a good bus, or good disks on a crap bus, but generally speaking people
don't do that.  Generally people get a good bus for good disks, and cheap
disks for crap bus, so approx 5 disks max out approx 1 bus.)

In my personal experience, servers are generally built with a separate bus
for approx every 5-7 disk slots.  So what it really comes down to is ...

Instead of the Best Practices Guide saying Don't put more than ___ disks
into a single vdev, the BPG should say Avoid the bus bandwidth bottleneck
by constructing your vdev's using physical disks which are distributed
across multiple buses, as necessary per the speed of your disks and buses.



This is assuming that you have no other IO besides the scrub.

You should of course keep the number of disks in a vdev low for
general performance reasons unless you only have linear reads (as your
IOPS will be close to what only one disk can give for the whole vdev).
There is another optimization in the Best Practices Guide that says the 
number of devices in a vdev should be (N+P) with P = 1 (raidz), 2 
(raidz2), or 3 (raidz3) and N equals 2, 4, or 8.

I.e. 2^N + P where N is 1, 2, or 3 and P is the RAIDZ level.

I.e. Optimal sizes
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Haudy Kazemi

Erik Trimble wrote:

 On 9/9/2010 2:15 AM, taemun wrote:
Erik: does that mean that keeping the number of data drives in a 
raidz(n) to a power of two is better? In the example you gave, you 
mentioned 14kb being written to each drive. That doesn't sound very 
efficient to me.


(when I say the above, I mean a five disk raidz or a ten disk raidz2, 
etc)


Cheers,



Well, since the size of a slab can vary (from 512 bytes to 128k), it's 
hard to say. Length (size) of the slab is likely the better 
determination. Remember each block on a hard drive is 512 bytes (for 
now).  So, it's really not any more efficient to write 16k than 14k 
(or vice versa). Both are integer multiples of 512 bytes.


IIRC, there was something about using a power-of-two number of data 
drives in a RAIDZ, but I can't remember what that was. It may just be 
a phantom memory.


Not a phantom memory...

From Matt Ahrens in a thread titled 'Metaslab alignment on RAID-Z':
http://www.opensolaris.org/jive/thread.jspa?messageID=60241
'To eliminate the blank round up sectors for power-of-two blocksizes 
of 8k or larger, you should use a power-of-two plus 1 number of disks in 
your raid-z group -- that is, 3, 5, or 9 disks (for double-parity, use a 
power-of-two plus 2 -- that is, 4, 6, or 10). Smaller blocksizes are 
more constrained; for 4k, use 3 or 5 disks (for double parity, use 4 or 
6) and for 2k, use 3 disks (for double parity, use 4).'



These round-up sectors are skipped and used as padding to simplify space 
accounting and improve performance.  I may have referred to them as 
zero-padding sectors in other posts; however, they're not necessarily zeroed.


See the thread titled 'raidz stripe size (not stripe width)' 
http://opensolaris.org/jive/thread.jspa?messageID=495351



This looks to be the reasoning behind the optimization in the ZFS Best 
Practices Guide that says the number of devices in a vdev should be 
(N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8.

I.e. 2^N + P where N is 1, 2, or 3 and P is the RAIDZ level.

I.e. Optimal sizes
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev

The best practices guide recommendation of 3-9 devices per vdev appears 
to be based on RAIDZ1's optimal sizes of 3, 5, and 9 devices when N=1 to 3 in 2^N + P.


Victor Latushkin in a thread titled 'odd versus even' said the same 
thing.  Adam Leventhal said this had a 'very slight space-efficiency 
benefit' in the same thread.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg05460.html

---
That said, the recommendations in the Best Practices Guide for RAIDZ2 to 
start with 5 disks and RAIDZ3 to start with 8 disks do not match the 
last recommendation.  What is the reasoning behind 5 and 8?  
Reliability vs space?

Start a single-parity RAIDZ (raidz) configuration at 3 disks (2+1)
Start a double-parity RAIDZ (raidz2) configuration at 5 disks (3+2)
Start a triple-parity RAIDZ (raidz3) configuration at 8 disks (5+3)
(N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8


Perhaps the Best Practices Guide should also recommend:
-the use of striped vdevs in order to bring up the IOPS number, 
particularly when using enough hard drives to meet the capacity and 
reliability requirements.

-avoiding slow consumer class drives (fast ones may be okay for some users)
-more sample array configurations for common drive chassis capacities
-consider using a RAIDZ1 main pool with RAIDZ1 backup pool rather than 
higher level RAIDZ or mirroring (touch on the value of backup vs. 
stronger RAIDZ)
-watch out for BIOS or firmware upgrades that change host protected area 
(HPA) settings on drives making them appear smaller than before


The BPG should also resolve this discrepancy:
Storage Pools section: For production systems, use whole disks rather 
than slices for storage pools for the following reasons
Additional Cautions for Storage Pools: Consider planning ahead and 
reserving some space by creating a slice which is smaller than the whole 
disk instead of the whole disk.


---


Other (somewhat) related threads:


From Darren Dunham in a thread titled 'ZFS raidz2 number of disks':
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread/dd1b5997bede5265
' 1 Why is the recommendation for a raidz2 3-9 disk, what are the cons 
for having 16 in a pool compared to 8?
Reads potentially have to pull data from all data columns to reconstruct 
a filesystem block for verification.  For random read workloads, 
increasing the number of columns in the raidz does not increase the read 
iops.  So limiting the column count usually makes sense (with a cost 
tradeoff).  16 is valid, but not recommended.'




From Richard Elling in a thread titled 'rethinking RaidZ and Record size':
http://opensolaris.org/jive/thread.jspa?threadID=121016
'The raidz pathological worst case is a random read 

Re: [zfs-discuss] performance leakage when copy huge data

2010-09-09 Thread Fei Xu
Just to update the status and findings.

I've checked TLER settings and they are off by default.

I moved the source pool to another chassis and did the 3.8TB send again.  This 
time, no problems at all!  The differences are: 
1. New chassis
2. Bigger memory: 32GB vs. 12GB
3. Although wdidle time is disabled by default, I've changed the HD mode from 
silent to performance in HDTune.  This is something I once read on a website 
that might also fix the disk head park/unpark issue (aka C1).

It seems TLER is not the root cause, or at least that leaving it off is OK.

My next steps will be to 
1. move the drives back to see whether it was the performance mode that fixed the issue, and 
2. if not, add more memory and try again.

By the way, in HDTune I saw that C7 (Ultra DMA CRC error count) is a little 
high, which indicates a potential connection issue.  Maybe it is all caused 
by the enclosure?
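
For anyone who wants to check the same counter outside HDTune, a hedged
sketch assuming smartmontools is available on your platform; the device
path is hypothetical and the -d option needed varies by controller:

  # SMART attribute 199 is the UDMA CRC error count; a rising value usually
  # points at cabling or the enclosure rather than the drive itself.
  smartctl -A /dev/rdsk/c3t2d0

  # Newer smartmontools builds can also report the drive's error-recovery
  # timeout (the thing TLER controls):
  smartctl -l scterc /dev/rdsk/c3t2d0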
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Orvar Korvar
 
 A) Resilver = Defrag. True/false?

I think everyone will agree false on this question.  However, more detail
may be appropriate.  See below.


 B) If I buy larger drives and resilver, does defrag happen?

Scores so far:
2 No
1 Yes


 C) Does zfs send | zfs receive mean it will defrag?

Scores so far:
1 No
2 Yes

 ...

Does anybody here know what they're talking about?  I'd feel good if perhaps
Erik ... or Neil ... perhaps ... answered the question with actual
knowledge.

Thanks...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Edward Ned Harvey
 From: Haudy Kazemi [mailto:kaze0...@umn.edu]

 There is another optimization in the Best Practices Guide that says the
 number of devices in a vdev should be (N+P) with P = 1 (raidz), 2
 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8.
 I.e. 2^N + P where N is 1, 2, or 3 and P is the RAIDZ level.
 
 I.e. Optimal sizes
 RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
 RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
 RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev

This sounds logical, although I don't know how real it is.  The logic seems
to be ... Assuming slab sizes of 128K, the amount of data written to each
disk within the vdev gets divided into something which is a multiple of 512b
or 4K (newer drives supposedly starting to use 4K block sizes instead of
512b).  

But I have doubts about the real-ness here, because ... An awful lot of
times, your actual slabs are smaller than 128K just because you're not
performing sustained sequential writes very often.

But it seems to make sense, whenever you *do* have some sequential writes,
you would want the data written to each disk to be a multiple of 512b or 4K.
If you had a 128K slab, divided into 5, then each disk would write 25.6K and
even for sustained sequential writes, some degree of fragmentation would be
impossible to avoid.  Actually, I don't think fragmentation is technically
the correct term for that behavior.  It might be more appropriate to simply
say it forces a less-than-100% duty cycle.
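
A small shell illustration of the division being described, assuming
512-byte sectors and a full 128K record (this only sketches the arithmetic,
not ZFS's actual allocation code):

  recordsize=131072
  sectors=$((recordsize / 512))                 # 256 sectors per record
  for n in 4 5 8; do                            # number of data disks
      printf '%d data disks: %d sectors/disk, remainder %d\n' \
          "$n" "$((sectors / n))" "$((sectors % n))"
  done
  # 4 data disks: 64 sectors/disk, remainder 0
  # 5 data disks: 51 sectors/disk, remainder 1   (rounds up, hence padding)
  # 8 data disks: 32 sectors/disk, remainder 0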

And another thing ... Doesn't the checksum take up some space anyway?  Even
if you obeyed the BPG and used ... let's say ... 4 disks for N ... then each
disk has 32K of data to write, which is a multiple of 4K and 512b ... but
each disk also needs to write the checksum.  So each disk writes 32K + a few
bytes.  Which defeats the whole purpose anyway, doesn't it?

The effect, if real at all, might be negligible.  I don't know how small it
is, but I'm quite certain it's not huge.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Bill Sommerfeld

On 09/09/10 20:08, Edward Ned Harvey wrote:

Scores so far:
2 No
1 Yes


No.  Resilver does not re-lay out your data or change what's in the block 
pointers on disk.  If it was fragmented before, it will be fragmented after.



C) Does zfs send | zfs receive mean it will defrag?


Scores so far:
1 No
2 Yes


Maybe.  If there is sufficient contiguous free space in the destination 
pool, files may be less fragmented.


But if you do incremental sends of multiple snapshots, you may well 
replicate some or all the fragmentation on the origin (because snapshots 
only copy the blocks that change, and receiving an incremental send does 
the same).
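
A hedged sketch of the difference, with hypothetical dataset and snapshot
names -- a full stream rewrites everything on the receiving side, while an
incremental stream only ships the blocks that changed between the two
snapshots:

  zfs send tank/fs@snap1 | zfs receive backup/fs                    # full stream
  zfs send -i tank/fs@snap1 tank/fs@snap2 | zfs receive backup/fs   # incremental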


And if the destination pool is short on space you may end up more 
fragmented than the source.


- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss