Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-13 Thread Hatish Narotam
Hi,

*The PCIe x8 port gives me 4GB/s, which is 32Gbps. No problem there. Each
eSATA port guarantees 3Gbps, therefore a 12Gbps limit on the controller.*

I was simply listing the bandwidth available at the different stages of the
data cycle. The PCIe port gives me 32Gbps. The SATA card gives me a possible
12Gbps. I'd rather be cautious and assume I'll get more like 6Gbps; it is a
cheap card, after all.

*I guarantee you this is not a sustainable speed for 7.2krpm sata disks.* (I
am well aware :) )

* Which is 333% of the PM's capability. *

Assuming that it is, 5 drives at that speed will max out my PM 3 times over.
So my PM will automatically throttle the drives' speed to roughly a third of
that, because the PM will be maxed out.
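
To put numbers on that, a minimal sketch (the figures are the nominal ones
above; the 2Gbps per drive is the optimistic spec figure, not a sustained rate):

    # Rough bottleneck-chain estimate for one PM group of 5 drives.
    # All figures are the nominal numbers above, in Gbit/s.
    PCIE_SLOT_GBPS = 32.0      # PCIe x8 slot (theoretical)
    CONTROLLER_GBPS = 12.0     # 4 eSATA ports x 3Gbps
    PM_GBPS = 3.0              # one port-multiplier uplink
    DRIVE_GBPS = 2.0           # optimistic per-drive sequential read

    drives_per_pm = 5
    raw_group_read = drives_per_pm * DRIVE_GBPS            # 10 Gbps if unconstrained
    effective_group_read = min(raw_group_read, PM_GBPS)    # PM uplink caps it at 3 Gbps
    per_drive_effective = effective_group_read / drives_per_pm

    pool_ceiling = min(4 * effective_group_read, CONTROLLER_GBPS, PCIE_SLOT_GBPS)

    print("per-PM group: %.1f Gbps -> ~%.2f Gbps per drive"
          % (effective_group_read, per_drive_effective))
    print("whole-pool ceiling: %.1f Gbps" % pool_ceiling)
    # -> each drive is throttled to roughly a third of its max read speed,
    #    and the pool as a whole tops out at the 12 Gbps controller limit.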

Thanks for the rough IO speed check :)


On Thu, Sep 9, 2010 at 3:20 PM, Edward Ned Harvey wrote:

> > From: Hatish Narotam [mailto:hat...@gmail.com]
> >
> > PCIe x8 4-port eSATA RAID controller.
> > 4 x eSATA-to-5-SATA port multipliers (each connected to an eSATA port on
> > the controller).
> > 20 x Samsung 1TB HDDs (each connected to a port multiplier).
>
> Assuming your disks can all sustain 500Mbit/sec, which I find to be typical
> for 7200rpm sata disks, and you have groups of 5 that all have a 3Gbit
> upstream bottleneck, it means each of your groups of 5 should be fine in a
> raidz1 configuration.
>
> You think that your sata card can do 32Gbit because it's on a PCIe x8 bus.
> I highly doubt it unless you paid a grand or two for your sata controller,
> but please prove me wrong.  ;-)  I think the backplane of the sata
> controller is more likely either 3G or 6G.
>
> If it's 3G, then you should use 4 groups of raidz1.
> If it's 6G, then you can use 2 groups of raidz2 (because 10 drives of
> 500Mbit can only sustain 5Gbit)
> If it's 12G or higher, then you can make all of your drives one big vdev of
> raidz3.
>
>
> > According to Samsung's site, max read speed is 250MBps, which
> > translates to 2Gbps. Multiply by 5 drives gives you 10Gbps.
>
> I guarantee you this is not a sustainable speed for 7.2krpm sata disks.
>  You
> can get a decent measure of sustainable speed by doing something like:
>(write 1G byte)
>time dd if=/dev/zero of=/some/file bs=1024k count=1024
>(beware: you might get an inaccurate speed measurement here
>due to ram buffering.  See below.)
>
>(reboot to ensure nothing is in cache)
>(read 1G byte)
>time dd if=/some/file of=/dev/null bs=1024k
>(Now you're certain you have a good measurement.
>If it matches the measurement you had before,
>that means your original measurement was also
>accurate.  ;-) )
>
>
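
Putting Ned's rule of thumb into a quick sketch (assuming his ~500Mbit/s
sustained per-disk figure and my 20 disks; the thresholds come straight from
his message, so this is illustrative only):

    # Sketch of the vdev-layout rule quoted above: pick the widest layout whose
    # aggregate disk throughput still fits under the controller backplane limit.
    PER_DISK_GBPS = 0.5       # ~500 Mbit/s sustained per 7200rpm SATA disk
    TOTAL_DISKS = 20

    def suggested_layout(backplane_gbps):
        if backplane_gbps >= TOTAL_DISKS * PER_DISK_GBPS:   # 10 Gbit/s: all 20 disks at once
            return "one 20-disk raidz3 vdev"
        if backplane_gbps >= 10 * PER_DISK_GBPS:            # 5 Gbit/s: a 10-disk vdev at a time
            return "2 x 10-disk raidz2 vdevs"
        return "4 x 5-disk raidz1 vdevs"                    # 5 disks stay under one 3 Gbit/s link

    for bw in (3, 6, 12):
        print("%2d Gbit/s backplane -> %s" % (bw, suggested_layout(bw)))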


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-13 Thread Hatish Narotam
Ah, I see. But I think your math is a bit out:

62.5e6 I/Os @ 100 IOPS
= 625,000 seconds
= ~10,416 minutes
= ~173 hours
= ~7 days 6 hours.

So 7 days and 6 hours. That's long, but I can live with it. This isn't for an
enterprise environment. While the length of time is a worry in terms of
increasing the chance that another drive will fail, in my mind that is mitigated
by the fact that the drives won't be under major stress during that time. It's
a workable solution.
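
For completeness, the same arithmetic as a tiny sketch (using Erik's figures of
62.5e6 I/Os and ~100 IOPS per 7200rpm drive):

    # Resilver time if the rebuild is IOPS-bound (figures from Erik's post).
    total_ios = 62.5e6   # ~1TB / 16KB reconstructed per I/O
    iops = 100           # rough ceiling for a 7200rpm SATA drive

    seconds = total_ios / iops
    hours = seconds / 3600
    days, rem_hours = divmod(hours, 24)
    print("%.0f s = %.1f h = %d days %.0f hours" % (seconds, hours, days, rem_hours))
    # -> 625000 s = ~173.6 h = 7 days 6 hours, i.e. days rather than weeks.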

On Thu, Sep 9, 2010 at 3:03 PM, Erik Trimble wrote:

>  On 9/9/2010 5:49 AM, hatish wrote:
>
>> Very interesting...
>>
>> Well, let's see if we can do the numbers for my setup.
>>
>>  From a previous post of mine:
>>
>> [i]This is my exact breakdown (cheap disks on cheap bus :P) :
>>
>>
>> PCIe x8 4-port eSATA RAID controller.
>> 4 x eSATA-to-5-SATA port multipliers (each connected to an eSATA port on the
>> controller).
>> 20 x Samsung 1TB HDDs (each connected to a port multiplier).
>>
>> The PCIe x8 port gives me 4GB/s, which is 32Gbps. No problem there. Each
>> eSATA port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each
>> PM can give up to 3Gbps, which is shared amongst 5 drives. According to
>> Samsung's site, the max read speed is 250MB/s, which translates to 2Gbps.
>> Multiplying by 5 drives gives you 10Gbps, which is 333% of the PM's capability.
>> So the drives aren't likely to hit max read speed for long stretches of time,
>> especially during rebuild time.
>>
>> So the bus is going to be quite a bottleneck. Let's assume that the drives
>> are 80% full. That's 800GB that needs to be read from each drive, which is
>> (800 x 9) 7.2TB.
>> Best case scenario, we can read 7.2TB at 3Gbps
>> = 57.6Tb at 3Gbps
>> = 57,600Gb at 3Gbps
>> = 19,200 seconds
>> = 320 minutes
>> = 5 hours 20 minutes.
>>
>> Even if it takes twice that amount of time, I'm happy.
>>
>> Initially I had been thinking 2 PMs for each vdev. But now I'm thinking
>> maybe I should split it as wide as I can ([2 data disks per PM] x 2, [2 data
>> disks & 1 parity disk per PM] x 2) for each vdev. It'll give the best
>> possible speed, but still won't max out the HDDs.
>>
>> I've never actually sat and done the math before. Hope it's decently
>> accurate :)[/i]
>>
>> My scenario, as from Erik's post:
>> Scenario: I have 10 1TB disks in a raidz2, and I have 128k
>> slab sizes. Thus, I have 16k of data for each slab written to each
>> disk. (8x16k data + 32k parity for a 128k slab size). So, each IOPS
>> gets to reconstruct 16k of data on the failed drive. It thus takes
>> about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.
>>
>> Let's assume the drives are at 95% capacity, which is a pretty bad
>> scenario. So that's 7600GB, which is 60800Gb. There will be no other IO while
>> a rebuild is going on.
>> Best case: I'll read at 12Gbps and write at 3Gbps (4:1). I read 128K for
>> every 16K I write (8:1). Hence the read bandwidth will be the bottleneck. So
>> 60800Gb @ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more
>> realistic read of 1.5Gbps gives me 40533s, which is 675m33s, which is
>> 11h15m33s. Which is a more realistic time to read 7.6TB.
>>
>
>
> Actually, your biggest bottleneck will be the IOPS limits of the drives.  A
> 7200RPM SATA drive tops out at 100 IOPS.  Yup. That's it.
>
> So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100
> IOPS, that means you will finish (best case) in 62.5e4 seconds.  Which is
> over 173 hours. Or, about 7.25 WEEKS.
>
>
> --
> Erik Trimble
> Java System Support
> Mailstop:  usca22-123
> Phone:  x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)
>
>


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-13 Thread Hatish Narotam
Mattias, what you say makes a lot of sense. When I saw *Both of the above
situations resilver in equal time*, I was like "no way!" But like you said,
assuming no bus bottlenecks.

This is my exact breakdown (cheap disks on cheap bus :P) :

PCIe x8 4-port eSATA RAID controller.
4 x eSATA-to-5-SATA port multipliers (each connected to an eSATA port on the
controller).
20 x Samsung 1TB HDDs (each connected to a port multiplier).

The PCIe x8 port gives me 4GB/s, which is 32Gbps. No problem there. Each
eSATA port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each
PM can give up to 3Gbps, which is shared amongst 5 drives. According to
Samsung's site, the max read speed is 250MB/s, which translates to 2Gbps.
Multiplying by 5 drives gives you 10Gbps, which is 333% of the PM's capability.
So the drives aren't likely to hit max read speed for long stretches of time,
especially during rebuild time.

So the bus is going to be quite a bottleneck. Let's assume that the drives
are 80% full. That's 800GB that needs to be read from each drive, which is
(800 x 9) 7.2TB.
Best case scenario, we can read 7.2TB at 3Gbps
= 57.6Tb at 3Gbps
= 57,600Gb at 3Gbps
= 19,200 seconds
= 320 minutes
= 5 hours 20 minutes.
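
The same calculation as a small sketch, with the assumptions spelled out
(80% full drives, 9 surviving drives, reads effectively limited to one
3Gbit/s PM uplink):

    # Time to read the surviving data during a rebuild, using the numbers above.
    data_per_drive_gb = 800          # 80% of a 1TB drive, in GB
    surviving_drives = 9             # 10-disk raidz2 with one drive failed
    read_gbps = 3.0                  # assumed effective read bandwidth (one PM uplink)

    total_gbit = data_per_drive_gb * surviving_drives * 8   # GB -> Gbit
    seconds = total_gbit / read_gbps
    print("read %.1f TB in %.0f s = %.0f min = %.1f h"
          % (data_per_drive_gb * surviving_drives / 1000.0,
             seconds, seconds / 60, seconds / 3600))
    # -> 7.2 TB at 3 Gbit/s is 19200 s = 320 min = 5 h 20 min, as above.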

Even if it takes twice that amount of time, I'm happy.

Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe
I should split it as wide as I can ([2 disks per PM] x 2, [3 disks per PM] x 2)
for each vdev. It'll give the best possible speed, but still won't max out
the HDDs.
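
To make the "split it wide" idea concrete, here is one possible mapping (just
my sketch of it, not a tested layout):

    # One possible wide layout: each 10-disk vdev spreads across all 4 PMs,
    # 2 disks on two of the PMs and 3 disks on the other two (2+2+3+3 = 10).
    layout = {
        "vdev-A": {"PM1": 2, "PM2": 2, "PM3": 3, "PM4": 3},
        "vdev-B": {"PM1": 3, "PM2": 3, "PM3": 2, "PM4": 2},
    }
    for vdev, pms in layout.items():
        assert sum(pms.values()) == 10       # each vdev still has 10 disks
        print(vdev, pms)
    # Each PM still carries 5 disks in total (2 from one vdev + 3 from the other),
    # so every uplink shares 3 Gbit/s amongst 5 drives, but any single vdev has
    # at most 3 of its disks behind one PM.
    for pm in ("PM1", "PM2", "PM3", "PM4"):
        assert sum(layout[v][pm] for v in layout) == 5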

I've never actually sat and done the math before. Hope it's decently accurate
:)

On Wed, Sep 8, 2010 at 3:27 PM, Edward Ned Harvey wrote:

> > From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of
> > Mattias Pantzare
> >
> > It
> > is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2
> > vdevs you have to read half the data compared to 1 vdev to resilver a
> > disk.
>
> Let's suppose you have 1T of data.  You have 12-disk raidz2.  So you have
> approx 100G on each disk, and you replace one disk.  Then 11 disks will
> each
> read 100G, and the new disk will write 100G.
>
> Let's suppose you have 1T of data.  You have 2 vdev's that are each 6-disk
> raidz1.  Then we'll estimate 500G is on each vdev, so each disk has approx
> 100G.  You replace a disk.  Then 5 disks will each read 100G, and 1 disk
> will write 100G.
>
> Both of the above situations resilver in equal time, unless there is a bus
> bottleneck.  21 disks in a single raidz3 will resilver just as fast as 7
> disks in a raidz1, as long as you are avoiding the bus bottleneck.  But 21
> disks in a single raidz3 provides better redundancy than 3 vdev's each
> containing a 7 disk raidz1.
>
> In my personal experience, approx 5 disks can max out approx 1 bus.  (It
> actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks
> on a good bus, or good disks on a crap bus, but generally speaking people
> don't do that.  Generally people get a good bus for good disks, and cheap
> disks for crap bus, so approx 5 disks max out approx 1 bus.)
>
> In my personal experience, servers are generally built with a separate bus
> for approx every 5-7 disk slots.  So what it really comes down to is ...
>
> Instead of the Best Practices Guide saying "Don't put more than ___ disks
> into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck
> by constructing your vdev's using physical disks which are distributed
> across multiple buses, as necessary per the speed of your disks and buses."
>
>
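
Just to check the "equal resilver time" claim above under its own assumptions
(1TB of pool data, evenly spread, parity included):

    # Per-disk on-disk data (including parity) for the two layouts Ned compares,
    # with 1 TB of pool data spread evenly.
    pool_data_gb = 1000.0

    def per_disk_gb(vdevs, disks_per_vdev, parity_disks):
        data_per_vdev = pool_data_gb / vdevs                    # user data per vdev
        on_disk = data_per_vdev * disks_per_vdev / (disks_per_vdev - parity_disks)
        return on_disk / disks_per_vdev                         # per disk, parity included

    print("1 x 12-disk raidz2 : ~%.0f GB per disk" % per_disk_gb(1, 12, 2))
    print("2 x 6-disk raidz1  : ~%.0f GB per disk" % per_disk_gb(2, 6, 1))
    # Both come to ~100 GB per disk, so each surviving disk reads the same amount
    # either way -- hence "equal resilver time", as long as no bus is saturated.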


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-13 Thread Hatish Narotam
Makes sense. My understanding is not good enough to confidently make my own
decisions, and I'm learning as I go. The BPG says:

   - The recommended number of disks per group is between 3 and 9. If you
   have more disks, use multiple groups.

If there was a reason leading up to this statement, I didn't follow it.

However, a few paragraphs later, their RaidZ2 example says [4x(9+2), 2 hot
spares, 18.0 TB]. So I guess 8+2 should be quite acceptable, especially
since performance is the lowest priority.
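
Sanity-checking that against my layout (a minimal sketch; note the 0.5TB drive
size for the BPG example is my guess to make their 18.0 TB figure work out --
the guide doesn't say):

    # Usable capacity = vdevs * (disks per vdev - parity) * drive size.
    def usable_tb(vdevs, disks_per_vdev, parity, drive_tb):
        return vdevs * (disks_per_vdev - parity) * drive_tb

    # BPG example: 4 x (9+2) raidz2 + 2 hot spares; 18 TB only works out
    # if their drives are 0.5 TB each (my assumption).
    print("BPG example 4 x (9+2):", usable_tb(4, 11, 2, 0.5), "TB")   # -> 18.0

    # My case: 2 x (8+2) raidz2 with 1 TB drives.
    print("2 x 10-disk raidz2:   ", usable_tb(2, 10, 2, 1.0), "TB")   # -> 16.0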



On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey wrote:

> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of hatish
> >
> > I have just
> > read the Best Practices guide, and it says your group shouldn't have > 9
> > disks.
>
> I think the value you can take from this is:
> Why does the BPG say that?  What is the reasoning behind it?
>
> Anything that is a "rule of thumb" either has reasoning behind it (you
> should know the reasoning) or it doesn't (you should ignore the rule of
> thumb, dismiss it as myth.)
>
>


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread hatish
Ahhh! So that's how the formula works. That makes perfect sense.

Let's take my case as a scenario:

Each of my vdevs is a 10-disk RaidZ2 (8 data + 2 parity). Using a 128K stripe,
I'll have 128K/8 = 16K per data drive and 16K per parity drive. That fits both
512B and 4KB sectors.

It works in my favour that I'll have high average file sizes (>250MB). So I'll 
see minimal effect from the "fragmentation" mentioned.
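
A tiny check of that division for a few recordsizes, assuming the simple model
of the recordsize being split evenly across the 8 data drives:

    # recordsize split across the data drives of a raidz2 vdev,
    # checked against 512-byte and 4K sector sizes.
    data_drives = 8                       # 10-disk raidz2 = 8 data + 2 parity
    for recordsize_kb in (128, 64, 32, 16):
        per_disk = recordsize_kb * 1024 // data_drives
        print("recordsize %3dK -> %5d bytes per data drive; 512B ok: %s, 4K ok: %s"
              % (recordsize_kb, per_disk, per_disk % 512 == 0, per_disk % 4096 == 0))
    # 128K/8 = 16K per drive, which divides cleanly by both 512B and 4K.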


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread hatish
Very interesting...

Well, let's see if we can do the numbers for my setup.

From a previous post of mine:

[i]This is my exact breakdown (cheap disks on cheap bus :P) :

PCIe x8 4-port eSATA RAID controller.
4 x eSATA-to-5-SATA port multipliers (each connected to an eSATA port on the
controller).
20 x Samsung 1TB HDDs (each connected to a port multiplier).

The PCIe x8 port gives me 4GB/s, which is 32Gbps. No problem there. Each eSATA
port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can
give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site,
the max read speed is 250MB/s, which translates to 2Gbps. Multiplying by 5 drives
gives you 10Gbps, which is 333% of the PM's capability. So the drives aren't
likely to hit max read speed for long stretches of time, especially during
rebuild time.

So the bus is going to be quite a bottleneck. Let's assume that the drives are
80% full. That's 800GB that needs to be read from each drive, which is (800 x 9)
7.2TB.
Best case scenario, we can read 7.2TB at 3Gbps
= 57.6Tb at 3Gbps
= 57,600Gb at 3Gbps
= 19,200 seconds
= 320 minutes
= 5 hours 20 minutes.

Even if it takes twice that amount of time, I'm happy.

Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe
I should split it as wide as I can ([2 data disks per PM] x 2, [2 data disks &
1 parity disk per PM] x 2) for each vdev. It'll give the best possible speed,
but still won't max out the HDDs.

I've never actually sat and done the math before. Hope it's decently accurate
:)[/i]

My scenario, as from Erik's post:
Scenario: I have 10 1TB disks in a raidz2, and I have 128k
slab sizes. Thus, I have 16k of data for each slab written to each
disk. (8x16k data + 32k parity for a 128k slab size). So, each IOPS
gets to reconstruct 16k of data on the failed drive. It thus takes
about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.

Let's assume the drives are at 95% capacity, which is a pretty bad scenario. So
that's 7600GB, which is 60800Gb. There will be no other IO while a rebuild is
going on.
Best case: I'll read at 12Gbps and write at 3Gbps (4:1). I read 128K for every
16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb
@ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read
of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. Which is a
more realistic time to read 7.6TB.
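
Putting both bounds side by side in a quick sketch (95% full, 16K reconstructed
per I/O as in Erik's scenario, and a rough 100 IOPS per 7200rpm drive):

    # Two ways of bounding the rebuild time for a 10-disk raidz2 of 1TB drives at 95% full.
    used_fraction = 0.95
    drive_tb = 1.0

    # (a) IOPS-bound: one 16K chunk reconstructed per I/O on the failed drive.
    chunk_kb = 16
    ios_needed = used_fraction * drive_tb * 1e9 / chunk_kb    # KB on the drive / 16KB
    iops = 100
    print("IOPS-bound estimate : %.0f hours" % (ios_needed / iops / 3600))

    # (b) Bandwidth-bound: read the surviving 8 data drives' worth of data.
    data_gbit = used_fraction * 8 * drive_tb * 1000 * 8       # GB -> Gbit
    for read_gbps in (12.0, 1.5):
        print("read at %4.1f Gbit/s : %.1f hours" % (read_gbps, data_gbit / read_gbps / 3600))
    # The IOPS-bound figure (~165 h here) dwarfs the bandwidth-bound one, so the
    # seek rate of the resilvering vdev, not the bus, ends up being the real limit.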


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-08 Thread hatish
Rebuild time is not a concern for me. The concern with rebuilding was the
stress it puts on the disks for an extended period of time (increasing the
chances of another disk failure). The % of data used doesn't matter, as the
system will try to get it done at max speed, thus creating the mentioned
stress. But I suspect the port multipliers will do a good job of throttling the
IO such that the disks face minimal stress. Thus I'm pretty sure I'll stick with
2 x 10-disk RaidZ2.

Thanks for all the input!


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-07 Thread hatish
Thanks for all the replies :)

My mindset is split in two now...

Some detail: I'm using four 1-to-5 SATA port multipliers connected to a 4-port
SATA RAID card.

I only need reliability and size; as long as my performance is the equivalent
of one drive, I'm happy.

I'm assuming all the data used in the group is read once when re-creating a lost
drive. I'm also assuming space consumed is 50%.

So option 1 - stay with the 2 x 10-drive RaidZ2. My concern is the stress on the
drives when one drive fails and the others go crazy (read-wise) to re-create
the new drive. Is there no way to reduce this stress? Maybe limit the data
rate, so it's not quite so stressful, even though it will end up taking longer?
(quite acceptable)
[Available space: 16TB, redundancy space: 4TB, repair data read: 4.5TB]

And option 2 - add a 21st drive to one of the motherboard SATA ports, and then
go with 3 x 7-drive RaidZ2. [Available space: 15TB, redundancy space: 6TB,
repair data read: 3TB]
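
A small sketch reproducing those bracketed numbers, under the same assumption
that every disk is 50% full:

    # Usable space, parity overhead, and per-rebuild read volume for the two options.
    drive_tb = 1.0
    fill = 0.5   # assumed 50% of each disk in use

    def summarize(name, vdevs, disks_per_vdev, parity):
        usable = vdevs * (disks_per_vdev - parity) * drive_tb
        redundancy = vdevs * parity * drive_tb
        # A rebuild reads the used data from the surviving disks of one vdev.
        repair_read = (disks_per_vdev - 1) * drive_tb * fill
        print("%s: usable %.0f TB, redundancy %.0f TB, repair read %.1f TB"
              % (name, usable, redundancy, repair_read))

    summarize("Option 1 (2 x 10-disk RaidZ2)", 2, 10, 2)
    summarize("Option 2 (3 x 7-disk RaidZ2) ", 3, 7, 2)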

Sadly, SSDs won't go too well in a PM-based setup like mine. I may add one
directly onto the MB if I can afford it. But again, performance is not a
priority.

Any further thoughts and ideas are much appreciated.


[zfs-discuss] Suggested RaidZ configuration...

2010-09-06 Thread hatish
I'm setting up a server with 20 x 1TB disks. Initially I had thought to set up the
disks using 2 RaidZ2 groups of 10 disks. However, I have just read the Best
Practices Guide, and it says your group shouldn't have > 9 disks. So I'm thinking
a better configuration would be 2 x 7-disk RaidZ2 + 1 x 6-disk RaidZ2. However,
that's 14TB worth of data instead of 16TB.

What are your suggestions and experiences?