Re: [PERFORM] RAID stripe size question

2006-08-04 Thread Mikael Carneholm
 WRT seek performance, we're doing 2500 seeks per second on the
Sun/Thumper on 36 disks.  

Luke, 

Have you had time to run benchmarksql against it yet? I'm just curious
about the IO seeks/s vs. transactions/minute correlation...

/Mikael








Re: [PERFORM] RAID stripe size question

2006-08-03 Thread Merlin Moncure

On 8/3/06, Luke Lonergan [EMAIL PROTECTED] wrote:

Merlin,

 moving a gigabyte around/sec on the server, attached or no,
 is pretty heavy lifting on x86 hardware.



Maybe so, but we're doing 2GB/s plus on Sun/Thumper with software RAID
and 36 disks and 1GB/s on a HW RAID with 16 disks, all SATA.


that is pretty amazing, that works out to 55 MB/sec/drive, close to
theoretical maximums. are you using pci-e sata controller and raptors
im guessing?  this is doubly impressive if we are talking raid 5 here.
do you find that software raid is generally better than hardware at
the high end?  how much does this tax the cpu?


WRT seek performance, we're doing 2500 seeks per second on the
Sun/Thumper on 36 disks.  You might do better with 15K RPM disks and
great controllers, but I haven't seen it reported yet.


that's pretty amazing too.  only a highly optimized raid system can
pull this off.


BTW - I'm curious about the HP P600 SAS host based RAID controller - it
has very good specs, but is the Linux driver solid?


have no clue.  i sure hope i don't go through the same headaches as
with ibm scsi drivers (rebranded adaptec btw).  sas looks really
promising however.  the adaptec sas gear is so cheap it might be worth
it to just buy some and see what it can do.

merlin



Re: [PERFORM] RAID stripe size question

2006-08-02 Thread Merlin Moncure

On 7/18/06, Alex Turner [EMAIL PROTECTED] wrote:

Remember when it comes to OLTP, massive serial throughput is not gonna help
you, it's low seek times, which is why people still buy 15k RPM drives, and
why you don't necessarily need a honking SAS/SATA controller which can
harness the full 1066MB/sec of your PCI-X bus, or more for PCIe.  Of course,


hm. i'm starting to look seriously at SAS to take things to the next
level.  it's really not all that expensive, cheaper than scsi even,
and you can mix/match sata/sas drives in the better enclosures.  the
real wild card here is the raid controller.  i still think raptors are
the best bang for the buck and SAS gives me everything i like about
sata and scsi in one package.

moving a gigabyte around/sec on the server, attached or no, is pretty
heavy lifting on x86 hardware.

merlin



Re: [PERFORM] RAID stripe size question

2006-08-02 Thread Luke Lonergan
Merlin,

 moving a gigabyte around/sec on the server, attached or no, 
 is pretty heavy lifting on x86 hardware.

Maybe so, but we're doing 2GB/s plus on Sun/Thumper with software RAID
and 36 disks and 1GB/s on a HW RAID with 16 disks, all SATA.

WRT seek performance, we're doing 2500 seeks per second on the
Sun/Thumper on 36 disks.  You might do better with 15K RPM disks and
great controllers, but I haven't seen it reported yet.

BTW - I'm curious about the HP P600 SAS host based RAID controller - it
has very good specs, but is the Linux driver solid?

- Luke




Re: [PERFORM] RAID stripe size question

2006-07-18 Thread Ron Peacetree
From: Alex Turner [EMAIL PROTECTED]
Sent: Jul 18, 2006 12:21 AM
To: Ron Peacetree [EMAIL PROTECTED]
Cc: Mikael Carneholm [EMAIL PROTECTED], pgsql-performance@postgresql.org
Subject: Re: [PERFORM] RAID stripe size question

On 7/17/06, Ron Peacetree [EMAIL PROTECTED] wrote:

 -Original Message-
 From: Mikael Carneholm [EMAIL PROTECTED]
 Sent: Jul 17, 2006 5:16 PM
 To: Ron  Peacetree [EMAIL PROTECTED],
 pgsql-performance@postgresql.org
 Subject: RE: [PERFORM] RAID stripe size question
 
 I use 90% of the raid cache for writes, don't think I could go higher
 than that.
 Too bad the emulex only has 256Mb though :/
 
 If your RAID cache hit rates are in the 90+% range, you probably would
 find it profitable to make it greater.  I've definitely seen access patterns
 that benefitted from increased RAID cache for any size I could actually
 install.  For those access patterns, no amount of RAID cache commercially
 available was enough to find the flattening point of the cache percentage
 curve.  256MB of BB RAID cache per HBA is just not that much for many IO
 patterns.

90% as in 90% of the RAM, not 90% hit rate I'm imagining.

Either way, =particularly= for OLTP-like I/O patterns, the more RAID cache the 
better, unless the IO pattern is completely random - in which case the best you 
can do is cache the entire sector map of the RAID set and use as many spindles 
as possible for the tables involved.  I've seen high end setups in Fortune 
2000 organizations that look like some of the things you read about on tpc.org: 
=hundreds= of HDs are used.

Clearly, completely random IO patterns are to be avoided whenever and however 
possible.

Thankfully, most things can be designed to not have completely random IO, and 
stuff like WAL IO is definitely not random.

The important point here about cache size is that unless you make cache large 
enough that you see a flattening in the cache behavior, you probably can still 
use more cache.  Working sets are often very large for DB applications.

 
The controller is a FC2143 (
 http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=configProductLineId=450FamilyId=1449BaseId=17621oi=E9CEDBEID=19701SBLID=),
 which uses PCI-E. Don't know how it compares to other controllers, haven't
 had the time to search for / read any reviews yet.
 
 This is a relatively low end HBA with 1 4Gb FC on it.  Max sustained IO on
 it is going to be ~320MBps.  Or ~ enough for an 8 HD RAID 10 set made of
 75MBps ASTR HD's.

 28 such HDs are =definitely= IO choked on this HBA.

No, they aren't.  This is OLTP, not data warehousing.  I already posted math
for OLTP throughput, which is on the order of 8-80MB/second actual data
throughput based on maximum theoretical seeks/second.
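
A rough sketch of that arithmetic, assuming the 8kb PostgreSQL block size
mentioned later in the thread and a seek rate somewhere in the range being
discussed (the specific rates below are illustrative, not from the posts):

    # OLTP throughput estimate: if every page touched costs a seek,
    # actual data throughput ~= seeks/second * page size.
    page_size = 8 * 1024                      # PostgreSQL block size (8kb)
    for seeks_per_sec in (1000, 2500, 10000):
        mb_per_sec = seeks_per_sec * page_size / 1e6
        print(f"{seeks_per_sec:>6} seeks/s -> ~{mb_per_sec:.0f} MB/s of 8kb pages")
    # ~1,000 seeks/s gives ~8 MB/s and ~10,000 seeks/s gives ~82 MB/s,
    # bracketing the 8-80MB/second figure above.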

WAL IO patterns are not OLTP-like.  Neither are most support or decision 
support IO patterns.  Even  in an OLTP system, there are usually only a few 
scenarios and tables where the IO pattern is pessimal.
Alex is quite correct that those few will be the bottleneck on overall system 
performance if the system's primary function is OLTP-like.

For those few, you dedicate as many spindles and as much RAID cache as you can 
afford, for as long as they show any performance benefit.  I've seen an entire HBA 
maxed out with cache and as many HDs as would saturate the attainable IO rate 
dedicated to =1= table (unfortunately SSD was not a viable option in this case).


The arithmetic suggests you need a better HBA or more HBAs or both.


 WAL's are basically appends that are written in bursts of your chosen
 log chunk size and that are almost never read afterwards.  Big DB pages and
 big RAID stripes makes sense for WALs.


unless of course you are running OLTP, in which case a big stripe isn't
necessary; spend the disks on your data partition, because your WAL activity
is going to be small compared with your random IO.

Or to put it another way, the scenarios and tables that have the most random 
looking IO patterns are going to be the performance bottleneck on the whole 
system.  In an OLTP-like system, WAL IO is unlikely to be your biggest 
performance issue.  As in any other performance tuning effort, you only gain by 
speeding up the current bottleneck.



 According to
 http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it
 seems to be the other way around? (As stripe size is decreased, files are
 broken into smaller and smaller pieces. This increases the number of drives
 that an average file will use to hold all the blocks containing the data of
 that file, theoretically increasing transfer performance, but decreasing
 positioning performance.)
 
 I guess I'll have to find out which theory that holds by good ol' trial
 and error... :)
 
 IME, stripe sizes of 64, 128, or 256 are the most common found to be
 optimal for most access patterns + SW + FS + OS + HW.


New records will be posted at the end of a file, and will only increase the
file by the number of blocks in the transactions posted at write time.
Updated records

Re: [PERFORM] RAID stripe size question

2006-07-18 Thread Mikael Carneholm
 This is a relatively low end HBA with 1 4Gb FC on it.  Max sustained
IO on it is going to be ~320MBps.  Or ~ enough for an 8 HD RAID 10 set
made of 75MBps ASTR HD's.

Looking at http://h30094.www3.hp.com/product.asp?sku=2260908extended=1,
I notice that the controller has an Ultra160 SCSI interface, which implies
that the theoretical max throughput is 160MB/s. Ouch.

However, what's more important is the seeks/s - ~530/s on a 28 disk
array is quite lousy compared to the 1400/s on a 12 x 15K disk array as
mentioned by Mark here:
http://archives.postgresql.org/pgsql-performance/2006-07/msg00170.php.
Could be the disk RPM (10K vs 15K) that makes the difference here...

I will test another stripe size (128K) for the DATA lun (28 disks) to
see what difference that makes, I think I read somewhere that linux
flushes blocks of 128K at a time, so it might be worth evaluating.

/Mikael





Re: [PERFORM] RAID stripe size question

2006-07-18 Thread Luke Lonergan



Mikael,

On 7/18/06 6:34 AM, Mikael Carneholm [EMAIL PROTECTED] wrote:

 However, what's more important is the seeks/s - ~530/s on a 28 disk
 array is quite lousy compared to the 1400/s on a 12 x 15Kdisk array

I'm getting 2500 seeks/second on a 36 disk SATA software RAID (ZFS, Solaris 10) on a Sun X4500:

=== Single Stream 

With a very recent update to the zfs module that improves I/O scheduling and prefetching, I get the following bonnie++ 1.03a results with a 36 drive RAID10, Solaris 10 U2 on an X4500 with 500GB Hitachi drives (zfs checksumming is off):

Version 1.03 --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
thumperdw-i-1 32G 120453 99 467814 98 290391 58 109371 99 993344 94 1801 4
--Sequential Create-- Random Create
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 + +++ + +++ + +++ 30850 99 + +++ + +++

=== Two Streams 

Bumping up the number of concurrent processes to 2, we get about 1.5x speed reads of RAID10 with a concurrent workload (you have to add the rates together): 

Version 1.03 --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
thumperdw-i-1 32G 111441 95 212536 54 171798 51 106184 98 719472 88 1233 2
--Sequential Create-- Random Create
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 26085 90 + +++ 5700 98 21448 97 + +++ 4381 97

Version 1.03 --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
thumperdw-i-1 32G 116355 99 212509 54 171647 50 106112 98 715030 87 1274 3
--Sequential Create-- Random Create
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 26082 99 + +++ 5588 98 21399 88 + +++ 4272 97

So that's 2500 seeks per second, 1440MB/s sequential block read, 212MB/s per character sequential read.
===
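
Those totals come from adding the two concurrent streams, as noted above; a
quick sketch of the addition using the K/sec and /sec figures from the
two-stream runs:

    # Sum the two concurrent bonnie++ streams (block read and per-char read in
    # K/sec, seeks in /sec), since each stream is reported separately.
    block_read = (719472 + 715030) / 1000.0   # ~1434 MB/s sequential block read
    per_char   = (106184 + 106112) / 1000.0   # ~212 MB/s per-character read
    seeks      = 1233 + 1274                  # ~2507 seeks/s
    print(f"~{block_read:.0f} MB/s block read, ~{per_char:.0f} MB/s per-char, ~{seeks} seeks/s")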

- Luke





Re: [PERFORM] RAID stripe size question

2006-07-18 Thread Alex Turner
This is a great testament to the fact that very often software RAID will seriously outperform hardware RAID because the OS guys who implemented it took the time to do it right, as compared with some controller manufacturers who seem to think it's okay to provide sub-standard performance.

Based on the bonnie++ numbers coming back from your array, I would also encourage you to evaluate software RAID, as you might see significantly better performance as a result. RAID 10 is also a good candidate as it's not so heavy on the cache and CPU as RAID 5.

Alex.

On 7/18/06, Luke Lonergan [EMAIL PROTECTED] wrote:





 [Luke's post with the X4500 bonnie++ results quoted in full - see the previous message]







Re: [PERFORM] RAID stripe size question

2006-07-18 Thread Scott Marlowe
On Tue, 2006-07-18 at 14:27, Alex Turner wrote:
 This is a great testament to the fact that very often software RAID
 will seriously outperform hardware RAID because the OS guys who
 implemented it took the time to do it right, as compared with some
 controller manufacturers who seem to think it's okay to provided
 sub-standard performance. 
 
 Based on the bonnie++ numbers comming back from your array, I would
 also encourage you to evaluate software RAID, as you might see
 significantly better performance as a result.  RAID 10 is also a good
 candidate as it's not so heavy on the cache and CPU as RAID 5. 

Also, consider testing a mix, where your hardware RAID controller does
the mirroring and the OS stripes (RAID 0) over the top of it.  I've
gotten good performance from mediocre hardware cards doing this.  It has
the advantage of still being able to use the battery backed cache and
its instant fsync while not relying on some cards that have issues
layering RAID levels one atop the other.



Re: [PERFORM] RAID stripe size question

2006-07-18 Thread Ron Peacetree
Have you done any experiments implementing RAID 50 this way (HBA does RAID 5, 
OS does RAID 0)?  If so, what were the results?

Ron

-Original Message-
From: Scott Marlowe [EMAIL PROTECTED]
Sent: Jul 18, 2006 3:37 PM
To: Alex Turner [EMAIL PROTECTED]
Cc: Luke Lonergan [EMAIL PROTECTED], Mikael Carneholm [EMAIL PROTECTED], 
Ron Peacetree [EMAIL PROTECTED], pgsql-performance@postgresql.org
Subject: Re: [PERFORM] RAID stripe size question

[Scott's message quoted in full - see the previous post]




Re: [PERFORM] RAID stripe size question

2006-07-18 Thread Scott Marlowe
Nope, haven't tried that.  At the time I was testing this I didn't even
think of trying it.  I'm not even sure I'd heard of RAID 50 at the
time... :)

I basically had an old MegaRAID 4xx series card in a dual PPro 200 and a
stack of six 9 gig hard drives.  Spare parts.  And even though the RAID
1+0 was relatively much faster on this hardware, the Dual P IV 2800 with
a pair of 15k USCSI drives and a much later model MegaRAID ate it for
lunch with a single mirror set, and was plenty fast for our use at the
time, so I never really had call to test it in production.

But it definitely made our test server, the aforementioned PPro200
machine, more livable.

On Tue, 2006-07-18 at 14:43, Ron Peacetree wrote:
 Have you done any experiments implementing RAID 50 this way (HBA does RAID 5, 
 OS does RAID 0)?  If so, what were the results?
 
 Ron
 
 [the rest of Ron's message, quoting Scott's earlier post, snipped]
 



Re: [PERFORM] RAID stripe size question

2006-07-18 Thread Milen Kulev
According to 
http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems 
to be the other way around?
(As stripe size is decreased, files are broken into smaller and smaller 
pieces. This increases the number of drives
that an average file will use to hold all the blocks containing the data of 
that file, 

-theoretically increasing transfer performance, but decreasing positioning 
performance.)

Mikael,
In OLTP you utterly need the best possible latency.  If you decompose the response
time of your physical request, you will see that positioning performance plays the
dominant role in the response time (ignore for a moment caches and their effects).

So, if you need really good response times for your SQL queries, choose 15Krpm
disks (and add as much cache as possible to magnify the effect ;) )

Best Regards. 
Milen 




Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Alex Turner
With 18 disks dedicated to data, you could make 100/7*9 seeks/second (7ms av
seek time, 9 independent units) which is ~128 seeks/second, writing on average
64kb of data, which is 4.1MB/sec throughput worst case, probably 10x best case
so ~40MB/sec - you might want to take more disks for your data and less for
your WAL.

Someone check my math here... (a rough sketch of the arithmetic follows the
quoted message below.)  And as always - run benchmarks with your app to verify.

Alex.

On 7/16/06, Mikael Carneholm [EMAIL PROTECTED] wrote:









I have finally gotten my hands on the MSA1500 that we ordered some time ago. It has 28 x 10K 146Gb drives, currently grouped as 10 (for wal) + 18 (for data). There's only one controller (an emulex), but I hope performance won't suffer too much from that. Raid level is 0+1, filesystem is ext3. 


Now to the interesting part: would it make sense to use different stripe sizes on the separate disk arrays? In theory, a smaller stripe size (8-32K) should increase sequential write throughput at the cost of decreased positioning performance, which sounds good for WAL (assuming WAL is never searched during normal operation). And for disks holding the data, a larger stripe size (>32K) should provide for more concurrent (small) reads/writes at the cost of decreased raw throughput. This is with an OLTP type application in mind, so I'd rather have high transaction throughput than high sequential read speed. The interface is a 2Gb FC so I'm throttled to (theoretically) 192MB/s, anyway.


So, does this make sense? Has anyone tried it and seen any performance gains from it?


Regards,

Mikael.
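
Since Alex invites a math check, here is a rough sketch of the general formula,
plugging in the 7ms seek time, 9 independent mirror pairs and 64kb average
write from his post. His figures (128 seeks/second, 4.1MB/sec worst case) are
derated well below this, presumably for rotational latency, mirrored writes and
queueing, so treat the sketch as an upper bound only:

    # Upper-bound random IO estimate: IOs/s = units * (1000ms / seek_ms),
    # worst-case throughput = IOs/s * average IO size.
    seek_ms = 7.0          # average seek time assumed in the post
    units   = 9            # independent mirror pairs (18 data disks in RAID 10)
    io_kb   = 64           # average write size assumed in the post

    ios_per_sec = units * (1000.0 / seek_ms)      # ~1286 IOs/s
    mb_per_sec  = ios_per_sec * io_kb / 1024.0    # ~80 MB/s
    print(f"~{ios_per_sec:.0f} IOs/s, ~{mb_per_sec:.0f} MB/s worst-case random IO")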







Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Mikael Carneholm
Yeah, it seems to be a waste of disk space (spindles as well?). I was
unsure how much activity the WAL disks would have compared to the data
disks, so I created an array from 10 disks as the application is very
write intense (many spindles / high throughput is crucial). I guess that
a mirror of two disks is enough from a disk space perspective, but from
a throughput perspective it will limit me to ~25MB/s (roughly
calculated). 

A 0+1 array of 4 disks *could* be enough, but I'm still unsure how WAL
activity correlates to normal data activity (is it 1:1, 1:2, 1:4,
...?) 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Michael
Stone
Sent: den 17 juli 2006 02:04
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] RAID stripe size question

On Mon, Jul 17, 2006 at 12:52:17AM +0200, Mikael Carneholm wrote:
I have finally gotten my hands on the MSA1500 that we ordered some time

ago. It has 28 x 10K 146Gb drives, currently grouped as 10 (for wal) +
18 (for data). There's only one controller (an emulex), but I hope

You've got 1.4TB assigned to the WAL, which doesn't normally have more
than a couple of gigs?

Mike Stone



Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Markus Schaber
Hi, Mikael,

Mikael Carneholm wrote:
 An 0+1 array of 4 disks *could* be enough, but I'm still unsure how WAL
 activity correlates to normal data activity (is it 1:1, 1:2, 1:4,
 ...?) 

I think the main difference is that the WAL activity is mostly linear,
where the normal data activity is rather random access. Thus, a mirror
of a few disks (or, with good controller hardware, raid6 on 4 disks or so)
for WAL should be enough to cope with a large set of data and index
disks, which have a lot more time spent in seeking.

Btw, it may make sense to spread different tables or tables and indices
onto different Raid-Sets, as you seem to have enough spindles.

And look into the commit_delay/commit_siblings settings, they allow you
to trade latency for throughput (meaning a little more latency per
transaction, but many more transactions per second throughput for the
whole system.)


HTH,
Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org



Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Mikael Carneholm
I think the main difference is that the WAL activity is mostly linear,
where the normal data activity is rather random access. 

That was what I was expecting, and after reading
http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html I
figured that a different stripe size for the WAL set could be worth
investigating. I have now dropped the old sets (10+18) and created two
new raid1+0 sets (4 for WAL, 24 for data) instead. Bonnie++ is still
running, but I'll post the numbers as soon as it has finished. I did
actually use different stripe sizes for the sets as well, 8k for the WAL
disks and 64k for the data. It's quite painless to do these things with
HBAnywhere, so it's no big deal if I have to go back to another
configuration. The battery cache only has 256Mb though and that botheres
me, I assume a larger (512Mb - 1Gb) cache would make quite a difference.
Oh well.

Btw, it may make sense to spread different tables or tables and indices
onto different Raid-Sets, as you seem to have enough spindles.

This is something I'd also like to test, as a common best-practice
these days is to go for a SAME (stripe all, mirror everything) setup.
From a development perspective it's easier to use SAME as the developers
won't have to think about physical location for new tables/indices, so
if there's no performance penalty with SAME I'll gladly keep it that
way.

And look into the commit_delay/commit_siblings settings, they allow you
to deal latency for throughput (means a little more latency per
transaction, but much more transactions per second throughput for the
whole system.)

In a previous test, using cd=5000 and cs=20 increased transaction
throughput by ~20% so I'll definitely fiddle with that in the coming
tests as well.

Regards,
Mikael.



Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Markus Schaber
Hi, Mikael,

Mikael Carneholm wrote:

 This is something I'd also would like to test, as a common best-practice
 these days is to go for a SAME (stripe all, mirror everything) setup.
 From a development perspective it's easier to use SAME as the developers
 won't have to think about physical location for new tables/indices, so
 if there's no performance penalty with SAME I'll gladly keep it that
 way.

Usually, it's not the developer's task to care about that, but the DBA's
responsibility.

 And look into the commit_delay/commit_siblings settings, they allow you
 to deal latency for throughput (means a little more latency per
 transaction, but much more transactions per second throughput for the
 whole system.)
 
 In a previous test, using cd=5000 and cs=20 increased transaction
 throughput by ~20% so I'll definitely fiddle with that in the coming
 tests as well.

How many parallel transactions do you have?

Markus



-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org



Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Mikael Carneholm
 This is something I'd also would like to test, as a common 
 best-practice these days is to go for a SAME (stripe all, mirror
everything) setup.
 From a development perspective it's easier to use SAME as the 
 developers won't have to think about physical location for new 
 tables/indices, so if there's no performance penalty with SAME I'll 
 gladly keep it that way.

Usually, it's not the developers task to care about that, but the DBAs
responsibility.

As we don't have a full-time dedicated DBA (although I'm the one who does
most DBA related tasks) I would aim for making physical location as
transparent as possible, otherwise I'm afraid I won't be doing anything
else than supporting developers with that - and I *do* have other things
to do as well :)

 In a previous test, using cd=5000 and cs=20 increased transaction 
 throughput by ~20% so I'll definitely fiddle with that in the coming 
 tests as well.

How many parallel transactions do you have?

That was when running BenchmarkSQL
(http://sourceforge.net/projects/benchmarksql) with 100 concurrent users
(terminals), which I assume means 100 parallel transactions at most.
The target application for this DB has 3-4 times as many concurrent
connections so it's possible that one would have to find other cs/cd
numbers better suited for that scenario. Tweaking bgwriter is another
task I'll look into as well..

Btw, here's the bonnie++ results from two different array sets (10+18,
4+24) on the MSA1500:

LUN: WAL, 10 disks, stripe size 32K
-----------------------------------
Version  1.03   --Sequential Output-- --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine    Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01    32G 56139  93 73250  22 16530   3 30488  45 57489   5  477.3   1
                --Sequential Create-- Random Create
                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
          files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
             16  2458  90    + +++    + +++   3121  99    + +++  10469  98


LUN: WAL, 4 disks, stripe size 8K
---------------------------------
Version  1.03   --Sequential Output-- --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine    Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01    32G 49170  82 60108  19 13325   2 15778  24 21489   2  266.4   0
                --Sequential Create-- Random Create
                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
          files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
             16  2432  86    + +++    + +++   3106  99    + +++  10248  98


LUN: DATA, 18 disks, stripe size 32K
------------------------------------
Version  1.03   --Sequential Output-- --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine    Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01    32G 59990  97 87341  28 19158   4 30200  46 57556   6  495.4   1
                --Sequential Create-- Random Create
                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
          files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
             16  1640  92    + +++    + +++   1736  99    + +++  10919  99


LUN: DATA, 24 disks, stripe size 64K
------------------------------------
Version  1.03   --Sequential Output-- --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine    Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01    32G 59443  97 118515 39 25023   5 30926  49 60835   6  531.8   1
                --Sequential Create-- Random Create
                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
          files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
             16  2499  90    + +++    + +++   2817  99    + +++  10971 100
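
As a rough way to read these rows, a small sketch converting the K/sec figures
from the 24-disk DATA run into MB/s and comparing them against the ~192MB/s
that the 2Gbit FC link can carry (the FC figure is the one quoted earlier in
the thread):

    # DATA LUN, 24 disks, 64K stripe - figures from the bonnie++ row above.
    block_write = 118515 / 1000.0    # ~119 MB/s sequential block write
    block_read  = 60835 / 1000.0     # ~61 MB/s sequential block read
    seeks       = 531.8              # random seeks/s
    fc_limit    = 192.0              # 2Gbit FC ceiling (MB/s) quoted in this thread

    print(f"write ~{block_write:.0f} MB/s, read ~{block_read:.0f} MB/s "
          f"({block_read / fc_limit:.0%} of the FC link), ~{seeks:.0f} seeks/s")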

Regards,
Mikael



Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Ron Peacetree
From: Mikael Carneholm [EMAIL PROTECTED]
Sent: Jul 16, 2006 6:52 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] RAID stripe size question

I have finally gotten my hands on the MSA1500 that we ordered some time
ago. It has 28 x 10K 146Gb drives,

Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 
15K, not 10K.
(unless they are old?)
I'm not just being pedantic.  The correct, let alone optimal, answer to your 
question depends on your exact HW characteristics as well as your SW config and 
your usage pattern.
15Krpm HDs will have average access times of 5-6ms.  10Krpm ones of 7-8ms.
Most modern HDs in this class will do ~60MB/s inner tracks ~75MB/s avg and 
~90MB/s outer tracks.

If you are doing OLTP-like things, you are more sensitive to latency than most 
and should use the absolute lowest latency HDs available within you budget.  
The current latency best case is 15Krpm FC HDs.


currently grouped as 10 (for wal) + 18 (for data). There's only one controller 
(an emulex), but I hope
performance won't suffer too much from that. Raid level is 0+1,
filesystem is ext3. 

I strongly suspect having only 1 controller is an I/O choke w/ 28 HDs.

28 HDs as above, set up as 2 RAID 10's = ~75MBps*5 = ~375MB/s, ~75*9 = ~675MB/s.
If both sets are to run at peak average speed, the Emulex would have to be able 
to handle ~1050MBps on average.
It is doubtful the 1 Emulex can do this.

In order to handle this level of bandwidth, a RAID controller must aggregate 
multiple FC, SCSI, or SATA streams as well as do any RAID 5 checksumming etc 
that is required.
Very, very few RAID controllers can do >= 1GBps.
One thing that helps greatly with bursty IO patterns is to up your battery 
backed RAID cache as high as you possibly can.  Even multiple GBs of BBC can be 
worth it.  Another reason to have multiple controllers ;-)

Then there is the question of the BW of the bus that the controller is plugged 
into.
~800MB/s is the RW max to be gotten from a 64b 133MHz PCI-X channel.
PCI-E channels are usually good for 1/10 their rated speed in bps as Bps.
So a PCI-Ex4 10Gbps bus can be counted on for 1GBps, PCI-Ex8 for 2GBps, etc.
At present I know of no RAID controllers that can singly saturate a PCI-Ex4 
or greater bus.

...and we haven't even touched on OS, SW, and usage pattern issues.

Bottom line is that the IO chain is only as fast as its slowest component.
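
A sketch of that chain for the setup under discussion, using the ~75MBps per
drive and RAID 10 split from this post, the ~320MBps sustained estimate given
for the FC2143 elsewhere in the thread, and the PCI-E x4 rule of thumb above;
the slowest component wins:

    # The IO chain is only as fast as its slowest component.
    drive_mbps   = 75
    wal_set      = 5 * drive_mbps      # 10-disk RAID 10 -> ~375 MB/s streaming
    data_set     = 9 * drive_mbps      # 18-disk RAID 10 -> ~675 MB/s streaming
    hba_mbps     = 320                 # ~max sustained IO quoted for the FC2143
    pcie_x4_mbps = 1000                # PCI-E x4 rule of thumb from this post

    demand = wal_set + data_set        # ~1050 MB/s if both sets run flat out
    ceiling = min(demand, hba_mbps, pcie_x4_mbps)
    print(f"disks could source ~{demand} MB/s, chain tops out at ~{ceiling} MB/s (the HBA)")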


Now to the interesting part: would it make sense to use different stripe
sizes on the separate disk arrays? 

The short answer is Yes.
WAL's are basically appends that are written in bursts of your chosen log chunk 
size and that are almost never read afterwards.  Big DB pages and big RAID 
stripes makes sense for WALs.

Tables with OLTP-like characteristics need smaller DB pages and stripes to 
minimize latency issues (although locality of reference can make the optimum 
stripe size larger).

Tables with Data Mining like characteristics usually work best with larger DB 
pages sizes and RAID stripe sizes.

OS and FS overhead can make things more complicated.  So can DB layout and 
access pattern issues.

Side note: a 10 HD RAID 10 seems a bit much for WAL.  Do you really need 
375MBps IO on average to your WAL more than you need IO capacity for other 
tables?
If WAL IO needs to be very high, I'd suggest getting a SSD or SSD-like device 
that fits your budget and having said device async mirror to HD. 

Bottom line is to optimize your RAID stripe sizes =after= you optimize your OS, 
FS, and pg design for best IO for your usage pattern(s).

Hope this helps,
Ron



Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Steinar H. Gunderson
On Mon, Jul 17, 2006 at 09:40:30AM -0400, Ron Peacetree wrote:
 Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 
 15K, not 10K.
 (unless they are old?)

There are still 146GB SCSI 10Krpm disks being sold here, at least.

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Alex Turner
On 7/17/06, Mikael Carneholm [EMAIL PROTECTED] wrote:

 [Mikael's reply and bonnie++ results quoted in full - see his post above]

These bonnie++ numbers are very worrying.  Your controller should easily max
out your FC interface on these tests, passing 192MB/sec with ease on anything
more than a 6 drive RAID 10.  This is a bad omen if you want high
performance...  Each mirror pair can do 60-80MB/sec.  A 24 disk RAID 10 can do
12*60MB/sec which is 720MB/sec - I have seen this performance, it's not
unreachable, but time and again we see these bad perf numbers from FC and SCSI
systems alike.  Consider a different controller, because this one is not up to
snuff.  A single drive would get better numbers than your 4 disk RAID 10;
21MB/sec read speed is really pretty sorry, it should be closer to 120MB/sec.
If you can't swap out, software RAID may turn out to be your friend.  The only
saving grace is that this is OLTP, and perhaps, just maybe, the controller will
be better at ordering IOs, but I highly doubt it.

Please people, do the numbers, benchmark before you buy, many many HBAs really
suck under Linux/FreeBSD, and you may end up paying vast sums of money for very
sub-optimal performance (I'd say sub-standard, but alas, it seems that this
kind of poor performance is tolerated, even though it's way off where it should
be).  There's no point having a 40 disk cab, if your controller can't handle

Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Mikael Carneholm
Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 
15K, not 10K.

In the spec we got from HP, they are listed as model 286716-B22 
(http://www.dealtime.com/xPF-Compaq_HP_146_8_GB_286716_B22) which seems to run 
at 10K. Don't know how old those are, but that's what we got from HP anyway.

15Krpm HDs will have average access times of 5-6ms.  10Krpm ones of 7-8ms.

Average seek time for that disk is listed as 4.9ms, maybe sounds a bit 
optimistic?

 28HDs as above setup as 2 RAID 10's = ~75MBps*5= ~375MB/s,  ~75*9= ~675MB/s.

I guess it's still limited by the 2Gbit FC (192MB/s), right?

Very, very few RAID controllers can do >= 1GBps.  One thing that helps greatly 
with bursty IO patterns is to up your battery backed RAID cache as high as you 
possibly can.  Even multiple GBs of BBC can be worth it.  Another reason to 
have multiple controllers ;-)

I use 90% of the raid cache for writes, don't think I could go higher than 
that. Too bad the emulex only has 256Mb though :/

Then there is the question of the BW of the bus that the controller is plugged 
into.
~800MB/s is the RW max to be gotten from a 64b 133MHz PCI-X channel.
PCI-E channels are usually good for 1/10 their rated speed in bps as Bps.
So a PCI-Ex4 10Gbps bus can be counted on for 1GBps, PCI-Ex8 for 2GBps, etc.
At present I know of no RAID controllers that can singlely saturate a PCI-Ex4 
or greater bus.

The controller is a FC2143 
(http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=configProductLineId=450FamilyId=1449BaseId=17621oi=E9CEDBEID=19701SBLID=),
 which uses PCI-E. Don't know how it compares to other controllers, haven't had 
the time to search for / read any reviews yet.

Now to the interesting part: would it make sense to use different 
stripe sizes on the separate disk arrays?

The short answer is Yes.

Ok

WAL's are basically appends that are written in bursts of your chosen log 
chunk size and that are almost never read afterwards.  Big DB pages and big 
RAID stripes makes sense for WALs.

According to 
http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems 
to be the other way around? (As stripe size is decreased, files are broken 
into smaller and smaller pieces. This increases the number of drives that an 
average file will use to hold all the blocks containing the data of that file, 
theoretically increasing transfer performance, but decreasing positioning 
performance.)

I guess I'll have to find out which theory that holds by good ol' trial and 
error... :)

- Mikael



Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Mark Kirkwood

Mikael Carneholm wrote:


Btw, here's the bonnie++ results from two different array sets (10+18,
4+24) on the MSA1500:


LUN: DATA, 24 disks, stripe size 64K
-
Version  1.03   --Sequential Output-- --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine    Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01    32G 59443  97 118515 39 25023   5 30926  49 60835   6  531.8   1
                --Sequential Create-- Random Create
                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
          files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
             16  2499  90    + +++    + +++   2817  99    + +++  10971 100




It might be interesting to see if 128K or 256K stripe size gives better 
sequential throughput, while still leaving the random performance ok. 
Having said that, the seeks/s figure of 531 is not that great - for 
instance I've seen a 12 disk (15K SCSI) system report about 1400 seeks/s 
in this test.


Sorry if you mentioned this already - but what OS and filesystem are you 
using? (if Linux and ext3, it might be worth experimenting with xfs or jfs).


Cheers

Mark



Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Ron Peacetree
-Original Message-
From: Mikael Carneholm [EMAIL PROTECTED]
Sent: Jul 17, 2006 5:16 PM
To: Ron  Peacetree [EMAIL PROTECTED], pgsql-performance@postgresql.org
Subject: RE: [PERFORM] RAID stripe size question

15Krpm HDs will have average access times of 5-6ms.  10Krpm ones of 7-8ms.

Average seek time for that disk is listed as 4.9ms, maybe sounds a bit 
optimistic?

Ah, the games vendors play.  average seek time for a 10Krpm HD may very well 
be 4.9ms.  However, what matters to you the user is average =access= time.  
The 1st is how long it takes to position the heads to the correct track.  The 
2nd is how long it takes to actually find and get data from a specified HD 
sector.
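
A short sketch of the difference, using the usual approximation that average
access time = average seek time + half a platter rotation; the 4.9ms seek
figure is the one from the HP spec quoted above, while the 3.8ms 15Krpm seek
time is a typical vendor number, not one from this thread:

    # Average rotational latency is half a revolution: 60000 ms/min / RPM / 2.
    def access_ms(seek_ms, rpm):
        return seek_ms + 60000.0 / rpm / 2.0

    print(f"10Krpm, 4.9ms seek -> ~{access_ms(4.9, 10000):.1f}ms access")  # ~7.9ms
    print(f"15Krpm, 3.8ms seek -> ~{access_ms(3.8, 15000):.1f}ms access")  # ~5.8ms
    # These line up with the 7-8ms (10K) and 5-6ms (15K) access times quoted earlier.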

 28HDs as above setup as 2 RAID 10's = ~75MBps*5= ~375MB/s,  ~75*9= ~675MB/s.

I guess it's still limited by the 2Gbit FC (192Mb/s), right?

No.  A decent HBA has multiple IO channels on it.  So for instance Areca's 
ARC-6080 (8/12/16-port 4Gbps Fibre-to-SATA II Controller) has 2 4Gbps FCs in it 
(...and can support up to 4GB of BB cache!).  Nominally, this card can push 
8Gbps = 800MBps.  ~600-700MBps is the RW number.

Assuming ~75MBps ASTR per HD, that's ~ enough bandwidth for a 16 HD RAID 10 set 
per ARC-6080.  
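
The sizing above as a one-line sketch - real-world controller bandwidth divided
by per-drive streaming rate gives the number of mirror pairs one such card can
keep busy:

    hba_mbps   = 650       # midpoint of the ~600-700MBps real-world figure above
    drive_mbps = 75        # ~75MBps average sustained transfer rate per HD
    pairs = hba_mbps // drive_mbps
    print(f"~{pairs} mirror pairs -> roughly a {2 * pairs} HD RAID 10 set per controller")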

Very, very few RAID controllers can do >= 1GBps.  One thing that helps greatly 
with bursty IO patterns is to up your battery backed RAID cache as high as you 
possibly can.  Even multiple GBs of BBC can be worth it.  
Another reason to have multiple controllers ;-)

I use 90% of the raid cache for writes, don't think I could go higher than 
that. 
Too bad the emulex only has 256Mb though :/

If your RAID cache hit rates are in the 90+% range, you probably would find it 
profitable to make it greater.  I've definitely seen access patterns that 
benefitted from increased RAID cache for any size I could actually install.  
For those access patterns, no amount of RAID cache commercially available was 
enough to find the flattening point of the cache percentage curve.  256MB of 
BB RAID cache per HBA is just not that much for many IO patterns.


The controller is a FC2143 
(http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=configProductLineId=450FamilyId=1449BaseId=17621oi=E9CEDBEID=19701SBLID=),
 which uses PCI-E. Don't know how it compares to other controllers, haven't 
had the time to search for / read any reviews yet.

This is a relatively low end HBA with 1 4Gb FC on it.  Max sustained IO on it 
is going to be ~320MBps.  Or ~ enough for an 8 HD RAID 10 set made of 75MBps 
ASTR HD's.

28 such HDs are =definitely= IO choked on this HBA.  

The arithmetic suggests you need a better HBA or more HBAs or both.


WAL's are basically appends that are written in bursts of your chosen log 
chunk size and that are almost never read afterwards.  Big DB pages and big 
RAID stripes makes sense for WALs.

According to 
http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems 
to be the other way around? (As stripe size is decreased, files are broken 
into smaller and smaller pieces. This increases the number of drives that an 
average file will use to hold all the blocks containing the data of that file, 
theoretically increasing transfer performance, but decreasing positioning 
performance.)

I guess I'll have to find out which theory that holds by good ol' trial and 
error... :)

IME, stripe sizes of 64, 128, or 256 are the most common found to be optimal 
for most access patterns + SW + FS + OS + HW.




Re: [PERFORM] RAID stripe size question

2006-07-17 Thread Alex Turner
On 7/17/06, Ron Peacetree [EMAIL PROTECTED] wrote:

 If your RAID cache hit rates are in the 90+% range, you probably would find it
 profitable to make it greater. [...] 256MB of BB RAID cache per HBA is just not
 that much for many IO patterns.

90% as in 90% of the RAM, not 90% hit rate I'm imagining.

 This is a relatively low end HBA with 1 4Gb FC on it.  Max sustained IO on it
 is going to be ~320MBps.  Or ~ enough for an 8 HD RAID 10 set made of 75MBps
 ASTR HD's.

 28 such HDs are =definitely= IO choked on this HBA.

No, they aren't.  This is OLTP, not data warehousing.  I already posted math
for OLTP throughput, which is on the order of 8-80MB/second actual data
throughput based on maximum theoretical seeks/second.

 WAL's are basically appends that are written in bursts of your chosen log
 chunk size and that are almost never read afterwards.  Big DB pages and big
 RAID stripes makes sense for WALs.

Unless of course you are running OLTP, in which case a big stripe isn't
necessary; spend the disks on your data partition, because your WAL activity
is going to be small compared with your random IO.

 IME, stripe sizes of 64, 128, or 256 are the most common found to be optimal
 for most access patterns + SW + FS + OS + HW.

New records will be posted at the end of a file, and will only increase the
file by the number of blocks in the transactions posted at write time.  Updated
records are modified in place unless they have grown too big to be in place.
If you are updating multiple tables on each transaction, a 64kb stripe size or
lower is probably going to be best as block sizes are just 8kb.  How much data
does your average transaction write?  How many xacts per second?  This will
help determine how many writes your cache will queue up before it flushes, and
therefore what the optimal stripe size will be.  Of course, the fastest and
most accurate way is probably just to try different settings and see how it
works.  Alas, some controllers seem to handle some stripe sizes more
efficiently in defiance of any logic.

Work out how big your xacts are, how many xacts/second you can post, and you
will figure out how fast WAL will be written.  Allocate enough disk for peak
load plus planned expansion on WAL and then put the rest to tablespace.  You
may well find that a single RAID 1 is enough for WAL (if you achieve
theoretical performance levels,
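
A sketch of that calculation; the transaction size and rate below are made-up
illustration values rather than numbers from the thread, and a single RAID 1
pair is assumed to sustain roughly 60MB/s of sequential writes:

    # Estimate sustained WAL write rate from transaction size and rate.
    avg_xact_kb   = 16     # hypothetical average WAL bytes written per transaction
    xacts_per_sec = 500    # hypothetical peak transaction rate
    raid1_mbps    = 60     # assumed sequential write rate of one mirror pair

    wal_mbps = avg_xact_kb * xacts_per_sec / 1024.0   # ~7.8 MB/s at this load
    enough = "is" if wal_mbps < raid1_mbps else "is not"
    print(f"~{wal_mbps:.1f} MB/s of WAL; a single RAID 1 (~{raid1_mbps} MB/s) {enough} enough")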

Re: [PERFORM] RAID stripe size question

2006-07-16 Thread Steinar H. Gunderson
On Mon, Jul 17, 2006 at 12:52:17AM +0200, Mikael Carneholm wrote:
 Now to the interesting part: would it make sense to use different stripe
 sizes on the separate disk arrays? In theory, a smaller stripe size
 (8-32K) should increase sequential write throughput at the cost of
 decreased positioning performance, which sounds good for WAL (assuming
 WAL is never searched during normal operation).

For large writes (ie. sequential write throughput), it doesn't really matter
what the stripe size is; all the disks will have to both seek and write
anyhow.

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] RAID stripe size question

2006-07-16 Thread Michael Stone

On Mon, Jul 17, 2006 at 12:52:17AM +0200, Mikael Carneholm wrote:

I have finally gotten my hands on the MSA1500 that we ordered some time
ago. It has 28 x 10K 146Gb drives, currently grouped as 10 (for wal) +
18 (for data). There's only one controller (an emulex), but I hope


You've got 1.4TB assigned to the WAL, which doesn't normally have more 
than a couple of gigs?


Mike Stone
