subject:"Re\\\: \\\[PERFORM\\\] How to improve db performance with $7K\\\?"

On Thu, Apr 14, 2005 at 10:51:46AM -0500, Matthew Nuzum wrote:
 So if you all were going to choose between two hard drives where:
 drive A has capacity C and spins at 15K rpms, and
 drive B has capacity 2 x C and spins at 10K rpms and
 all other features are the same, the price is the same and C is enough
 disk space which would you choose?
 
 I've noticed that on IDE drives, as the capacity increases the data
 density increases and there is a pereceived (I've not measured it)
 performance increase.
 
 Would the increased data density of the higher capacity drive be of
 greater benefit than the faster spindle speed of drive A?

The increased data density will help transfer speed off the platter, but
that's it. It won't help rotational latency.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [PERFORM] How to improve db performance with $7K?

On Mon, Apr 18, 2005 at 07:41:49PM +0200, Jacques Caron wrote:
 It would be interesting to actually compare this to real-world (or 
 nearly-real-world) benchmarks to measure the effectiveness of features like 
 TCQ/NCQ etc.

I was just thinking that it would be very interesting to benchmark
different RAID configurations using dbt2. I don't know if this is
something that the lab is setup for or capable of, though.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] How to improve db performance with $7K?

On Mon, Apr 18, 2005 at 10:20:36AM -0500, Dave Held wrote:
 Hmm...so you're saying that at some point, quantity beats quality?
 That's an interesting point.  However, it presumes that you can
 actually distribute your data over a larger number of drives.  If
 you have a db with a bottleneck of one or two very large tables,
 the extra spindles won't help unless you break up the tables and
 glue them together with query magic.  But it's still a point to
 consider.

Huh? Do you know how RAID10 works?
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] How to improve db performance with $7K?

On Mon, Apr 18, 2005 at 06:41:37PM -, Mohan, Ross wrote:
 Don't you think optimal stripe width would be
 a good question to research the binaries for? I'd
 think that drives the answer, largely.  (uh oh, pun alert)
 
 EG, oracle issues IO requests (this may have changed _just_ 
 recently) in 64KB chunks, regardless of what you ask for. 
 So when I did my striping (many moons ago, when the Earth 
 was young...) I did it in 128KB widths, and set the oracle 
 multiblock read count according. For oracle, any stripe size
 under 64KB=stupid, anything much over 128K/258K=wasteful. 
 
 I am eager to find out how PG handles all this. 

AFAIK PostgreSQL requests data one database page at a time (normally
8k). Of course the OS might do something different.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] How to improve db performance with $7K?

On Tue, Apr 19, 2005 at 11:22:17AM -0500, [EMAIL PROTECTED] wrote:
 
 
 [EMAIL PROTECTED] wrote on 04/19/2005 11:10:22 AM:
 
  What is 'multiple initiators' used for in the real world?
 
 I asked this same question and got an answer off list:  Somebody said their
 SAN hardware used multiple initiators.  I would try to check the archives
 for you, but this thread is becoming more of a rope.
 
 Multiple initiators means multiple sources on the bus issuing I/O
 instructions to the drives.  In theory you can have two computers on the
 same SCSI bus issuing I/O requests to the same drive, or to anything else
 on the bus, but I've never seen this implemented.  Others have noted this
 feature as being a big deal, so somebody is benefiting from it.

It's a big deal for Oracle clustering, which relies on shared drives. Of
course most people doing Oracle clustering are probably using a SAN and
not raw SCSI...
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread William Yu

Problem with this strategy. You want battery-backed write caching for 
best performance  safety. (I've tried IDE for WAL before w/ write 
caching off -- the DB got crippled whenever I had to copy files from/to 
the drive on the WAL partition -- ended up just moving WAL back on the 
same SCSI drive as the main DB.) That means in addition to a $$$ SCSI 
caching controller, you also need a $$$ SATA caching controller. From my 
glance at prices, advanced SATA controllers seem to cost nearly as their 
SCSI counterparts.

This also looks to be the case for the drives themselves. Sure you can 
get super cheap 7200RPM SATA drives but they absolutely suck for 
database work. Believe me, I gave it a try once -- ugh. The highend WD 
10K Raptors look pretty good though -- the benchmarks @ storagereview 
seem to put these drives at about 90% of SCSI 10Ks for both single-user 
and multi-user. However, they're also priced like SCSIs -- here's what I 
found @ Mwave (going through pricewatch to find WD740GDs):

Seagate 7200 SATA -- 80GB$59
WD 10K SATA   -- 72GB$182
Seagate 10K U320  -- 72GB$289
Using the above prices for a fixed budget for RAID-10, you could get:
SATA 7200 -- 680MB per $1000
SATA 10K  -- 200MB per $1000
SCSI 10K  -- 125MB per $1000
For a 99% read-only DB that required lots of disk space (say something 
like Wikipedia or blog host), using consumer level SATA probably is ok. 
For anything else, I'd consider SATA 10K if (1) I do not need 15K RPM 
and (2) I don't have SCSI intrastructure already.

Steve Poe wrote:
If SATA drives don't have the ability to replace SCSI for a multi-user
Postgres apps, but you needed to save on cost (ALWAYS an issue), 
could/would you implement SATA for your logs (pg_xlog) and keep the rest 
on SCSI?

Steve Poe
Mohan, Ross wrote:
I've been doing some reading up on this, trying to keep up here, and 
have found out that (experts, just yawn and cover your ears)

1) some SATA drives (just type II, I think?) have a Phase Zero
   implementation of Tagged Command Queueing (the special sauce
   for SCSI).
2) This SATA TCQ is called NCQ and I believe it basically
   allows the disk software itself to do the reordering
   (this is called simple in TCQ terminology) It does not
   yet allow the TCQ head of queue command, allowing the
   current tagged request to go to head of queue, which is
   a simple way of manifesting a high priority request.
3) SATA drives are not yet multi-initiator?
Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives
are likely to whomp SATA II drives for a while yet (read: a
year or two) in multiuser PostGres applications.
-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Greg Stark
Sent: Thursday, April 14, 2005 2:04 PM
To: Kevin Brown
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] How to improve db performance with $7K?

Kevin Brown [EMAIL PROTECTED] writes:
 

Greg Stark wrote:
  

I think you're being misled by analyzing the write case.
Consider the read case. When a user process requests a block and 
that read makes its way down to the driver level, the driver can't 
just put it aside and wait until it's convenient. It has to go ahead 
and issue the read right away.

Well, strictly speaking it doesn't *have* to.  It could delay for a 
couple of milliseconds to see if other requests come in, and then 
issue the read if none do.  If there are already other requests being 
fulfilled, then it'll schedule the request in question just like the 
rest.
  

But then the cure is worse than the disease. You're basically 
describing exactly what does happen anyways, only you're delaying more 
requests than necessary. That intervening time isn't really idle, it's 
filled with all the requests that were delayed during the previous 
large seek...

 

Once the first request has been fulfilled, the driver can now 
schedule the rest of the queued-up requests in disk-layout order.

I really don't see how this is any different between a system that 
has tagged queueing to the disks and one that doesn't.  The only 
difference is where the queueing happens.
  

And *when* it happens. Instead of being able to issue requests while a 
large seek is happening and having some of them satisfied they have to 
wait until that seek is finished and get acted on during the next 
large seek.

If my theory is correct then I would expect bandwidth to be 
essentially equivalent but the latency on SATA drives to be increased 
by about 50% of the average seek time. Ie, while a busy SCSI drive can 
satisfy most requests in about 10ms a busy SATA drive would satisfy 
most requests in 15ms. (add to that that 10k RPM and 15kRPM SCSI 
drives have even lower seek times and no such IDE/SATA drives exist...)

In reality higher latency feeds into a system feedback loop causing 
your application to run slower causing bandwidth demands to be lower 
as well. It's often hard to distinguish root causes from symptoms when 
optimizing

Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread Greg Stark


William Yu [EMAIL PROTECTED] writes:

 Using the above prices for a fixed budget for RAID-10, you could get:
 
 SATA 7200 -- 680MB per $1000
 SATA 10K  -- 200MB per $1000
 SCSI 10K  -- 125MB per $1000

What a lot of these analyses miss is that cheaper == faster because cheaper
means you can buy more spindles for the same price. I'm assuming you picked
equal sized drives to compare so that 200MB/$1000 for SATA is almost twice as
many spindles as the 125MB/$1000. That means it would have almost double the
bandwidth. And the 7200 RPM case would have more than 5x the bandwidth.

While 10k RPM drives have lower seek times, and SCSI drives have a natural
seek time advantage, under load a RAID array with fewer spindles will start
hitting contention sooner which results into higher latency. If the controller
works well the larger SATA arrays above should be able to maintain their
mediocre latency much better under load than the SCSI array with fewer drives
would maintain its low latency response time despite its drives' lower average
seek time.

-- 
greg


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] How to improve db performance with $7K?

This is fundamentaly untrue.

A mirror is still a mirror.  At most in a RAID 10 you can have two
simultaneous seeks.  You are always going to be limited by the seek
time of your drives.  It's a stripe, so you have to read from all
members of the stripe to get data, requiring all drives to seek. 
There is no advantage to seek time in adding more drives.  By adding
more drives you can increase throughput, but the max throughput of the
PCI-X bus isn't that high (I think around 400MB/sec)  You can easily
get this with a six or seven drive RAID 5, or a ten drive RAID 10.  At
that point you start having to factor in the cost of a bigger chassis
to hold more drives, which can be big bucks.

Alex Turner
netEconomist

On 18 Apr 2005 10:59:05 -0400, Greg Stark [EMAIL PROTECTED] wrote:
 
 William Yu [EMAIL PROTECTED] writes:
 
  Using the above prices for a fixed budget for RAID-10, you could get:
 
  SATA 7200 -- 680MB per $1000
  SATA 10K  -- 200MB per $1000
  SCSI 10K  -- 125MB per $1000
 
 What a lot of these analyses miss is that cheaper == faster because cheaper
 means you can buy more spindles for the same price. I'm assuming you picked
 equal sized drives to compare so that 200MB/$1000 for SATA is almost twice as
 many spindles as the 125MB/$1000. That means it would have almost double the
 bandwidth. And the 7200 RPM case would have more than 5x the bandwidth.
 
 While 10k RPM drives have lower seek times, and SCSI drives have a natural
 seek time advantage, under load a RAID array with fewer spindles will start
 hitting contention sooner which results into higher latency. If the controller
 works well the larger SATA arrays above should be able to maintain their
 mediocre latency much better under load than the SCSI array with fewer drives
 would maintain its low latency response time despite its drives' lower average
 seek time.
 
 --
 greg
 
 
 ---(end of broadcast)---
 TIP 9: the planner will ignore your desire to choose an index scan if your
   joining column's datatypes do not match


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] How to improve db performance with $7K?

[snip]
 
 Adding drives will not let you get lower response times than the average seek
 time on your drives*. But it will let you reach that response time more often.
 
[snip]

I believe your assertion is fundamentaly flawed.  Adding more drives
will not let you reach that response time more often.  All drives are
required to fill every request in all RAID levels (except possibly
0+1, but that isn't used for enterprise applicaitons).  Most requests
in OLTP require most of the request time to seek, not to read.  Only
in single large block data transfers will you get any benefit from
adding more drives, which is atypical in most database applications. 
For most database applications, the only way to increase
transactions/sec is to decrease request service time, which is
generaly achieved with better seek times or a better controller card,
or possibly spreading your database accross multiple tablespaces on
seperate paritions.

My assertion therefore is that simply adding more drives to an already
competent* configuration is about as likely to increase your database
effectiveness as swiss cheese is to make your car run faster.

Alex Turner
netEconomist

*Assertion here is that the DBA didn't simply configure all tables and
xlog on a single 7200 RPM disk, but has seperate physical drives for
xlog and tablespace at least on 10k drives.

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] How to improve db performance with $7K?

Hi,
At 18:56 18/04/2005, Alex Turner wrote:
All drives are required to fill every request in all RAID levels
No, this is definitely wrong. In many cases, most drives don't actually 
have the data requested, how could they handle the request?

When reading one random sector, only *one* drive out of N is ever used to 
service any given request, be it RAID 0, 1, 0+1, 1+0 or 5.

When writing:
- in RAID 0, 1 drive
- in RAID 1, RAID 0+1 or 1+0, 2 drives
- in RAID 5, you need to read on all drives and write on 2.
Otherwise, what would be the point of RAID 0, 0+1 or 1+0?
Jacques.

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread Alan Stange

Alex Turner wrote:
[snip]
 

Adding drives will not let you get lower response times than the average seek
time on your drives*. But it will let you reach that response time more often.
   

[snip]
I believe your assertion is fundamentaly flawed.  Adding more drives
will not let you reach that response time more often.  All drives are
required to fill every request in all RAID levels (except possibly
0+1, but that isn't used for enterprise applicaitons).  Most requests
in OLTP require most of the request time to seek, not to read.  Only
in single large block data transfers will you get any benefit from
adding more drives, which is atypical in most database applications. 
For most database applications, the only way to increase
transactions/sec is to decrease request service time, which is
generaly achieved with better seek times or a better controller card,
or possibly spreading your database accross multiple tablespaces on
seperate paritions.

My assertion therefore is that simply adding more drives to an already
competent* configuration is about as likely to increase your database
effectiveness as swiss cheese is to make your car run faster.
 

Consider the case of a mirrored file system with a mostly read() 
workload.  Typical behavior is to use a round-robin method for issueing 
the read operations to each mirror in turn, but one can use other 
methods like a geometric algorithm that will issue the reads to the 
drive with the head located closest to the desired track.Some 
systems have many mirrors of the data for exactly this behavior.   In 
fact, one can carry this logic to the extreme and have one drive for 
every cylinder in the mirror, thus removing seek latencies completely.  
In fact this extreme case would also remove the rotational latency as 
the cylinder will be in the disks read cache.  :-)   Of course, writing 
data would be a bit slow!

I'm not sure I understand your assertion that all drives are required 
to fill every request in all RAID levels.   After all, in mirrored 
reads only one mirror needs to read any given block of data, so I don't 
know what goal is achieved in making other mirrors read the same data.

My assertion (based on ample personal experience) is that one can 
*always* get improved performance by adding more drives.  Just limit the 
drives to use the first few cylinders so that the average seek time is 
greatly reduced and concatenate the drives together.  One can then build 
the usual RAID device out of these concatenated metadevices.  Yes, one 
is wasting lots of disk space, but that's life.   If your goal is 
performance, then you need to put your money on the table. The 
system will be somewhat unreliable because of the device count, 
additional SCSI buses, etc., but that too is life in the high 
performance world.

-- Alan
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread John A Meinel

Alex Turner wrote:
[snip]

Adding drives will not let you get lower response times than the average seek
time on your drives*. But it will let you reach that response time more often.

[snip]
I believe your assertion is fundamentaly flawed.  Adding more drives
will not let you reach that response time more often.  All drives are
required to fill every request in all RAID levels (except possibly
0+1, but that isn't used for enterprise applicaitons).
Actually 0+1 is the recommended configuration for postgres databases
(both for xlog and for the bulk data), because the write speed of RAID5
is quite poor.
Hence you base assumption is not correct, and adding drives *does* help.
Most requests
in OLTP require most of the request time to seek, not to read.  Only
in single large block data transfers will you get any benefit from
adding more drives, which is atypical in most database applications.
For most database applications, the only way to increase
transactions/sec is to decrease request service time, which is
generaly achieved with better seek times or a better controller card,
or possibly spreading your database accross multiple tablespaces on
seperate paritions.

This is probably true. However, if you are doing lots of concurrent
connections, and things are properly spread across multiple spindles
(using RAID0+1, or possibly tablespaces across multiple raids).
Then each seek occurs on a separate drive, which allows them to occur at
the same time, rather than sequentially. Having 2 processes competing
for seeking on the same drive is going to be worse than having them on
separate drives.
John
=:-


signature.asc
Description: OpenPGP digital signature

Re: [PERFORM] How to improve db performance with $7K?

Hi,
At 16:59 18/04/2005, Greg Stark wrote:
William Yu [EMAIL PROTECTED] writes:
 Using the above prices for a fixed budget for RAID-10, you could get:

 SATA 7200 -- 680MB per $1000
 SATA 10K  -- 200MB per $1000
 SCSI 10K  -- 125MB per $1000
What a lot of these analyses miss is that cheaper == faster because cheaper
means you can buy more spindles for the same price. I'm assuming you picked
equal sized drives to compare so that 200MB/$1000 for SATA is almost twice as
many spindles as the 125MB/$1000. That means it would have almost double the
bandwidth. And the 7200 RPM case would have more than 5x the bandwidth.
While 10k RPM drives have lower seek times, and SCSI drives have a natural
seek time advantage, under load a RAID array with fewer spindles will start
hitting contention sooner which results into higher latency. If the controller
works well the larger SATA arrays above should be able to maintain their
mediocre latency much better under load than the SCSI array with fewer drives
would maintain its low latency response time despite its drives' lower average
seek time.
I would definitely agree. More factors in favor of more cheap drives:
- cheaper drives (7200 rpm) have larger disks (3.7 diameter against 2.6 or 
3.3). That means the outer tracks hold more data, and the same amount of 
data is held on a smaller area, which means less tracks, which means 
reduced seek times. You can roughly count the real average seek time as 
(average seek time over full disk * size of dataset / capacity of disk). 
And you actually need to physicall seek less often too.

- more disks means less data per disk, which means the data is further 
concentrated on outer tracks, which means even lower seek times

Also, what counts is indeed not so much the time it takes to do one single 
random seek, but the number of random seeks you can do per second. Hence, 
more disks means more seeks per second (if requests are evenly distributed 
among all disks, which a good stripe size should achieve).

Not taking into account TCQ/NCQ or write cache optimizations, the important 
parameter (random seeks per second) can be approximated as:

N * 1000 / (lat + seek * ds / (N * cap))
Where:
N is the number of disks
lat is the average rotational latency in milliseconds (500/(rpm/60))
seek is the average seek over the full disk in milliseconds
ds is the dataset size
cap is the capacity of each disk
Using this formula and a variety of disks, counting only the disks 
themselves (no enclosures, controllers, rack space, power, maintenance...), 
trying to maximize the number of seeks/second for a fixed budget (1000 
euros) with a dataset size of 100 GB makes SATA drives clear winners: you 
can get more than 4000 seeks/second (with 21 x 80GB disks) where SCSI 
cannot even make it to the 1400 seek/second point (with 8 x 36 GB disks). 
Results can vary quite a lot based on the dataset size, which illustrates 
the importance of staying on the edges of the disks. I'll try to make the 
analysis more complete by counting some of the overhead (obviously 21 
drives has a lot of other implications!), but I believe SATA drives still 
win in theory.

It would be interesting to actually compare this to real-world (or 
nearly-real-world) benchmarks to measure the effectiveness of features like 
TCQ/NCQ etc.

Jacques.
 


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread Steve Poe

Alex,
In the situation of the animal hospital server I oversee, their 
application is OLTP. Adding hard drives (6-8) does help performance. 
Benchmarks like pgbench and OSDB agree with it, but in reality users 
could not see noticeable change. However, moving the top 5/10 tables and 
indexes to their own space made a greater impact.

Someone who reads PostgreSQL 8.0 Performance Checklist is going to see 
point #1 add more disks is the key. How about adding a subpoint to 
explaining when more disks isn't enough or applicable? I maybe 
generalizing the complexity of tuning an OLTP application, but some 
clarity could help.

Steve Poe


---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org

Re: [PERFORM] How to improve db performance with $7K?

Not true - the recommended RAID level is RAID 10, not RAID 0+1 (at
least I would never recommend 1+0 for anything).

RAID 10 and RAID 0+1 are _quite_ different.  One gives you very good
redundancy, the other is only slightly better than RAID 5, but
operates faster in degraded mode (single drive).

Alex Turner
netEconomist

On 4/18/05, John A Meinel [EMAIL PROTECTED] wrote:
 Alex Turner wrote:
 
 [snip]
 
 
 Adding drives will not let you get lower response times than the average 
 seek
 time on your drives*. But it will let you reach that response time more 
 often.
 
 
 
 [snip]
 
 I believe your assertion is fundamentaly flawed.  Adding more drives
 will not let you reach that response time more often.  All drives are
 required to fill every request in all RAID levels (except possibly
 0+1, but that isn't used for enterprise applicaitons).
 
 Actually 0+1 is the recommended configuration for postgres databases
 (both for xlog and for the bulk data), because the write speed of RAID5
 is quite poor.
 Hence you base assumption is not correct, and adding drives *does* help.
 
 Most requests
 in OLTP require most of the request time to seek, not to read.  Only
 in single large block data transfers will you get any benefit from
 adding more drives, which is atypical in most database applications.
 For most database applications, the only way to increase
 transactions/sec is to decrease request service time, which is
 generaly achieved with better seek times or a better controller card,
 or possibly spreading your database accross multiple tablespaces on
 seperate paritions.
 
 
 This is probably true. However, if you are doing lots of concurrent
 connections, and things are properly spread across multiple spindles
 (using RAID0+1, or possibly tablespaces across multiple raids).
 Then each seek occurs on a separate drive, which allows them to occur at
 the same time, rather than sequentially. Having 2 processes competing
 for seeking on the same drive is going to be worse than having them on
 separate drives.
 John
 =:-
 
 


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] How to improve db performance with $7K?

I think the add more disks thing is really from the point of view that
one disk isn't enough ever.  You should really have at least four
drives configured into two RAID 1s.  Most DBAs will know this, but
most average Joes won't.

Alex Turner
netEconomist

On 4/18/05, Steve Poe [EMAIL PROTECTED] wrote:
 Alex,
 
 In the situation of the animal hospital server I oversee, their
 application is OLTP. Adding hard drives (6-8) does help performance.
 Benchmarks like pgbench and OSDB agree with it, but in reality users
 could not see noticeable change. However, moving the top 5/10 tables and
 indexes to their own space made a greater impact.
 
 Someone who reads PostgreSQL 8.0 Performance Checklist is going to see
 point #1 add more disks is the key. How about adding a subpoint to
 explaining when more disks isn't enough or applicable? I maybe
 generalizing the complexity of tuning an OLTP application, but some
 clarity could help.
 
 Steve Poe
 


---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] How to improve db performance with $7K?

Ok - well - I am partially wrong...

If you're stripe size is 64Kb, and you are reading 256k worth of data,
it will be spread across four drives, so you will need to read from
four devices to get your 256k of data (RAID 0 or 5 or 10), but if you
are only reading 64kb of data, I guess you would only need to read
from one disk.

So my assertion that adding more drives doesn't help is pretty
wrong... particularly with OLTP because it's always dealing with
blocks that are smaller that the stripe size.

Alex Turner
netEconomist

On 4/18/05, Jacques Caron [EMAIL PROTECTED] wrote:
 Hi,
 
 At 18:56 18/04/2005, Alex Turner wrote:
 All drives are required to fill every request in all RAID levels
 
 No, this is definitely wrong. In many cases, most drives don't actually
 have the data requested, how could they handle the request?
 
 When reading one random sector, only *one* drive out of N is ever used to
 service any given request, be it RAID 0, 1, 0+1, 1+0 or 5.
 
 When writing:
 - in RAID 0, 1 drive
 - in RAID 1, RAID 0+1 or 1+0, 2 drives
 - in RAID 5, you need to read on all drives and write on 2.
 
 Otherwise, what would be the point of RAID 0, 0+1 or 1+0?
 
 Jacques.
 


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] How to improve db performance with $7K?

So I wonder if one could take this stripe size thing further and say
that a larger stripe size is more likely to result in requests getting
served parallized across disks which would lead to increased
performance?

Again, thanks to all people on this list, I know that I have learnt a
_hell_ of alot since subscribing.

Alex Turner
netEconomist

On 4/18/05, Alex Turner [EMAIL PROTECTED] wrote:
 Ok - well - I am partially wrong...
 
 If you're stripe size is 64Kb, and you are reading 256k worth of data,
 it will be spread across four drives, so you will need to read from
 four devices to get your 256k of data (RAID 0 or 5 or 10), but if you
 are only reading 64kb of data, I guess you would only need to read
 from one disk.
 
 So my assertion that adding more drives doesn't help is pretty
 wrong... particularly with OLTP because it's always dealing with
 blocks that are smaller that the stripe size.
 
 Alex Turner
 netEconomist
 
 On 4/18/05, Jacques Caron [EMAIL PROTECTED] wrote:
  Hi,
 
  At 18:56 18/04/2005, Alex Turner wrote:
  All drives are required to fill every request in all RAID levels
 
  No, this is definitely wrong. In many cases, most drives don't actually
  have the data requested, how could they handle the request?
 
  When reading one random sector, only *one* drive out of N is ever used to
  service any given request, be it RAID 0, 1, 0+1, 1+0 or 5.
 
  When writing:
  - in RAID 0, 1 drive
  - in RAID 1, RAID 0+1 or 1+0, 2 drives
  - in RAID 5, you need to read on all drives and write on 2.
 
  Otherwise, what would be the point of RAID 0, 0+1 or 1+0?
 
  Jacques.
 
 


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread Greg Stark


Jacques Caron [EMAIL PROTECTED] writes:

 When writing:
 - in RAID 0, 1 drive
 - in RAID 1, RAID 0+1 or 1+0, 2 drives
 - in RAID 5, you need to read on all drives and write on 2.

Actually RAID 5 only really needs to read from two drives. The existing parity
block and the block you're replacing. It just xors the old block, the new
block, and the existing parity block to generate the new parity block.

-- 
greg


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread Joshua D. Drake

Alex Turner wrote:
Not true - the recommended RAID level is RAID 10, not RAID 0+1 (at
least I would never recommend 1+0 for anything).
Uhmm I was under the impression that 1+0 was RAID 10 and that 0+1 is NOT
RAID 10.
Ref: http://www.acnc.com/raid.html
Sincerely,
Joshua D. Drake

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] How to improve db performance with $7K?

Hi,
At 20:16 18/04/2005, Alex Turner wrote:
So my assertion that adding more drives doesn't help is pretty
wrong... particularly with OLTP because it's always dealing with
blocks that are smaller that the stripe size.
When doing random seeks (which is what a database needs most of the time), 
the number of disks helps improve the number of seeks per second (which is 
the bottleneck in this case). When doing sequential reads, the number of 
disks helps improve total throughput (which is the bottleneck in that case).

In short: in always helps :-)
Jacques.

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread Mohan, Ross

Don't you think optimal stripe width would be
a good question to research the binaries for? I'd
think that drives the answer, largely.  (uh oh, pun alert)

EG, oracle issues IO requests (this may have changed _just_ 
recently) in 64KB chunks, regardless of what you ask for. 
So when I did my striping (many moons ago, when the Earth 
was young...) I did it in 128KB widths, and set the oracle 
multiblock read count according. For oracle, any stripe size
under 64KB=stupid, anything much over 128K/258K=wasteful. 

I am eager to find out how PG handles all this. 


- Ross



p.s. Brooklyn thug accent 'You want a database record? I 
  gotcher record right here' http://en.wikipedia.org/wiki/Akashic_Records



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Alex Turner
Sent: Monday, April 18, 2005 2:21 PM
To: Jacques Caron
Cc: Greg Stark; William Yu; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] How to improve db performance with $7K?


So I wonder if one could take this stripe size thing further and say that a 
larger stripe size is more likely to result in requests getting served 
parallized across disks which would lead to increased performance?

Again, thanks to all people on this list, I know that I have learnt a _hell_ of 
alot since subscribing.

Alex Turner
netEconomist

On 4/18/05, Alex Turner [EMAIL PROTECTED] wrote:
 Ok - well - I am partially wrong...
 
 If you're stripe size is 64Kb, and you are reading 256k worth of data, 
 it will be spread across four drives, so you will need to read from 
 four devices to get your 256k of data (RAID 0 or 5 or 10), but if you 
 are only reading 64kb of data, I guess you would only need to read 
 from one disk.
 
 So my assertion that adding more drives doesn't help is pretty 
 wrong... particularly with OLTP because it's always dealing with 
 blocks that are smaller that the stripe size.
 
 Alex Turner
 netEconomist
 
 On 4/18/05, Jacques Caron [EMAIL PROTECTED] wrote:
  Hi,
 
  At 18:56 18/04/2005, Alex Turner wrote:
  All drives are required to fill every request in all RAID levels
 
  No, this is definitely wrong. In many cases, most drives don't 
  actually have the data requested, how could they handle the request?
 
  When reading one random sector, only *one* drive out of N is ever 
  used to service any given request, be it RAID 0, 1, 0+1, 1+0 or 5.
 
  When writing:
  - in RAID 0, 1 drive
  - in RAID 1, RAID 0+1 or 1+0, 2 drives
  - in RAID 5, you need to read on all drives and write on 2.
 
  Otherwise, what would be the point of RAID 0, 0+1 or 1+0?
 
  Jacques.
 
 


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] How to improve db performance with $7K?

Hi,
At 20:21 18/04/2005, Alex Turner wrote:
So I wonder if one could take this stripe size thing further and say
that a larger stripe size is more likely to result in requests getting
served parallized across disks which would lead to increased
performance?
Actually, it would be pretty much the opposite. The smaller the stripe 
size, the more evenly distributed data is, and the more disks can be used 
to serve requests. If your stripe size is too large, many random accesses 
within one single file (whose size is smaller than the stripe size/number 
of disks) may all end up on the same disk, rather than being split across 
multiple disks (the extreme case being stripe size = total size of all 
disks, which means concatenation). If all accesses had the same cost (i.e. 
no seek time, only transfer time), the ideal would be to have a stripe size 
equal to the number of disks.

But below a certain size, you're going to use multiple disks to serve one 
single request which would not have taken much more time from a single disk 
(reading even a large number of consecutive blocks within one cylinder does 
not take much more time than reading a single block), so you would add 
unnecessary seeks on a disk that could have served another request in the 
meantime. You should definitely not go below the filesystem block size or 
the database block size.

There is a interesting discussion of the optimal stripe size in the vinum 
manpage on FreeBSD:

http://www.freebsd.org/cgi/man.cgi?query=vinumapropos=0sektion=0manpath=FreeBSD+5.3-RELEASE+and+Portsformat=html
(look for Performance considerations, towards the end -- note however 
that some of the calculations are not entirely correct).

Basically it says the optimal stripe size is somewhere between 256KB and 
4MB, preferably an odd number, and that some hardware RAID controllers 
don't like big stripe sizes. YMMV, as always.

Jacques.

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match

Re: [PERFORM] How to improve db performance with $7K?

Mistype.. I meant 0+1 in the second instance :(


On 4/18/05, Joshua D. Drake [EMAIL PROTECTED] wrote:
 Alex Turner wrote:
  Not true - the recommended RAID level is RAID 10, not RAID 0+1 (at
  least I would never recommend 1+0 for anything).
 
 Uhmm I was under the impression that 1+0 was RAID 10 and that 0+1 is NOT
 RAID 10.
 
 Ref: http://www.acnc.com/raid.html
 
 Sincerely,
 
 Joshua D. Drake
 
 
  ---(end of broadcast)---
  TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly
 


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [PERFORM] How to improve db performance with $7K?

On 4/18/05, Jacques Caron [EMAIL PROTECTED] wrote:
 Hi,
 
 At 20:21 18/04/2005, Alex Turner wrote:
 So I wonder if one could take this stripe size thing further and say
 that a larger stripe size is more likely to result in requests getting
 served parallized across disks which would lead to increased
 performance?
 
 Actually, it would be pretty much the opposite. The smaller the stripe
 size, the more evenly distributed data is, and the more disks can be used
 to serve requests. If your stripe size is too large, many random accesses
 within one single file (whose size is smaller than the stripe size/number
 of disks) may all end up on the same disk, rather than being split across
 multiple disks (the extreme case being stripe size = total size of all
 disks, which means concatenation). If all accesses had the same cost (i.e.
 no seek time, only transfer time), the ideal would be to have a stripe size
 equal to the number of disks.
 
[snip]

Ahh yes - but the critical distinction is this:
The smaller the stripe size, the more disks will be used to serve _a_
request - which is bad for OLTP because you want fewer disks per
request so that you can have more requests per second because the cost
is mostly seek.  If more than one disk has to seek to serve a single
request, you are preventing that disk from serving a second request at
the same time.

To have more throughput in MB/sec, you want a smaller stripe size so
that you have more disks serving a single request allowing you to
multiple by effective drives to get total bandwidth.

Because OLTP is made up of small reads and writes to a small number of
different files, I would guess that you want those files split up
across your RAID, but not so much that a single small read or write
operation would traverse more than one disk.   That would infer that
your optimal stripe size is somewhere on the right side of the bell
curve that represents your database read and write block count
distribution.  If on average the dbwritter never flushes less than 1MB
to disk at a time, then I guess your best stripe size would be 1MB,
but that seems very large to me.

So I think therefore that I may be contending the exact opposite of
what you are postulating!

Alex Turner
netEconomist

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread Bruce Momjian

Kevin Brown wrote:
 Greg Stark wrote:
 
 
  I think you're being misled by analyzing the write case.
  
  Consider the read case. When a user process requests a block and
  that read makes its way down to the driver level, the driver can't
  just put it aside and wait until it's convenient. It has to go ahead
  and issue the read right away.
 
 Well, strictly speaking it doesn't *have* to.  It could delay for a
 couple of milliseconds to see if other requests come in, and then
 issue the read if none do.  If there are already other requests being
 fulfilled, then it'll schedule the request in question just like the
 rest.

The idea with SCSI or any command queuing is that you don't have to wait
for another request to come in --- you can send the request as it
arrives, then if another shows up, you send that too, and the drive
optimizes the grouping at a later time, knowing what the drive is doing,
rather queueing in the kernel.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] How to improve db performance with $7K?

Does it really matter at which end of the cable the queueing is done
(Assuming both ends know as much about drive geometry etc..)?

Alex Turner
netEconomist

On 4/18/05, Bruce Momjian pgman@candle.pha.pa.us wrote:
 Kevin Brown wrote:
  Greg Stark wrote:
 
 
   I think you're being misled by analyzing the write case.
  
   Consider the read case. When a user process requests a block and
   that read makes its way down to the driver level, the driver can't
   just put it aside and wait until it's convenient. It has to go ahead
   and issue the read right away.
 
  Well, strictly speaking it doesn't *have* to.  It could delay for a
  couple of milliseconds to see if other requests come in, and then
  issue the read if none do.  If there are already other requests being
  fulfilled, then it'll schedule the request in question just like the
  rest.
 
 The idea with SCSI or any command queuing is that you don't have to wait
 for another request to come in --- you can send the request as it
 arrives, then if another shows up, you send that too, and the drive
 optimizes the grouping at a later time, knowing what the drive is doing,
 rather queueing in the kernel.
 
 --
   Bruce Momjian|  http://candle.pha.pa.us
   pgman@candle.pha.pa.us   |  (610) 359-1001
   +  If your life is a hard drive, |  13 Roberts Road
   +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
 
 ---(end of broadcast)---
 TIP 9: the planner will ignore your desire to choose an index scan if your
   joining column's datatypes do not match


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread Alvaro Herrera

On Mon, Apr 18, 2005 at 06:49:44PM -0400, Alex Turner wrote:
 Does it really matter at which end of the cable the queueing is done
 (Assuming both ends know as much about drive geometry etc..)?

That is a pretty strong assumption, isn't it?  Also you seem to be
assuming that the controller-disk protocol (some internal, unknown to
mere mortals, mechanism) is equally powerful than the host-controller
(SATA, SCSI, etc).

I'm lost whether this thread is about what is possible with current,
in-market technology, or about what could in theory be possible [if you
were to design open source disk controllers and disks.]

-- 
Alvaro Herrera ([EMAIL PROTECTED])
La fuerza no está en los medios físicos
sino que reside en una voluntad indomable (Gandhi)

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread Matthew Nuzum

On 4/14/05, Tom Lane [EMAIL PROTECTED] wrote:
 
 That's basically what it comes down to: SCSI lets the disk drive itself
 do the low-level I/O scheduling whereas the ATA spec prevents the drive
 from doing so (unless it cheats, ie, caches writes).  Also, in SCSI it's
 possible for the drive to rearrange reads as well as writes --- which
 AFAICS is just not possible in ATA.  (Maybe in the newest spec...)
 
 The reason this is so much more of a win than it was when ATA was
 designed is that in modern drives the kernel has very little clue about
 the physical geometry of the disk.  Variable-size tracks, bad-block
 sparing, and stuff like that make for a very hard-to-predict mapping
 from linear sector addresses to actual disk locations.  Combine that
 with the fact that the drive controller can be much smarter than it was
 twenty years ago, and you can see that the case for doing I/O scheduling
 in the kernel and not in the drive is pretty weak.
   
 

So if you all were going to choose between two hard drives where:
drive A has capacity C and spins at 15K rpms, and
drive B has capacity 2 x C and spins at 10K rpms and
all other features are the same, the price is the same and C is enough
disk space which would you choose?

I've noticed that on IDE drives, as the capacity increases the data
density increases and there is a pereceived (I've not measured it)
performance increase.

Would the increased data density of the higher capacity drive be of
greater benefit than the faster spindle speed of drive A?

-- 
Matthew Nuzum
www.bearfruit.org

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] How to improve db performance with $7K?

2005-04-15 Thread PFC


My argument is that a sufficiently smart kernel scheduler *should*
yield performance results that are reasonably close to what you can
get with that feature.  Perhaps not quite as good, but reasonably
close.  It shouldn't be an orders-of-magnitude type difference.
	And a controller card (or drive) has a lot less RAM to use as a cache /  
queue for reordering stuff than the OS has, potentially the OS can us most  
of the available RAM, which can be gigabytes on a big server, whereas in  
the drive there are at most a few tens of megabytes...

	However all this is a bit looking at the problem through the wrong end.  
The OS should provide a multi-read call for the applications to pass a  
list of blocks they'll need, then reorder them and read them the fastest  
possible way, clustering them with similar requests from other threads.

	Right now when a thread/process issues a read() it will block until the  
block is delivered to this thread. The OS does not know if this thread  
will then need the next block (which can be had very cheaply if you know  
ahead of time you'll need it) or not. Thus it must make guesses, read  
ahead (sometimes), etc...

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly

Re: [PERFORM] How to improve db performance with $7K?

2005-04-15 Thread PFC


platter compared to the rotational speed, which would agree with the
fact that you can read 70MB/sec, but it takes up to 13ms to seek.
	Actually :
	- the head has to be moved
	this time depends on the distance, for instance moving from a cylinder to  
the next is very fast (it needs to, to get good throughput)
	- then you have to wait for the disk to spin until the information you  
want comes in front of the head... statistically you have to wait a half  
rotation. And this does not depend on the distance between the cylinders,  
it depends on the position of the data in the cylinder.
	The more RPMs you have, the less you wait, which is why faster RPMs  
drives have faster seek (they must also have faster actuators to move the  
head)...

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] How to improve db performance with $7K?

2005-04-15 Thread Alan Stange

PFC wrote:

My argument is that a sufficiently smart kernel scheduler *should*
yield performance results that are reasonably close to what you can
get with that feature.  Perhaps not quite as good, but reasonably
close.  It shouldn't be an orders-of-magnitude type difference.

And a controller card (or drive) has a lot less RAM to use as a 
cache /  queue for reordering stuff than the OS has, potentially the 
OS can us most  of the available RAM, which can be gigabytes on a big 
server, whereas in  the drive there are at most a few tens of 
megabytes...

However all this is a bit looking at the problem through the wrong 
end.  The OS should provide a multi-read call for the applications to 
pass a  list of blocks they'll need, then reorder them and read them 
the fastest  possible way, clustering them with similar requests from 
other threads.

Right now when a thread/process issues a read() it will block 
until the  block is delivered to this thread. The OS does not know if 
this thread  will then need the next block (which can be had very 
cheaply if you know  ahead of time you'll need it) or not. Thus it 
must make guesses, read  ahead (sometimes), etc...
All true.  Which is why high performance computing folks use 
aio_read()/aio_write() and load up the kernel with all the requests they 
expect to make. 

The kernels that I'm familiar with will do read ahead on files based on 
some heuristics:  when you read the first byte of a file the OS will 
typically load up several pages of the file (depending on file size, 
etc).  If you continue doing read() calls without a seek() on the file 
descriptor the kernel will get the hint that you're doing a sequential 
read and continue caching up the pages ahead of time, usually using the 
pages you just read to hold the new data so that one isn't bloating out 
memory with data that won't be needed again.  Throw in a seek() and the 
amount of read ahead caching may be reduced.

One point that is being missed in all this discussion is that the file 
system also imposes some constraints on how IO's can be done.  For 
example, simply doing a write(fd, buf, 1) doesn't emit a stream 
of sequential blocks to the drives.  Some file systems (UFS was one) 
would force portions of large files into other cylinder groups so that 
small files could be located near the inode data, thus avoiding/reducing 
the size of seeks.  Similarly, extents need to be allocated and the 
bitmaps recording this data usually need synchronous updates, which will 
require some seeks, etc.  Not to mention the need to update inode data, 
etc.  Anyway, my point is that the allocation policies of the file 
system can confuse the situation.

Also, the seek times one sees reported are an average.  One really needs 
to look at the track-to-track seek time and also the full stoke seek 
times.   It takes a *long* time to move the heads across the whole 
platter.  I've seen people partition drives to only use small regions of 
the drives to avoid long seeks and to better use the increased number of 
bits going under the head in one rotation.   A 15K drive doesn't need to 
have a faster seek time than a 10K drive because the rotational speed is 
higher.  The average seek time might be faster just because the 15K 
drives are smaller with fewer number of cylinders. 

-- Alan
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly

Re: [PERFORM] How to improve db performance with $7K?

2005-04-15 Thread Vivek Khera

On Apr 14, 2005, at 10:03 PM, Kevin Brown wrote:
Now, bad block remapping destroys that guarantee, but unless you've
got a LOT of bad blocks, it shouldn't destroy your performance, right?
ALL disks have bad blocks, even when you receive them.  you honestly 
think that these large disks made today (18+ GB is the smallest now) 
that there are no defects on the surfaces?

/me remembers trying to cram an old donated 5MB (yes M) disk into an 
old 8088 Zenith PC in college...

Vivek Khera, Ph.D.
+1-301-869-4449 x806


smime.p7s
Description: S/MIME cryptographic signature

Re: [PERFORM] How to improve db performance with $7K?

2005-04-15 Thread Joshua D. Drake

Vivek Khera wrote:
On Apr 14, 2005, at 10:03 PM, Kevin Brown wrote:
Now, bad block remapping destroys that guarantee, but unless you've
got a LOT of bad blocks, it shouldn't destroy your performance, right?
ALL disks have bad blocks, even when you receive them.  you honestly 
think that these large disks made today (18+ GB is the smallest now) 
that there are no defects on the surfaces?
That is correct. It is just that the HD makers will mark the bad blocks
so that the OS knows not to use them. You can also run the bad blocks
command to try and find new bad blocks.
Over time hard drives get bad blocks. It doesn't always mean you have to 
replace the drive but it does mean you need to maintain it and usually
at least backup, low level (if scsi) and mark bad blocks. Then restore.

Sincerely,
Joshua D. Drake

/me remembers trying to cram an old donated 5MB (yes M) disk into an old 
8088 Zenith PC in college...

Vivek Khera, Ph.D.
+1-301-869-4449 x806

--
Your PostgreSQL solutions provider, Command Prompt, Inc.
24x7 support - 1.800.492.2240, programming, and consulting
Home of PostgreSQL Replicator, plPHP, plPerlNG and pgPHPToolkit
http://www.commandprompt.com / http://www.postgresql.org
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match

Re: [PERFORM] How to improve db performance with $7K?

2005-04-15 Thread Vivek Khera

On Apr 15, 2005, at 11:58 AM, Joshua D. Drake wrote:
ALL disks have bad blocks, even when you receive them.  you honestly 
think that these large disks made today (18+ GB is the smallest now) 
that there are no defects on the surfaces?
That is correct. It is just that the HD makers will mark the bad blocks
so that the OS knows not to use them. You can also run the bad blocks
command to try and find new bad blocks.
my point was that you cannot assume an linear correlation between block 
number and physical location, since the bad blocks will be mapped all 
over the place.

Vivek Khera, Ph.D.
+1-301-869-4449 x806


smime.p7s
Description: S/MIME cryptographic signature

Re: [PERFORM] How to improve db performance with $7K?

2005-04-15 Thread Mohan, Ross

Greg, et al. 

I never found any evidence of a stop and get an intermediate request
functionality in the TCQ protocol. 

IIRC, what is there is

1) Ordered
2) Head First
3) Simple

implemented as choices. *VERY* roughly, that'd be like
(1) disk subsystem satisfies requests as submitted, (2) let's
the this request be put at the very head of the per se disk
queue after the currently-running disk request is complete, and
(3) is let the per se disk and it's software reorder the requests
on-hand as per it's onboard software.  (N.B. in the last, it's
the DISK not the controller making those decisions). (N.B. too, that
this last is essentially what NCQ (cf. TCQ) is doing )

I know we've been batting around a hypothetical case of SCSI
where it stops and gets smth. on the way, but I can find
no proof (yet) that this is done, pro forma, by SCSI drives. 

In other words, SCSI is a necessary, but not sufficient cause
for intermediate reading. 

FWIW

- Ross

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Greg Stark
Sent: Friday, April 15, 2005 2:02 PM
To: Tom Lane
Cc: Kevin Brown; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] How to improve db performance with $7K?


Tom Lane [EMAIL PROTECTED] writes:

 Yes, you can probably assume that blocks with far-apart numbers are 
 going to require a big seek, and you might even be right in supposing 
 that a block with an intermediate number should be read on the way. 
 But you have no hope at all of making the right decisions at a more 
 local level --- say, reading various sectors within the same cylinder 
 in an optimal fashion.  You don't know where the track boundaries are, 
 so you can't schedule in a way that minimizes rotational latency. 
 You're best off to throw all the requests at the drive together and 
 let the drive sort it out.

Consider for example three reads, one at the beginning of the disk, one at the 
very end, and one in the middle. If the three are performed in the logical 
order (assuming the head starts at the beginning), then the drive has to seek, 
say, 4ms to get to the middle and 4ms to get to the end.

But if the middle block requires a full rotation to reach it from when the head 
arrives that adds another 8ms of rotational delay (assuming a 7200RPM drive).

Whereas the drive could have seeked over to the last block, then seeked back in 
8ms and gotten there just in time to perform the read for free.


I'm not entirely convinced this explains all of the SCSI drives' superior 
performance though. The above is about a worst-case scenario. should really 
only have a small effect, and it's not like the drive firmware can really 
schedule things perfectly either.


I think most of the difference is that the drive manufacturers just don't 
package their high end drives with ATA interfaces. So there are no 10k RPM ATA 
drives and no 15k RPM ATA drives. I think WD is making fast SATA drives but 
most of the manufacturers aren't even doing that.

-- 
greg


---(end of broadcast)---
TIP 8: explain analyze is your friend

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] How to improve db performance with $7K?

2005-04-15 Thread Kevin Brown

Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  In the case of pure random reads, you'll end up having to wait an
  average of half of a rotation before beginning the read.
 
 You're assuming the conclusion.  The above is true if the disk is handed
 one request at a time by a kernel that doesn't have any low-level timing
 information.  If there are multiple random requests on the same track,
 the drive has an opportunity to do better than that --- if it's got all
 the requests in hand.

True, but see below.  Actually, I suspect what matters is if they're
on the same cylinder (which may be what you're talking about here).
And in the above, I was assuming randomly distributed single-sector
reads.  In that situation, we can't generically know what the
probability that more than one will appear on the same cylinder
without knowing something about the drive geometry.


That said, most modern drives have tens of thousands of cylinders (the
Seagate ST380011a, an 80 gigabyte drive, has 94,600 tracks per inch
according to its datasheet), but much, much smaller queue lengths
(tens of entries, hundreds at most, I'd expect.  Hard data on this
would be appreciated).  For purely random reads, the probability that
two or more requests in the queue happen to be in the same cylinder is
going to be quite small.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] How to improve db performance with $7K?

2005-04-15 Thread Kevin Brown

Vivek Khera wrote:
 
 On Apr 14, 2005, at 10:03 PM, Kevin Brown wrote:
 
 Now, bad block remapping destroys that guarantee, but unless you've
 got a LOT of bad blocks, it shouldn't destroy your performance, right?
 
 
 ALL disks have bad blocks, even when you receive them.  you honestly 
 think that these large disks made today (18+ GB is the smallest now) 
 that there are no defects on the surfaces?

Oh, I'm not at all arguing that you won't have bad blocks.  My
argument is that the probability of any given block read or write
operation actually dealing with a remapped block is going to be
relatively small, unless the fraction of bad blocks to total blocks is
large (in which case you basically have a bad disk).  And so the
ability to account for remapped blocks shouldn't itself represent a
huge improvement in overall throughput.




-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [PERFORM] How to improve db performance with $7K?

Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  In any case the issue with the IDE protocol is that fundamentally you
  can only have a single command pending. SCSI can have many commands
  pending.
 
 That's the bottom line: the SCSI protocol was designed (twenty years ago!)
 to allow the drive to do physical I/O scheduling, because the CPU can
 issue multiple commands before the drive has to report completion of the
 first one.  IDE isn't designed to do that.  I understand that the latest
 revisions to the IDE/ATA specs allow the drive to do this sort of thing,
 but support for it is far from widespread.

My question is: why does this (physical I/O scheduling) seem to matter
so much?

Before you flame me for asking a terribly idiotic question, let me
provide some context.

The operating system maintains a (sometimes large) buffer cache, with
each buffer being mapped to a physical (which in the case of RAID is
really a virtual) location on the disk.  When the kernel needs to
flush the cache (e.g., during a sync(), or when it needs to free up
some pages), it doesn't write the pages in memory address order, it
writes them in *device* address order.  And it, too, maintains a queue
of disk write requests.

Now, unless some of the blocks on the disk are remapped behind the
scenes such that an ordered list of blocks in the kernel translates to
an out of order list on the target disk (which should be rare, since
such remapping usually happens only when the target block is bad), how
can the fact that the disk controller doesn't do tagged queuing
*possibly* make any real difference unless the kernel's disk
scheduling algorithm is suboptimal?  In fact, if the kernel's
scheduling algorithm is close to optimal, wouldn't the disk queuing
mechanism *reduce* the overall efficiency of disk writes?  After all,
the kernel's queue is likely to be much larger than the disk
controller's, and the kernel has knowledge of things like the
filesystem layout that the disk controller and disks do not have.  If
the controller is only able to execute a subset of the write commands
that the kernel has in its queue, at the very least the controller may
end up leaving the head(s) in a suboptimal position relative to the
next set of commands that it hasn't received yet, unless it simply
writes the blocks in the order it receives it, right (admittedly, this
is somewhat trivially dealt with by having the controller exclude the
first and last blocks in the request from its internal sort).


I can see how you might configure the RAID controller so that the
kernel's scheduling algorithm will screw things up horribly.  For
instance, if the controller has several RAID volumes configured in
such a way that the volumes share spindles, the kernel isn't likely to
know about that (since each volume appears as its own device), so
writes to multiple volumes can cause head movement where the kernel
might be treating the volumes as completely independent.  But that
just means that you can't be dumb about how you configure your RAID
setup.


So what gives?  Given the above, why is SCSI so much more efficient
than plain, dumb SATA?  And why wouldn't you be much better off with a
set of dumb controllers in conjunction with (kernel-level) software
RAID?


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] How to improve db performance with $7K?

2005-04-14 Thread Greg Stark


Kevin Brown [EMAIL PROTECTED] writes:

 My question is: why does this (physical I/O scheduling) seem to matter
 so much?
 
 Before you flame me for asking a terribly idiotic question, let me
 provide some context.
 
 The operating system maintains a (sometimes large) buffer cache, with
 each buffer being mapped to a physical (which in the case of RAID is
 really a virtual) location on the disk.  When the kernel needs to
 flush the cache (e.g., during a sync(), or when it needs to free up
 some pages), it doesn't write the pages in memory address order, it
 writes them in *device* address order.  And it, too, maintains a queue
 of disk write requests.

I think you're being misled by analyzing the write case.

Consider the read case. When a user process requests a block and that read
makes its way down to the driver level, the driver can't just put it aside and
wait until it's convenient. It has to go ahead and issue the read right away.

In the 10ms or so that it takes to seek to perform that read *nothing* gets
done. If the driver receives more read or write requests it just has to sit on
them and wait. 10ms is a lifetime for a computer. In that time dozens of other
processes could have been scheduled and issued reads of their own.

If any of those requests would have lied on the intervening tracks the drive
missed a chance to execute them. Worse, it actually has to backtrack to get to
them meaning another long seek.

The same thing would happen if you had lots of processes issuing lots of small
fsynced writes all over the place. Postgres doesn't really do that though. It
sort of does with the WAL logs, but that shouldn't cause a lot of seeking.
Perhaps it would mean that having your WAL share a spindle with other parts of
the OS would have a bigger penalty on IDE drives than on SCSI drives though?

-- 
greg


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] How to improve db performance with $7K?

Greg Stark wrote:


 I think you're being misled by analyzing the write case.
 
 Consider the read case. When a user process requests a block and
 that read makes its way down to the driver level, the driver can't
 just put it aside and wait until it's convenient. It has to go ahead
 and issue the read right away.

Well, strictly speaking it doesn't *have* to.  It could delay for a
couple of milliseconds to see if other requests come in, and then
issue the read if none do.  If there are already other requests being
fulfilled, then it'll schedule the request in question just like the
rest.

 In the 10ms or so that it takes to seek to perform that read
 *nothing* gets done. If the driver receives more read or write
 requests it just has to sit on them and wait. 10ms is a lifetime for
 a computer. In that time dozens of other processes could have been
 scheduled and issued reads of their own.

This is true, but now you're talking about a situation where the
system goes from an essentially idle state to one of furious
activity.  In other words, it's a corner case that I strongly suspect
isn't typical in situations where SCSI has historically made a big
difference.

Once the first request has been fulfilled, the driver can now schedule
the rest of the queued-up requests in disk-layout order.


I really don't see how this is any different between a system that has
tagged queueing to the disks and one that doesn't.  The only
difference is where the queueing happens.  In the case of SCSI, the
queueing happens on the disks (or at least on the controller).  In the
case of SATA, the queueing happens in the kernel.

I suppose the tagged queueing setup could begin the head movement and,
if another request comes in that requests a block on a cylinder
between where the head currently is and where it's going, go ahead and
read the block in question.  But is that *really* what happens in a
tagged queueing system?  It's the only major advantage I can see it
having.


 The same thing would happen if you had lots of processes issuing
 lots of small fsynced writes all over the place. Postgres doesn't
 really do that though. It sort of does with the WAL logs, but that
 shouldn't cause a lot of seeking.  Perhaps it would mean that having
 your WAL share a spindle with other parts of the OS would have a
 bigger penalty on IDE drives than on SCSI drives though?

Perhaps.

But I rather doubt that has to be a huge penalty, if any.  When a
process issues an fsync (or even a sync), the kernel doesn't *have* to
drop everything it's doing and get to work on it immediately.  It
could easily gather a few more requests, bundle them up, and then
issue them.  If there's a lot of disk activity, it's probably smart to
do just that.  All fsync and sync require is that the caller block
until the data hits the disk (from the point of view of the kernel).
The specification doesn't require that the kernel act on the calls
immediately or write only the blocks referred to by the call in
question.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] How to improve db performance with $7K?

Kevin Brown [EMAIL PROTECTED] writes:
 I really don't see how this is any different between a system that has
 tagged queueing to the disks and one that doesn't.  The only
 difference is where the queueing happens.  In the case of SCSI, the
 queueing happens on the disks (or at least on the controller).  In the
 case of SATA, the queueing happens in the kernel.

That's basically what it comes down to: SCSI lets the disk drive itself
do the low-level I/O scheduling whereas the ATA spec prevents the drive
from doing so (unless it cheats, ie, caches writes).  Also, in SCSI it's
possible for the drive to rearrange reads as well as writes --- which
AFAICS is just not possible in ATA.  (Maybe in the newest spec...)

The reason this is so much more of a win than it was when ATA was
designed is that in modern drives the kernel has very little clue about
the physical geometry of the disk.  Variable-size tracks, bad-block
sparing, and stuff like that make for a very hard-to-predict mapping
from linear sector addresses to actual disk locations.  Combine that
with the fact that the drive controller can be much smarter than it was
twenty years ago, and you can see that the case for doing I/O scheduling
in the kernel and not in the drive is pretty weak.

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] How to improve db performance with $7K?

2005-04-14 Thread Rosser Schwarz

while you weren't looking, Kevin Brown wrote:

[reordering bursty reads]

 In other words, it's a corner case that I strongly suspect
 isn't typical in situations where SCSI has historically made a big
 difference.

[...]

 But I rather doubt that has to be a huge penalty, if any.  When a
 process issues an fsync (or even a sync), the kernel doesn't *have* to
 drop everything it's doing and get to work on it immediately.  It
 could easily gather a few more requests, bundle them up, and then
 issue them.

To make sure I'm following you here, are you or are you not suggesting
that the kernel could sit on -all- IO requests for some small handful
of ms before actually performing any IO to address what you strongly
suspect is a corner case?

/rls

-- 
:wq

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] How to improve db performance with $7K?

2005-04-14 Thread Mohan, Ross

Imagine a system in furious activity with two (2) process regularly occuring

Process One:  Long read (or write). Takes 20ms to do seek, latency, and 
stream off. Runs over and over. 
Process Two:  Single block read ( or write ). Typical database row access. 
Optimally, could be submillisecond. happens more or less 
randomly. 


Let's say process one starts, and then process two. Assume, for sake of this 
discussion, 
that P2's block lies w/in P1's swath. (But doesn't have to...)

Now, everytime process two has to wait at LEAST 20ms to complete. In a 
queue-reordering
system, it could be a lot faster. And me, looking for disk service times on P2, 
keep
wondering why does a single diskblock read keep taking 20ms?


Sit doesn't need to be a read or a write. It doesn't need to be 
furious activity
(two processes is not furious, even for a single user desktop.)  This is not a 
corner case, 
and while it doesn't take into account kernel/drivecache/UBC buffering issues, 
I think it
shines a light on why command re-ordering might be useful. shrug 

YMMV. 



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Kevin Brown
Sent: Thursday, April 14, 2005 4:36 AM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] How to improve db performance with $7K?


Greg Stark wrote:


 I think you're being misled by analyzing the write case.
 
 Consider the read case. When a user process requests a block and that 
 read makes its way down to the driver level, the driver can't just put 
 it aside and wait until it's convenient. It has to go ahead and issue 
 the read right away.

Well, strictly speaking it doesn't *have* to.  It could delay for a couple of 
milliseconds to see if other requests come in, and then issue the read if none 
do.  If there are already other requests being fulfilled, then it'll schedule 
the request in question just like the rest.

 In the 10ms or so that it takes to seek to perform that read
 *nothing* gets done. If the driver receives more read or write 
 requests it just has to sit on them and wait. 10ms is a lifetime for a 
 computer. In that time dozens of other processes could have been 
 scheduled and issued reads of their own.

This is true, but now you're talking about a situation where the system goes 
from an essentially idle state to one of furious activity.  In other words, 
it's a corner case that I strongly suspect isn't typical in situations where 
SCSI has historically made a big difference.

Once the first request has been fulfilled, the driver can now schedule the rest 
of the queued-up requests in disk-layout order.


I really don't see how this is any different between a system that has tagged 
queueing to the disks and one that doesn't.  The only difference is where the 
queueing happens.  In the case of SCSI, the queueing happens on the disks (or 
at least on the controller).  In the case of SATA, the queueing happens in the 
kernel.

I suppose the tagged queueing setup could begin the head movement and, if 
another request comes in that requests a block on a cylinder between where the 
head currently is and where it's going, go ahead and read the block in 
question.  But is that *really* what happens in a tagged queueing system?  It's 
the only major advantage I can see it having.


 The same thing would happen if you had lots of processes issuing lots 
 of small fsynced writes all over the place. Postgres doesn't really do 
 that though. It sort of does with the WAL logs, but that shouldn't 
 cause a lot of seeking.  Perhaps it would mean that having your WAL 
 share a spindle with other parts of the OS would have a bigger penalty 
 on IDE drives than on SCSI drives though?

Perhaps.

But I rather doubt that has to be a huge penalty, if any.  When a process 
issues an fsync (or even a sync), the kernel doesn't *have* to drop everything 
it's doing and get to work on it immediately.  It could easily gather a few 
more requests, bundle them up, and then issue them.  If there's a lot of disk 
activity, it's probably smart to do just that.  All fsync and sync require is 
that the caller block until the data hits the disk (from the point of view of 
the kernel). The specification doesn't require that the kernel act on the calls 
immediately or write only the blocks referred to by the call in question.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] How to improve db performance with $7K?

2005-04-14 Thread Matthew Nuzum

On 4/14/05, Tom Lane [EMAIL PROTECTED] wrote:

 That's basically what it comes down to: SCSI lets the disk drive itself
 do the low-level I/O scheduling whereas the ATA spec prevents the drive
 from doing so (unless it cheats, ie, caches writes).  Also, in SCSI it's
 possible for the drive to rearrange reads as well as writes --- which
 AFAICS is just not possible in ATA.  (Maybe in the newest spec...)

 The reason this is so much more of a win than it was when ATA was
 designed is that in modern drives the kernel has very little clue about
 the physical geometry of the disk.  Variable-size tracks, bad-block
 sparing, and stuff like that make for a very hard-to-predict mapping
 from linear sector addresses to actual disk locations.  Combine that
 with the fact that the drive controller can be much smarter than it was
 twenty years ago, and you can see that the case for doing I/O scheduling
 in the kernel and not in the drive is pretty weak.



So if you all were going to choose between two hard drives where:
drive A has capacity C and spins at 15K rpms, and
drive B has capacity 2 x C and spins at 10K rpms and
all other features are the same, the price is the same and C is enough
disk space which would you choose?

I've noticed that on IDE drives, as the capacity increases the data
density increases and there is a pereceived (I've not measured it)
performance increase.

Would the increased data density of the higher capacity drive be of
greater benefit than the faster spindle speed of drive A?

-- 
Matthew Nuzum [EMAIL PROTECTED]
www.followers.net - Makers of Elite Content Management System
View samples of Elite CMS in action by visiting
http://www.followers.net/portfolio/




---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] How to improve db performance with $7K?

2005-04-14 Thread Greg Stark

Matthew Nuzum [EMAIL PROTECTED] writes:

 drive A has capacity C and spins at 15K rpms, and
 drive B has capacity 2 x C and spins at 10K rpms and
 all other features are the same, the price is the same and C is enough
 disk space which would you choose?

In this case you always choose the 15k RPM drive, at least for Postgres.
The 15kRPM reduces the latency which improves performance when fsyncing
transaction commits.

The real question is whether you choose the single 15kRPM drive or additional
drives at 10kRPM... Additional spindles would give a much bigger bandwidth
improvement but questionable latency improvement.

 Would the increased data density of the higher capacity drive be of
 greater benefit than the faster spindle speed of drive A?

actually a 2xC capacity drive probably just has twice as many platters which
means it would perform identically to the C capacity drive. If it has denser
platters that might improve performance slightly.


-- 
greg


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [PERFORM] How to improve db performance with $7K?

2005-04-14 Thread Greg Stark

Kevin Brown [EMAIL PROTECTED] writes:

 Greg Stark wrote:
 
 
  I think you're being misled by analyzing the write case.
  
  Consider the read case. When a user process requests a block and
  that read makes its way down to the driver level, the driver can't
  just put it aside and wait until it's convenient. It has to go ahead
  and issue the read right away.
 
 Well, strictly speaking it doesn't *have* to.  It could delay for a
 couple of milliseconds to see if other requests come in, and then
 issue the read if none do.  If there are already other requests being
 fulfilled, then it'll schedule the request in question just like the
 rest.

But then the cure is worse than the disease. You're basically describing
exactly what does happen anyways, only you're delaying more requests than
necessary. That intervening time isn't really idle, it's filled with all the
requests that were delayed during the previous large seek...

 Once the first request has been fulfilled, the driver can now schedule
 the rest of the queued-up requests in disk-layout order.
 
 I really don't see how this is any different between a system that has
 tagged queueing to the disks and one that doesn't.  The only
 difference is where the queueing happens.  

And *when* it happens. Instead of being able to issue requests while a large
seek is happening and having some of them satisfied they have to wait until
that seek is finished and get acted on during the next large seek.

If my theory is correct then I would expect bandwidth to be essentially
equivalent but the latency on SATA drives to be increased by about 50% of the
average seek time. Ie, while a busy SCSI drive can satisfy most requests in
about 10ms a busy SATA drive would satisfy most requests in 15ms. (add to that
that 10k RPM and 15kRPM SCSI drives have even lower seek times and no such
IDE/SATA drives exist...)

In reality higher latency feeds into a system feedback loop causing your
application to run slower causing bandwidth demands to be lower as well. It's
often hard to distinguish root causes from symptoms when optimizing complex
systems.

-- 
greg


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] How to improve db performance with $7K?

Matthew Nuzum [EMAIL PROTECTED] writes:
 So if you all were going to choose between two hard drives where:
 drive A has capacity C and spins at 15K rpms, and
 drive B has capacity 2 x C and spins at 10K rpms and
 all other features are the same, the price is the same and C is enough
 disk space which would you choose?

 I've noticed that on IDE drives, as the capacity increases the data
 density increases and there is a pereceived (I've not measured it)
 performance increase.

 Would the increased data density of the higher capacity drive be of
 greater benefit than the faster spindle speed of drive A?

Depends how they got the 2x capacity increase.  If they got it by
increased bit density --- same number of tracks, but more sectors
per track --- then drive B actually has a higher transfer rate,
because in one rotation it can transfer twice as much data as drive A.
More tracks per cylinder (ie, more platters) can also be a speed win
since you can touch more data before you have to seek to another
cylinder.  Drive B will lose if the 2x capacity was all from adding
cylinders (unless its seek-time spec is way better than A's ... which
is unlikely but not impossible, considering the cylinders are probably
closer together).

Usually there's some-of-each involved, so it's hard to make any
definite statement without more facts.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] How to improve db performance with $7K?

2005-04-14 Thread Joshua D. Drake

Steve Poe wrote:
If SATA drives don't have the ability to replace SCSI for a multi-user
I don't think it is a matter of not having the ability. SATA all in all 
is fine as long as
it is battery backed. It isn't as high performing as SCSI but who says 
it has to be?

There are plenty of companies running databases on SATA without issue. Would
I put it on a database that is expecting to have 500 connections at all 
times? No.
Then again, if you have an application with that requirement, you have 
the money
to buy a big fat SCSI array.

Sincerely,
Joshua D. Drake

Postgres apps, but you needed to save on cost (ALWAYS an issue), 
could/would you implement SATA for your logs (pg_xlog) and keep the 
rest on SCSI?

Steve Poe
Mohan, Ross wrote:
I've been doing some reading up on this, trying to keep up here, and 
have found out that (experts, just yawn and cover your ears)

1) some SATA drives (just type II, I think?) have a Phase Zero
   implementation of Tagged Command Queueing (the special sauce
   for SCSI).
2) This SATA TCQ is called NCQ and I believe it basically
   allows the disk software itself to do the reordering
   (this is called simple in TCQ terminology) It does not
   yet allow the TCQ head of queue command, allowing the
   current tagged request to go to head of queue, which is
   a simple way of manifesting a high priority request.
3) SATA drives are not yet multi-initiator?
Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives
are likely to whomp SATA II drives for a while yet (read: a
year or two) in multiuser PostGres applications.
-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Greg Stark
Sent: Thursday, April 14, 2005 2:04 PM
To: Kevin Brown
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] How to improve db performance with $7K?

Kevin Brown [EMAIL PROTECTED] writes:
 

Greg Stark wrote:
  

I think you're being misled by analyzing the write case.
Consider the read case. When a user process requests a block and 
that read makes its way down to the driver level, the driver can't 
just put it aside and wait until it's convenient. It has to go 
ahead and issue the read right away.

Well, strictly speaking it doesn't *have* to.  It could delay for a 
couple of milliseconds to see if other requests come in, and then 
issue the read if none do.  If there are already other requests 
being fulfilled, then it'll schedule the request in question just 
like the rest.
  

But then the cure is worse than the disease. You're basically 
describing exactly what does happen anyways, only you're delaying 
more requests than necessary. That intervening time isn't really 
idle, it's filled with all the requests that were delayed during the 
previous large seek...

 

Once the first request has been fulfilled, the driver can now 
schedule the rest of the queued-up requests in disk-layout order.

I really don't see how this is any different between a system that 
has tagged queueing to the disks and one that doesn't.  The only 
difference is where the queueing happens.
  

And *when* it happens. Instead of being able to issue requests while 
a large seek is happening and having some of them satisfied they have 
to wait until that seek is finished and get acted on during the next 
large seek.

If my theory is correct then I would expect bandwidth to be 
essentially equivalent but the latency on SATA drives to be increased 
by about 50% of the average seek time. Ie, while a busy SCSI drive 
can satisfy most requests in about 10ms a busy SATA drive would 
satisfy most requests in 15ms. (add to that that 10k RPM and 15kRPM 
SCSI drives have even lower seek times and no such IDE/SATA drives 
exist...)

In reality higher latency feeds into a system feedback loop causing 
your application to run slower causing bandwidth demands to be lower 
as well. It's often hard to distinguish root causes from symptoms 
when optimizing complex systems.

 


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if 
your
 joining column's datatypes do not match

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] How to improve db performance with $7K?

Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  I really don't see how this is any different between a system that has
  tagged queueing to the disks and one that doesn't.  The only
  difference is where the queueing happens.  In the case of SCSI, the
  queueing happens on the disks (or at least on the controller).  In the
  case of SATA, the queueing happens in the kernel.
 
 That's basically what it comes down to: SCSI lets the disk drive itself
 do the low-level I/O scheduling whereas the ATA spec prevents the drive
 from doing so (unless it cheats, ie, caches writes).  Also, in SCSI it's
 possible for the drive to rearrange reads as well as writes --- which
 AFAICS is just not possible in ATA.  (Maybe in the newest spec...)
 
 The reason this is so much more of a win than it was when ATA was
 designed is that in modern drives the kernel has very little clue about
 the physical geometry of the disk.  Variable-size tracks, bad-block
 sparing, and stuff like that make for a very hard-to-predict mapping
 from linear sector addresses to actual disk locations.  

Yeah, but it's not clear to me, at least, that this is a first-order
consideration.  A second-order consideration, sure, I'll grant that.

What I mean is that when it comes to scheduling disk activity,
knowledge of the specific physical geometry of the disk isn't really
important.  What's important is whether or not the disk conforms to a
certain set of expectations.  Namely, that the general organization is
such that addressing the blocks in block number order guarantees
maximum throughput.

Now, bad block remapping destroys that guarantee, but unless you've
got a LOT of bad blocks, it shouldn't destroy your performance, right?

 Combine that with the fact that the drive controller can be much
 smarter than it was twenty years ago, and you can see that the case
 for doing I/O scheduling in the kernel and not in the drive is
 pretty weak.

Well, I certainly grant that allowing the controller to do the I/O
scheduling is faster than having the kernel do it, as long as it can
handle insertion of new requests into the list while it's in the
middle of executing a request.  The most obvious case is when the head
is in motion and the new request can be satisfied by reading from the
media between where the head is at the time of the new request and
where the head is being moved to.

My argument is that a sufficiently smart kernel scheduler *should*
yield performance results that are reasonably close to what you can
get with that feature.  Perhaps not quite as good, but reasonably
close.  It shouldn't be an orders-of-magnitude type difference.



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] How to improve db performance with $7K?

2005-04-14 Thread Alex Turner

3ware claim that their 'software' implemented command queueing
performs at 95% effectiveness compared to the hardware queueing on a
SCSI drive, so I would say that they agree with you.

I'm still learning, but as I read it, the bits are split across the
platters and there is only 'one' head, but happens to be reading from
multiple platters.  The 'further' in linear distance the data is from
the current position, the longer it's going to take to get there. 
This seems to be true based on a document that was circulated.  A hard
drive takes considerable amount of time to 'find' a track on the
platter compared to the rotational speed, which would agree with the
fact that you can read 70MB/sec, but it takes up to 13ms to seek.

the ATA protocol is just how the HBA communicates with the drive,
there is no reason why the HBA can't reschedule reads and writes just
the like SCSI drive would do natively, and this is what infact 3ware
claims.  I get the feeling based on my own historical experience that
generaly drives don't just have a bunch of bad blocks.  This all leads
me to believe that you can predict with pretty good accuracy how
expensive it is to retrieve a given block knowing it's linear
increment.

Alex Turner
netEconomist

On 4/14/05, Kevin Brown [EMAIL PROTECTED] wrote:
 Tom Lane wrote:
  Kevin Brown [EMAIL PROTECTED] writes:
   I really don't see how this is any different between a system that has
   tagged queueing to the disks and one that doesn't.  The only
   difference is where the queueing happens.  In the case of SCSI, the
   queueing happens on the disks (or at least on the controller).  In the
   case of SATA, the queueing happens in the kernel.
 
  That's basically what it comes down to: SCSI lets the disk drive itself
  do the low-level I/O scheduling whereas the ATA spec prevents the drive
  from doing so (unless it cheats, ie, caches writes).  Also, in SCSI it's
  possible for the drive to rearrange reads as well as writes --- which
  AFAICS is just not possible in ATA.  (Maybe in the newest spec...)
 
  The reason this is so much more of a win than it was when ATA was
  designed is that in modern drives the kernel has very little clue about
  the physical geometry of the disk.  Variable-size tracks, bad-block
  sparing, and stuff like that make for a very hard-to-predict mapping
  from linear sector addresses to actual disk locations.
 
 Yeah, but it's not clear to me, at least, that this is a first-order
 consideration.  A second-order consideration, sure, I'll grant that.
 
 What I mean is that when it comes to scheduling disk activity,
 knowledge of the specific physical geometry of the disk isn't really
 important.  What's important is whether or not the disk conforms to a
 certain set of expectations.  Namely, that the general organization is
 such that addressing the blocks in block number order guarantees
 maximum throughput.
 
 Now, bad block remapping destroys that guarantee, but unless you've
 got a LOT of bad blocks, it shouldn't destroy your performance, right?
 
  Combine that with the fact that the drive controller can be much
  smarter than it was twenty years ago, and you can see that the case
  for doing I/O scheduling in the kernel and not in the drive is
  pretty weak.
 
 Well, I certainly grant that allowing the controller to do the I/O
 scheduling is faster than having the kernel do it, as long as it can
 handle insertion of new requests into the list while it's in the
 middle of executing a request.  The most obvious case is when the head
 is in motion and the new request can be satisfied by reading from the
 media between where the head is at the time of the new request and
 where the head is being moved to.
 
 My argument is that a sufficiently smart kernel scheduler *should*
 yield performance results that are reasonably close to what you can
 get with that feature.  Perhaps not quite as good, but reasonably
 close.  It shouldn't be an orders-of-magnitude type difference.
 
 --
 Kevin Brown   [EMAIL PROTECTED]
 
 ---(end of broadcast)---
 TIP 8: explain analyze is your friend


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] How to improve db performance with $7K?

Kevin Brown [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 The reason this is so much more of a win than it was when ATA was
 designed is that in modern drives the kernel has very little clue about
 the physical geometry of the disk.  Variable-size tracks, bad-block
 sparing, and stuff like that make for a very hard-to-predict mapping
 from linear sector addresses to actual disk locations.  

 What I mean is that when it comes to scheduling disk activity,
 knowledge of the specific physical geometry of the disk isn't really
 important.

Oh?

Yes, you can probably assume that blocks with far-apart numbers are
going to require a big seek, and you might even be right in supposing
that a block with an intermediate number should be read on the way.
But you have no hope at all of making the right decisions at a more
local level --- say, reading various sectors within the same cylinder
in an optimal fashion.  You don't know where the track boundaries are,
so you can't schedule in a way that minimizes rotational latency.
You're best off to throw all the requests at the drive together and
let the drive sort it out.

This is not to say that there's not a place for a kernel-side scheduler
too.  The drive will probably have a fairly limited number of slots in
its command queue.  The optimal thing is for those slots to be filled
with requests that are in the same area of the disk.  So you can still
get some mileage out of an elevator algorithm that works on logical
block numbers to give the drive requests for nearby block numbers at the
same time.  But there's also a lot of use in letting the drive do its
own low-level scheduling.

 My argument is that a sufficiently smart kernel scheduler *should*
 yield performance results that are reasonably close to what you can
 get with that feature.  Perhaps not quite as good, but reasonably
 close.  It shouldn't be an orders-of-magnitude type difference.

That might be the case with respect to decisions about long seeks,
but not with respect to rotational latency.  The kernel simply hasn't
got the information.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] How to improve db performance with $7K?

Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  The reason this is so much more of a win than it was when ATA was
  designed is that in modern drives the kernel has very little clue about
  the physical geometry of the disk.  Variable-size tracks, bad-block
  sparing, and stuff like that make for a very hard-to-predict mapping
  from linear sector addresses to actual disk locations.  
 
  What I mean is that when it comes to scheduling disk activity,
  knowledge of the specific physical geometry of the disk isn't really
  important.
 
 Oh?
 
 Yes, you can probably assume that blocks with far-apart numbers are
 going to require a big seek, and you might even be right in supposing
 that a block with an intermediate number should be read on the way.
 But you have no hope at all of making the right decisions at a more
 local level --- say, reading various sectors within the same cylinder
 in an optimal fashion.  You don't know where the track boundaries are,
 so you can't schedule in a way that minimizes rotational latency.

This is true, but has to be examined in the context of the workload.

If the workload is a sequential read, for instance, then the question
becomes whether or not giving the controller a set of sequential
blocks (in block ID order) will get you maximum read throughput.
Given that the manufacturers all attempt to generate the biggest read
throughput numbers, I think it's reasonable to assume that (a) the
sectors are ordered within a cylinder such that reading block x + 1
immediately after block x will incur the smallest possible amount of
delay if requested quickly enough, and (b) the same holds true when
block x + 1 is on the next cylinder.

In the case of pure random reads, you'll end up having to wait an
average of half of a rotation before beginning the read.  Where SCSI
buys you something here is when you have sequential chunks of reads
that are randomly distributed.  The SCSI drive can determine which
block in the set to start with first.  But for that to really be a big
win, the chunks themselves would have to span more than half a track
at least, else you'd have a greater than half a track gap in the
middle of your two sorted sector lists for that track (a really
well-engineered SCSI disk could take advantage of the fact that there
are multiple platters and fill the gap with reads from a different
platter).


Admittedly, this can be quite a big win.  With an average rotational
latency of 4 milliseconds on a 7200 RPM disk, being able to begin the
read at the earliest possible moment will shave at most 25% off the
total average random-access latency, if the average seek time is 12
milliseconds.

 That might be the case with respect to decisions about long seeks,
 but not with respect to rotational latency.  The kernel simply hasn't
 got the information.

True, but that should reduce the total latency by something like 17%
(on average).  Not trivial, to be sure, but not an order of magnitude,
either.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] How to improve db performance with $7K?

Kevin Brown [EMAIL PROTECTED] writes:
 In the case of pure random reads, you'll end up having to wait an
 average of half of a rotation before beginning the read.

You're assuming the conclusion.  The above is true if the disk is handed
one request at a time by a kernel that doesn't have any low-level timing
information.  If there are multiple random requests on the same track,
the drive has an opportunity to do better than that --- if it's got all
the requests in hand.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] How to improve db performance with $7K?

2005-04-07 Thread Douglas J. Trainor

A good one page discussion on the future of SCSI and SATA can
be found in the latest CHIPS (The Department of the Navy Information
Technology Magazine, formerly CHIPS AHOY) in an article by
Patrick G.  Koehler and Lt. Cmdr. Stan Bush.
Click below if you don't mind being logged visiting Space and Naval
Warfare Systems Center Charleston:
http://www.chips.navy.mil/archives/05_Jan/web_pages/scuzzy.htm
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] How to improve db performance with $7K?

2005-04-07 Thread Richard_D_Levine

Another simple question: Why is SCSI more expensive?  After the
eleventy-millionth controller is made, it seems like SCSI and SATA are
using a controller board and a spinning disk.  Is somebody still making
money by licensing SCSI technology?

Rick

[EMAIL PROTECTED] wrote on 04/06/2005 11:58:33 PM:

 You asked for it!  ;-)

 If you want cheap, get SATA.  If you want fast under
 *load* conditions, get SCSI.  Everything else at this
 time is marketing hype, either intentional or learned.
 Ignoring dollars, expect to see SCSI beat SATA by 40%.

  * * * What I tell you three times is true * * *

 Also, compare the warranty you get with any SATA
 drive with any SCSI drive.  Yes, you still have some
 change leftover to buy more SATA drives when they
 fail, but... it fundamentally comes down to some
 actual implementation and not what is printed on
 the cardboard box.  Disk systems are bound by the
 rules of queueing theory.  You can hit the sales rep
 over the head with your queueing theory book.

 Ultra320 SCSI is king of the hill for high concurrency
 databases.  If you're only streaming or serving files,
 save some money and get a bunch of SATA drives.
 But if you're reading/writing all over the disk, the
 simple first-come-first-serve SATA heuristic will
 hose your performance under load conditions.

 Next year, they will *try* bring out some SATA cards
 that improve on first-come-first-serve, but they ain't
 here now.  There are a lot of rigged performance tests
 out there...  Maybe by the time they fix the queueing
 problems, serial Attached SCSI (a/k/a SAS) will be out.
 Looks like Ultra320 is the end of the line for parallel
 SCSI, as Ultra640 SCSI (a/k/a SPI-5) is dead in the
 water.

 Ultra320 SCSI.
 Ultra320 SCSI.
 Ultra320 SCSI.

 Serial Attached SCSI.
 Serial Attached SCSI.
 Serial Attached SCSI.

 For future trends, see:
 http://www.incits.org/archive/2003/in031163/in031163.htm

 douglas

 p.s. For extra credit, try comparing SATA and SCSI drives
 when they're 90% full.

 On Apr 6, 2005, at 8:32 PM, Alex Turner wrote:

  I guess I'm setting myself up here, and I'm really not being ignorant,
  but can someone explain exactly how is SCSI is supposed to better than
  SATA?
 
  Both systems use drives with platters.  Each drive can physically only
  read one thing at a time.
 
  SATA gives each drive it's own channel, but you have to share in SCSI.
   A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
  SCSI can only do 320MB/sec across the entire array.
 
  What am I missing here?
 
  Alex Turner
  netEconomist


 ---(end of broadcast)---
 TIP 9: the planner will ignore your desire to choose an index scan if
your
   joining column's datatypes do not match


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] How to improve db performance with $7K?

2005-04-07 Thread Alex Turner

Based on the reading I'm doing, and somebody please correct me if I'm
wrong, it seems that SCSI drives contain an on disk controller that
has to process the tagged queue.  SATA-I doesn't have this.  This
additional controller, is basicaly an on board computer that figures
out the best order in which to process commands.  I believe you are
also paying for the increased tolerance that generates a better speed.
 If you compare an 80Gig 7200RPM IDE drive to a WD Raptor 76G 10k RPM
to a Seagate 10k.6 drive to a Seagate Cheatah 15k drive, each one
represents a step up in parts and technology, thereby generating a
cost increase (at least thats what the manufactures tell us).  I know
if you ever held a 15k drive in your hand, you can notice a
considerable weight difference between it and a 7200RPM IDE drive.

Alex Turner
netEconomist

On Apr 7, 2005 11:37 AM, [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
 Another simple question: Why is SCSI more expensive?  After the
 eleventy-millionth controller is made, it seems like SCSI and SATA are
 using a controller board and a spinning disk.  Is somebody still making
 money by licensing SCSI technology?
 
 Rick
 
 [EMAIL PROTECTED] wrote on 04/06/2005 11:58:33 PM:
 
  You asked for it!  ;-)
 
  If you want cheap, get SATA.  If you want fast under
  *load* conditions, get SCSI.  Everything else at this
  time is marketing hype, either intentional or learned.
  Ignoring dollars, expect to see SCSI beat SATA by 40%.
 
   * * * What I tell you three times is true * * *
 
  Also, compare the warranty you get with any SATA
  drive with any SCSI drive.  Yes, you still have some
  change leftover to buy more SATA drives when they
  fail, but... it fundamentally comes down to some
  actual implementation and not what is printed on
  the cardboard box.  Disk systems are bound by the
  rules of queueing theory.  You can hit the sales rep
  over the head with your queueing theory book.
 
  Ultra320 SCSI is king of the hill for high concurrency
  databases.  If you're only streaming or serving files,
  save some money and get a bunch of SATA drives.
  But if you're reading/writing all over the disk, the
  simple first-come-first-serve SATA heuristic will
  hose your performance under load conditions.
 
  Next year, they will *try* bring out some SATA cards
  that improve on first-come-first-serve, but they ain't
  here now.  There are a lot of rigged performance tests
  out there...  Maybe by the time they fix the queueing
  problems, serial Attached SCSI (a/k/a SAS) will be out.
  Looks like Ultra320 is the end of the line for parallel
  SCSI, as Ultra640 SCSI (a/k/a SPI-5) is dead in the
  water.
 
  Ultra320 SCSI.
  Ultra320 SCSI.
  Ultra320 SCSI.
 
  Serial Attached SCSI.
  Serial Attached SCSI.
  Serial Attached SCSI.
 
  For future trends, see:
  http://www.incits.org/archive/2003/in031163/in031163.htm
 
  douglas
 
  p.s. For extra credit, try comparing SATA and SCSI drives
  when they're 90% full.
 
  On Apr 6, 2005, at 8:32 PM, Alex Turner wrote:
 
   I guess I'm setting myself up here, and I'm really not being ignorant,
   but can someone explain exactly how is SCSI is supposed to better than
   SATA?
  
   Both systems use drives with platters.  Each drive can physically only
   read one thing at a time.
  
   SATA gives each drive it's own channel, but you have to share in SCSI.
A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
   SCSI can only do 320MB/sec across the entire array.
  
   What am I missing here?
  
   Alex Turner
   netEconomist
 
 
  ---(end of broadcast)---
  TIP 9: the planner will ignore your desire to choose an index scan if
 your
joining column's datatypes do not match
 


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] How to improve db performance with $7K?

2005-04-07 Thread Richard_D_Levine

Yep, that's it, as well as increased quality control. I found this from
Seagate:

http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf

With this quote (note that ES stands for Enterprise System and PS stands
for Personal System):

There is significantly more silicon on ES products. The following
comparison comes from a study done in 2000:
the ES ASIC gate count is more than 2x a PS drive,
the embedded SRAM space for program code is 2x,
the permanent flash memory for program code is 2x,
data SRAM and cache SRAM space is more than 10x.
The complexity of the SCSI/FC interface compared to the
IDE/ATA interface shows up here due in part to the more
complex system architectures in which ES drives find themselves.
ES interfaces support multiple initiators or hosts. The
drive must keep track of separate sets of information for each
host to which it is attached, e.g., maintaining the processor
pointer sets for multiple initiators and tagged commands.
The capability of SCSI/FC to efficiently process commands
and tasks in parallel has also resulted in a higher overhead
kernel structure for the firmware. All of these complexities
and an overall richer command set result in the need for a
more expensive PCB to carry the electronics.

Rick

Alex Turner [EMAIL PROTECTED] wrote on 04/07/2005 10:46:31 AM:

Based on the reading I'm doing, and somebody please correct me if I'm
wrong, it seems that SCSI drives contain an on disk controller that
has to process the tagged queue. SATA-I doesn't have this. This
additional controller, is basicaly an on board computer that figures
out the best order in which to process commands. I believe you are
also paying for the increased tolerance that generates a better speed.
If you compare an 80Gig 7200RPM IDE drive to a WD Raptor 76G 10k RPM
to a Seagate 10k.6 drive to a Seagate Cheatah 15k drive, each one
represents a step up in parts and technology, thereby generating a
cost increase (at least thats what the manufactures tell us). I know
if you ever held a 15k drive in your hand, you can notice a
considerable weight difference between it and a 7200RPM IDE drive.

Alex Turner
netEconomist

On Apr 7, 2005 11:37 AM, [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
Another simple question: Why is SCSI more expensive? After the
eleventy-millionth controller is made, it seems like SCSI and SATA are
using a controller board and a spinning disk. Is somebody still making
money by licensing SCSI technology?

Rick

[EMAIL PROTECTED] wrote on 04/06/2005 11:58:33 PM:

You asked for it! ;-)

If you want cheap, get SATA. If you want fast under
*load* conditions, get SCSI. Everything else at this
time is marketing hype, either intentional or learned.
Ignoring dollars, expect to see SCSI beat SATA by 40%.

* * * What I tell you three times is true * * *

Also, compare the warranty you get with any SATA
drive with any SCSI drive. Yes, you still have some
change leftover to buy more SATA drives when they
fail, but... it fundamentally comes down to some
actual implementation and not what is printed on
the cardboard box. Disk systems are bound by the
rules of queueing theory. You can hit the sales rep
over the head with your queueing theory book.

Ultra320 SCSI is king of the hill for high concurrency
databases. If you're only streaming or serving files,
save some money and get a bunch of SATA drives.
But if you're reading/writing all over the disk, the
simple first-come-first-serve SATA heuristic will
hose your performance under load conditions.

Next year, they will *try* bring out some SATA cards
that improve on first-come-first-serve, but they ain't
here now. There are a lot of rigged performance tests
out there... Maybe by the time they fix the queueing
problems, serial Attached SCSI (a/k/a SAS) will be out.
Looks like Ultra320 is the end of the line for parallel
SCSI, as Ultra640 SCSI (a/k/a SPI-5) is dead in the
water.

Ultra320 SCSI.
Ultra320 SCSI.
Ultra320 SCSI.

Serial Attached SCSI.
Serial Attached SCSI.
Serial Attached SCSI.

For future trends, see:
http://www.incits.org/archive/2003/in031163/in031163.htm

douglas

p.s. For extra credit, try comparing SATA and SCSI drives
when they're 90% full.

On Apr 6, 2005, at 8:32 PM, Alex Turner wrote:

I guess I'm setting myself up here, and I'm really not being
ignorant,
but can someone explain exactly how is SCSI is supposed to better
than
SATA?

Both systems use drives with platters. Each drive can physically
only
read one thing at a time.

SATA gives each drive it's own channel, but you have to share in
SCSI.
A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive,
but
SCSI can only do 320MB/sec across the entire array.

What am I missing here?

Alex

Re: [PERFORM] How to improve db performance with $7K?

2005-04-06 Thread William Yu

Alex Turner wrote:
I'm no drive expert, but it seems to me that our write performance is
excellent.  I think what most are concerned about is OLTP where you
are doing heavy write _and_ heavy read performance at the same time.
Our system is mostly read during the day, but we do a full system
update everynight that is all writes, and it's very fast compared to
the smaller SCSI system we moved off of.  Nearly a 6x spead
improvement, as fast as 900 rows/sec with a 48 byte record, one row
per transaction.
I've started with SATA in a multi-read/multi-write environment. While it 
ran pretty good with 1 thread writing, the addition of a 2nd thread 
(whether reading or writing) would cause exponential slowdowns.

I suffered through this for a week and then switched to SCSI. Single 
threaded performance was pretty similar but with the advanced command 
queueing SCSI has, I was able to do multiple reads/writes simultaneously 
with only a small performance hit for each thread.

Perhaps having a SATA caching raid controller might help this situation. 
I don't know. It's pretty hard justifying buying a $$$ 3ware controller 
just to test it when you could spend the same money on SCSI and have a 
guarantee it'll work good under multi-IO scenarios.

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] How to improve db performance with $7K?

2005-04-06 Thread William Yu

It's the same money if you factor in the 3ware controller. Even without 
a caching controller, SCSI works good in multi-threaded IO (not 
withstanding crappy shit from Dell or Compaq). You can get such cards 
from LSI for $75. And of course, many server MBs come with LSI 
controllers built-in. Our older 32-bit production servers all use Linux 
software RAID w/ SCSI and there's no issues when multiple 
users/processes hit the DB.

*Maybe* a 3ware controller w/ onboard cache + battery backup might do 
much better for multi-threaded IO than just plain-jane SATA. 
Unfortunately, I have not been able to find anything online that can 
confirm or deny this. Hence, the choice is spend $$$ on the 3ware 
controller and hope it meets your needs -- or spend $$$ on SCSI drives 
and be sure.

Now if you want to run such tests, we'd all be delighted with to see the 
results so we have another option for building servers.

Alex Turner wrote:
It's hardly the same money, the drives are twice as much.
It's all about the controller baby with any kind of dive.  A bad SCSI
controller will give sucky performance too, believe me.  We had a
Compaq Smart Array 5304, and it's performance was _very_ sub par.
If someone has a simple benchmark test database to run, I would be
happy to run it on our hardware here.
Alex Turner
On Apr 6, 2005 3:30 AM, William Yu [EMAIL PROTECTED] wrote:
Alex Turner wrote:
I'm no drive expert, but it seems to me that our write performance is
excellent.  I think what most are concerned about is OLTP where you
are doing heavy write _and_ heavy read performance at the same time.
Our system is mostly read during the day, but we do a full system
update everynight that is all writes, and it's very fast compared to
the smaller SCSI system we moved off of.  Nearly a 6x spead
improvement, as fast as 900 rows/sec with a 48 byte record, one row
per transaction.
I've started with SATA in a multi-read/multi-write environment. While it
ran pretty good with 1 thread writing, the addition of a 2nd thread
(whether reading or writing) would cause exponential slowdowns.
I suffered through this for a week and then switched to SCSI. Single
threaded performance was pretty similar but with the advanced command
queueing SCSI has, I was able to do multiple reads/writes simultaneously
with only a small performance hit for each thread.
Perhaps having a SATA caching raid controller might help this situation.
I don't know. It's pretty hard justifying buying a $$$ 3ware controller
just to test it when you could spend the same money on SCSI and have a
guarantee it'll work good under multi-IO scenarios.
---(end of broadcast)---
TIP 8: explain analyze is your friend

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] How to improve db performance with $7K?

Well - unfortuantely software RAID isn't appropriate for everyone, and
some of us need a hardware RAID controller.  The LSI Megaraid 320-2
card is almost exactly the same price as the 3ware 9500S-12 card
(although I will conceed that a 320-2 card can handle at most 2x14
devices compare with the 12 on the 9500S).

If someone can come up with a test, I will be happy to run it and see
how it goes.  I would be _very_ interested in the results having just
spent $7k on a new DB server!!

I have also seen really bad performance out of SATA.  It was with
either an on-board controller, or a cheap RAID controller from
HighPoint.  As soon as I put in a decent controller, things went much
better.  I think it's unfair to base your opinion of SATA from a test
that had a poor controler.

I know I'm not the only one here running SATA RAID and being very
satisfied with the results.

Thanks,

Alex Turner
netEconomist

On Apr 6, 2005 4:01 PM, William Yu [EMAIL PROTECTED] wrote:
 It's the same money if you factor in the 3ware controller. Even without
 a caching controller, SCSI works good in multi-threaded IO (not
 withstanding crappy shit from Dell or Compaq). You can get such cards
 from LSI for $75. And of course, many server MBs come with LSI
 controllers built-in. Our older 32-bit production servers all use Linux
 software RAID w/ SCSI and there's no issues when multiple
 users/processes hit the DB.
 
 *Maybe* a 3ware controller w/ onboard cache + battery backup might do
 much better for multi-threaded IO than just plain-jane SATA.
 Unfortunately, I have not been able to find anything online that can
 confirm or deny this. Hence, the choice is spend $$$ on the 3ware
 controller and hope it meets your needs -- or spend $$$ on SCSI drives
 and be sure.
 
 Now if you want to run such tests, we'd all be delighted with to see the
 results so we have another option for building servers.
 
 
 Alex Turner wrote:
  It's hardly the same money, the drives are twice as much.
 
  It's all about the controller baby with any kind of dive.  A bad SCSI
  controller will give sucky performance too, believe me.  We had a
  Compaq Smart Array 5304, and it's performance was _very_ sub par.
 
  If someone has a simple benchmark test database to run, I would be
  happy to run it on our hardware here.
 
  Alex Turner
 
  On Apr 6, 2005 3:30 AM, William Yu [EMAIL PROTECTED] wrote:
 
 Alex Turner wrote:
 
 I'm no drive expert, but it seems to me that our write performance is
 excellent.  I think what most are concerned about is OLTP where you
 are doing heavy write _and_ heavy read performance at the same time.
 
 Our system is mostly read during the day, but we do a full system
 update everynight that is all writes, and it's very fast compared to
 the smaller SCSI system we moved off of.  Nearly a 6x spead
 improvement, as fast as 900 rows/sec with a 48 byte record, one row
 per transaction.
 
 I've started with SATA in a multi-read/multi-write environment. While it
 ran pretty good with 1 thread writing, the addition of a 2nd thread
 (whether reading or writing) would cause exponential slowdowns.
 
 I suffered through this for a week and then switched to SCSI. Single
 threaded performance was pretty similar but with the advanced command
 queueing SCSI has, I was able to do multiple reads/writes simultaneously
 with only a small performance hit for each thread.
 
 Perhaps having a SATA caching raid controller might help this situation.
 I don't know. It's pretty hard justifying buying a $$$ 3ware controller
 just to test it when you could spend the same money on SCSI and have a
 guarantee it'll work good under multi-IO scenarios.
 
 ---(end of broadcast)---
 TIP 8: explain analyze is your friend
 
 
 
  ---(end of broadcast)---
  TIP 2: you can get off all lists at once with the unregister command
  (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
 
 
 ---(end of broadcast)---
 TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] How to improve db performance with $7K?

2005-04-06 Thread Jim C. Nasby

Sorry if I'm pointing out the obvious here, but it seems worth
mentioning. AFAIK all 3ware controllers are setup so that each SATA
drive gets it's own SATA bus. My understanding is that by and large,
SATA still suffers from a general inability to have multiple outstanding
commands on the bus at once, unlike SCSI. Therefore, to get good
performance out of SATA you need to have a seperate bus for each drive.
Theoretically, it shouldn't really matter that it's SATA over ATA, other
than I certainly wouldn't want to try and cram 8 ATA cables into a
machine...

Incidentally, when we were investigating storage options at a previous
job we talked to someone who deals with RS/6000 storage. He had a bunch
of info about their serial controller protocol (which I can't think of
the name of) vs SCSI. SCSI had a lot more overhead, so you could end up
saturating even a 160MB SCSI bus with only 2 or 3 drives.

People are finally realizing how important bandwidth has become in
modern machines. Memory bandwidth is why RS/6000 was (and maybe still
is) cleaning Sun's clock, and it's why the Opteron blows Itaniums out of
the water. Likewise it's why SCSI is so much better than IDE (unless you
just give each drive it's own dedicated bandwidth).
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] How to improve db performance with $7K?

I guess I'm setting myself up here, and I'm really not being ignorant,
but can someone explain exactly how is SCSI is supposed to better than
SATA?

Both systems use drives with platters.  Each drive can physically only
read one thing at a time.

SATA gives each drive it's own channel, but you have to share in SCSI.
 A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
SCSI can only do 320MB/sec across the entire array.

What am I missing here?

Alex Turner
netEconomist

On Apr 6, 2005 5:41 PM, Jim C. Nasby [EMAIL PROTECTED] wrote:
 Sorry if I'm pointing out the obvious here, but it seems worth
 mentioning. AFAIK all 3ware controllers are setup so that each SATA
 drive gets it's own SATA bus. My understanding is that by and large,
 SATA still suffers from a general inability to have multiple outstanding
 commands on the bus at once, unlike SCSI. Therefore, to get good
 performance out of SATA you need to have a seperate bus for each drive.
 Theoretically, it shouldn't really matter that it's SATA over ATA, other
 than I certainly wouldn't want to try and cram 8 ATA cables into a
 machine...
 
 Incidentally, when we were investigating storage options at a previous
 job we talked to someone who deals with RS/6000 storage. He had a bunch
 of info about their serial controller protocol (which I can't think of
 the name of) vs SCSI. SCSI had a lot more overhead, so you could end up
 saturating even a 160MB SCSI bus with only 2 or 3 drives.
 
 People are finally realizing how important bandwidth has become in
 modern machines. Memory bandwidth is why RS/6000 was (and maybe still
 is) cleaning Sun's clock, and it's why the Opteron blows Itaniums out of
 the water. Likewise it's why SCSI is so much better than IDE (unless you
 just give each drive it's own dedicated bandwidth).
 --
 Jim C. Nasby, Database Consultant   [EMAIL PROTECTED]
 Give your computer some brain candy! www.distributed.net Team #1828
 
 Windows: Where do you want to go today?
 Linux: Where do you want to go tomorrow?
 FreeBSD: Are you guys coming, or what?


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] How to improve db performance with $7K?

Ok - so I found this fairly good online review of various SATA cards
out there, with 3ware not doing too hot on RAID 5, but ok on RAID 10.

http://www.tweakers.net/reviews/557/

Very interesting stuff.

Alex Turner
netEconomist

On Apr 6, 2005 7:32 PM, Alex Turner [EMAIL PROTECTED] wrote:
 I guess I'm setting myself up here, and I'm really not being ignorant,
 but can someone explain exactly how is SCSI is supposed to better than
 SATA?
 
 Both systems use drives with platters.  Each drive can physically only
 read one thing at a time.
 
 SATA gives each drive it's own channel, but you have to share in SCSI.
  A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
 SCSI can only do 320MB/sec across the entire array.
 
 What am I missing here?
 
 Alex Turner
 netEconomist
 
 On Apr 6, 2005 5:41 PM, Jim C. Nasby [EMAIL PROTECTED] wrote:
  Sorry if I'm pointing out the obvious here, but it seems worth
  mentioning. AFAIK all 3ware controllers are setup so that each SATA
  drive gets it's own SATA bus. My understanding is that by and large,
  SATA still suffers from a general inability to have multiple outstanding
  commands on the bus at once, unlike SCSI. Therefore, to get good
  performance out of SATA you need to have a seperate bus for each drive.
  Theoretically, it shouldn't really matter that it's SATA over ATA, other
  than I certainly wouldn't want to try and cram 8 ATA cables into a
  machine...
 
  Incidentally, when we were investigating storage options at a previous
  job we talked to someone who deals with RS/6000 storage. He had a bunch
  of info about their serial controller protocol (which I can't think of
  the name of) vs SCSI. SCSI had a lot more overhead, so you could end up
  saturating even a 160MB SCSI bus with only 2 or 3 drives.
 
  People are finally realizing how important bandwidth has become in
  modern machines. Memory bandwidth is why RS/6000 was (and maybe still
  is) cleaning Sun's clock, and it's why the Opteron blows Itaniums out of
  the water. Likewise it's why SCSI is so much better than IDE (unless you
  just give each drive it's own dedicated bandwidth).
  --
  Jim C. Nasby, Database Consultant   [EMAIL PROTECTED]
  Give your computer some brain candy! www.distributed.net Team #1828
 
  Windows: Where do you want to go today?
  Linux: Where do you want to go tomorrow?
  FreeBSD: Are you guys coming, or what?
 


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [PERFORM] How to improve db performance with $7K?

Ok - I take it back - I'm reading through this now, and realising that
the reviews are pretty clueless in several places...


On Apr 6, 2005 8:12 PM, Alex Turner [EMAIL PROTECTED] wrote:
 Ok - so I found this fairly good online review of various SATA cards
 out there, with 3ware not doing too hot on RAID 5, but ok on RAID 10.
 
 http://www.tweakers.net/reviews/557/
 
 Very interesting stuff.
 
 Alex Turner
 netEconomist
 
 On Apr 6, 2005 7:32 PM, Alex Turner [EMAIL PROTECTED] wrote:
  I guess I'm setting myself up here, and I'm really not being ignorant,
  but can someone explain exactly how is SCSI is supposed to better than
  SATA?
 
  Both systems use drives with platters.  Each drive can physically only
  read one thing at a time.
 
  SATA gives each drive it's own channel, but you have to share in SCSI.
   A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
  SCSI can only do 320MB/sec across the entire array.
 
  What am I missing here?
 
  Alex Turner
  netEconomist
 
  On Apr 6, 2005 5:41 PM, Jim C. Nasby [EMAIL PROTECTED] wrote:
   Sorry if I'm pointing out the obvious here, but it seems worth
   mentioning. AFAIK all 3ware controllers are setup so that each SATA
   drive gets it's own SATA bus. My understanding is that by and large,
   SATA still suffers from a general inability to have multiple outstanding
   commands on the bus at once, unlike SCSI. Therefore, to get good
   performance out of SATA you need to have a seperate bus for each drive.
   Theoretically, it shouldn't really matter that it's SATA over ATA, other
   than I certainly wouldn't want to try and cram 8 ATA cables into a
   machine...
  
   Incidentally, when we were investigating storage options at a previous
   job we talked to someone who deals with RS/6000 storage. He had a bunch
   of info about their serial controller protocol (which I can't think of
   the name of) vs SCSI. SCSI had a lot more overhead, so you could end up
   saturating even a 160MB SCSI bus with only 2 or 3 drives.
  
   People are finally realizing how important bandwidth has become in
   modern machines. Memory bandwidth is why RS/6000 was (and maybe still
   is) cleaning Sun's clock, and it's why the Opteron blows Itaniums out of
   the water. Likewise it's why SCSI is so much better than IDE (unless you
   just give each drive it's own dedicated bandwidth).
   --
   Jim C. Nasby, Database Consultant   [EMAIL PROTECTED]
   Give your computer some brain candy! www.distributed.net Team #1828
  
   Windows: Where do you want to go today?
   Linux: Where do you want to go tomorrow?
   FreeBSD: Are you guys coming, or what?
  
 


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] How to improve db performance with $7K?

2005-04-06 Thread Greg Stark


Alex Turner [EMAIL PROTECTED] writes:

 SATA gives each drive it's own channel, but you have to share in SCSI.
  A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
 SCSI can only do 320MB/sec across the entire array.

SCSI controllers often have separate channels for each device too.

In any case the issue with the IDE protocol is that fundamentally you can only
have a single command pending. SCSI can have many commands pending. This is
especially important for a database like postgres that may be busy committing
one transaction while another is trying to read. Having several commands
queued on the drive gives it a chance to execute any that are on the way to
the committing transaction.

However I'm under the impression that 3ware has largely solved this problem.
Also, if you save a few dollars and can afford one additional drive that
additional drive may improve your array speed enough to overcome that
inefficiency.

-- 
greg


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] How to improve db performance with $7K?