Re: [PERFORM] Scaling further up

2004-06-10 Thread Anjan Dave
Can you describe the vendors/components of a cheap SAN setup?

Thanks,
Anjan

-Original Message-
From: Rod Taylor [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 02, 2004 5:57 PM
To: Scott Marlowe
Cc: Anjan Dave; Chris Ruprecht; [EMAIL PROTECTED]; William Yu;
Postgresql Performance
Subject: Re: [PERFORM] Scaling further up


 For speed, the X86 32 and 64 bit architectures seem to be noticeably
 faster than Sparc.  However, running Linux or BSD on Sparc makes them
 pretty fast too, but you lose the fault-tolerant support for things like
 hot-swappable CPUs or memory.

Agreed. You can get a Quad Opteron with 16GB memory for around 20K.

Grab three, a cheap SAN, and set up a little master/slave replication with
failover (how is Slony coming?), and you're all set.





Re: [PERFORM] Scaling further up

2004-06-09 Thread Anjan Dave
For the disks part - I am looking at a SAN implementation, and I will be
planning a separate RAID group for the WALs.

The controller is a PERC, with 128MB cache, and I think it is writeback.

Other than the disks, I am curious what other people are using in terms
of the horsepower needed. The Quad server has been keeping up, but we
are expecting quite high loads in the near future, and I am not sure whether
just having the disks on high-end storage will do it.

Thanks,
Anjan


-Original Message-
From: Magnus Hagander [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 01, 2004 3:54 PM
To: Anjan Dave; [EMAIL PROTECTED]
Subject: RE: [PERFORM] Scaling further up


 All:
 
 We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running
 RH9, PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives.
 
 We are expecting a pretty high load, a few thousand 'concurrent' users
 executing select, insert, or update statements.
 
 What is the next step up in terms of handling very heavy loads?
 Clustering?

I'd look at adding more disks first. Depending on what type of query
load you get, that box sounds like it will be very much I/O bound. More
spindles = more parallel operations = faster under load. Consider
adding 15K RPM disks as well; they're not all that much more expensive,
and should give you better performance than 10K RPM.

Also, make sure you put your WAL disks on a separate RAID set if possible
(not just a separate partition on an existing RAID set).

Finally, if you don't already have it, look for a battery-backed RAID
controller that can do write-back caching, and enable that. (Don't even
think about enabling it unless it's battery-backed!) And add as much RAM
as you can to that controller.


//Magnus



Re: [PERFORM] Scaling further up

2004-03-15 Thread Andrew Sullivan
On Fri, Mar 12, 2004 at 06:25:48PM -0500, Anjan Dave wrote:
 We upgraded from 8GB to 12GB RAM a month or so ago, but even in the
 past, I've never seen the system exhaust its system cache (~6GB, in
 'top') while it's swapping.
 
 Someone had mentioned why not have the entire DB in memory? How do I
 configure that, for knowledge?

You don't.  It'll automatically be in memory if (a) you have enough
memory, (b) you don't have anything else on the machine using the
memory, and (c) it's been read at least one time.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]



Re: [PERFORM] Scaling further up

2004-03-15 Thread Stephen Robert Norris
On Tue, 2004-03-16 at 07:28, Matt Davies wrote:
 This is the preferred method, but you could create a memory disk if running
 Linux. This has several caveats, though.
 
 1. You may have to recompile the kernel for support.
 2. You must store the database on a hard drive partition across reboots.
 3. Because of #2, this option is generally only useful if you have static
 content that is loaded onto the memory disk at system startup.

And 4. You use twice as much memory - one lot for the FS, the second for
buffer cache.

It's generally going to be slower than simply doing some typical queries
to preload the data into buffer cache, I think.
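
For what it's worth, a minimal sketch of that warm-up approach, assuming a
Python client with the psycopg2 driver and purely hypothetical connection
settings and table names:

    import psycopg2

    # Hypothetical connection string and table list -- adjust for your setup.
    conn = psycopg2.connect("dbname=mydb user=postgres")
    cur = conn.cursor()
    for table in ("orders", "order_lines", "customers"):
        # A full sequential scan pulls the table's pages through the OS cache
        # (and some of them into PostgreSQL's shared buffers).
        cur.execute("SELECT count(*) FROM %s" % table)
        print(table, cur.fetchone()[0])
    cur.close()
    conn.close()

Indexes aren't touched by this, so queries that hit them cold will still pay
some read I/O.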

Stephen




Re: [PERFORM] Scaling further up

2004-03-14 Thread Aaron Werman
Sorry about not chiming in before - I've been too swamped to think. I agree
with most of the points, but a lot of these posts are interesting and seem
to describe systems from an SA perspective rather than from my DBA-centric view.

- Original Message - 
From: Marty Scholes [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, March 10, 2004 6:29 PM
Subject: Re: [PERFORM] Scaling further up


 I have some suggestions based on my anecdotal experience.

 1. This is a relatively small DB -- the working set will likely be in
 RAM at any moment in time, making read I/O time mostly irrelevant.

 2. The killer will be write times -- specifically log writes.  Small and
 heavily synchronized writes, log and data writes, will drag down an
 impressive hardware RAID setup.  We run mirrored hardware RAID 5 arrays
 with write back cache and are constantly seeking ways to improve write
 performance.  We do a lot of batch processing, though, so we do a lot of
 write I/Os.

My experience with RAID 5 for streaming sequential writes is bad. This is
sometimes helped by the hardware caching covering the cost of the additional
I/Os for striping (write-through RAID 5 + big cache acts like RAID 1+0 until
you run out of cache). Batch processing is different from high-concurrency
transactions because it needs faster volume streaming, while TP is dependent
on the speed of ack'ing (few big writes with less synchronous waiting vs. lots
of small writes which serialize everyone). (RAID 3 worked for me in the past
for logging, but I haven't used it in years.)


 3. Be very careful with battery backed write cache.  It usually works
 as advertised.  More than once in the past decade I have seen
 spontaneous cache corruption after power loss.  The corruption usually
 happens when some admin, including me, has assumed that the cache will
 ALWAYS survive a power failure unblemished and has no plan B.  Make
 sure you have a contingency plan for corruption, or don't enable the
cache.

I agree strongly. There is also the same problem with disk write-back cache
and even with SCSI controllers with write-through enabled. PITR would help
here. A lot of these problems are due to procedural error after a crash.


 4. RAID 10 will likely have bigger stripe sizes on the RAID 0 portion of
 the setup, and might hinder, not help small write I/O performance.

In a high-volume system without write caching you are almost always going to
see queuing, which can make the larger buffer mostly irrelevant if it's not
huge. Write caching thrives on big block sizes (which is a key reason why
Symmetrix doesn't do worse than it does) by reducing I/O counts. Most shops
I've set up or seen use mirroring or RAID 10 for logs. Note also that many
RAID 10 controllers in a non-write-cached setup allow a race between the two
writers, acknowledging when the first of the two completes - increasing
throughput by about 1/4.


 5. Most (almost all) of the I/O time will be due to the access time
 (head seek + head settle + rotational latency) and very little of the
 I/O time will be due to data transfer time.  In other words, getting drives
 that provide faster transfer rates will barely improve performance.  The
 secret is lowering the access time.

True. This is very much a latency story. Even in volume batch work, you can
see access times that clearly point to some other system-configuration
bottleneck being hit before you reach I/O capacity.


 6. A relatively cheap way to drastically drop the access time is to get
 large drive(s) and only use a portion of them for storage.  The less
 space used on the drive, the less area the heads need to cover for
 seeks.  At one extreme, you could make the partition the size of a
 single cylinder.  This would make access time (ignoring OS and
 controller overhead) identical to rotational latency, which is as low as
 4.2 ms for a cheap 7200 RPM drive.

This is a good strategy for VLDB, and may not be relevant in this case.

Also - big sequential writes and 15K rpm drives, in the case of
write-through, are a beautiful thing - they look like a manufacturer's demo. A
primary performance role of an RDBMS is to convert random I/O to sequential
(by buffering reads and using a streaming log to defer random writes to
checkpoints). RDBMSs are the prime beneficiaries of drive speed
improvements - since logging, backups, and copies are about the only things
(ignoring bad perl scripts and find commands) that generate loads of 50+
MB/sec.

/Aaron



Re: [PERFORM] Scaling further up

2004-03-13 Thread Marty Scholes
I have some suggestions based on my anecdotal experience.

1. This is a relatively small DB -- the working set will likely be in 
RAM at any moment in time, making read I/O time mostly irrelevant.

2. The killer will be write times -- specifically log writes.  Small and 
heavily synchronized writes, log and data writes, will drag down an 
impressive hardware RAID setup.  We run mirrored hardware RAID 5 arrays 
with write back cache and are constantly seeking ways to improve write 
performance.  We do a lot of batch processing, though, so we do a lot of 
write I/Os.

3. Be very careful with battery backed write cache.  It usually works 
as advertised.  More than once in the past decade I have seen 
spontaneous cache corruption after power loss.  The corruption usually 
happens when some admin, including me, has assumed that the cache will 
ALWAYS survive a power failure unblemished and has no plan B.  Make 
sure you have a contingency plan for corruption, or don't enable the cache.

4. RAID 10 will likely have bigger stripe sizes on the RAID 0 portion of 
the setup, and might hinder, not help small write I/O performance.

5. Most (almost all) of the I/O time will be due to the access time 
(head seek + head settle + rotational latency) and very little of the 
I/O time will be due to data transfer time.  In other words, getting drives 
that provide faster transfer rates will barely improve performance.  The 
secret is lowering the access time.

6. A relatively cheap way to drastically drop the access time is to get 
large drive(s) and only use a portion of them for storage.  The less 
space used on the drive, the less area the heads need to cover for 
seeks.  At one extreme, you could make the partition the size of a 
single cylinder.  This would make access time (ignoring OS and 
controller overhead) identical to rotational latency, which is as low as 
4.2 ms for a cheap 7200 RPM drive.

7. A drive with a 5 ms average service time, servicing 8 KB blocks, will 
yield as much as 1.6 MB/s sustained write throughput.  Not bad for a 
cheap uncached solution.  Any OS aggregation of writes during the 
fsync() call will further improve this number -- it is basically a lower 
bound for throughput.

8. Many people, especially managers, cannot stomach buying disk space 
and only using a portion of it.  In many cases, it seems more palatable 
to purchase a much more expensive solution to get to the same speeds.

Good luck.
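
For anyone who wants to sanity-check the arithmetic in point 7, it is just
I/Os per second times block size; a small Python sketch with the same
illustrative numbers:

    # Lower-bound throughput for small synchronous writes (point 7 above).
    service_time_s = 0.005      # 5 ms average service time per I/O
    block_size_kb = 8           # 8 KB blocks

    iops = 1 / service_time_s                       # ~200 I/Os per second
    throughput_mb_s = iops * block_size_kb / 1024.0
    print("IOPS: %.0f, sustained write throughput: %.1f MB/s"
          % (iops, throughput_mb_s))
    # -> IOPS: 200, sustained write throughput: 1.6 MB/s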

scott.marlowe wrote:
On Wed, 3 Mar 2004, Paul Thomas wrote:

 
  On 02/03/2004 23:25 johnn wrote:
   [snip]
   random_page_cost should be set with the following things taken into
   account:
     - seek speed
 
  Which is not exactly the same thing as spindle speed as it's a combination
  of spindle speed and track-to-track speed. I think you'll find that a 15K
  rpm disk, whilst it will probably have a lower seek time than a 10K rpm
  disk, won't have a proportionately (i.e., 2/3rds) lower seek time.

There are three factors that affect how fast you can get to the next
sector:
seek time
settle time
rotational latency
Most drives only list the first, and don't bother to mention the other
two.
On many modern drives, the seek times are around 5 to 10 milliseconds.
The settle time varies as well.  The longer the seek, the longer the
settle, generally.  This is the time it takes for the head to stop shaking
and rest quietly over a particular track.
Rotational latency is the amount of time you have to wait, on average, for
the sector you want to come under the heads.
Assuming an 8 ms seek, and 2 ms settle (typical numbers), and that the
rotational latency on average is 1/2 of a rotation:  At 10k rpm, a
rotation takes 1/166.667 of a second, or 6 ms.  So, half a rotation is
approximately 3 ms.  By going to a 15k rpm drive, the latency drops to 2
ms.  So, if we add them up, on the same basic drive, one being 10k and one
being 15k, we get:
10k rpm: 8+2+3 = 13 ms
15k rpm: 8+2+2 = 12 ms
So, based on the decrease in rotational latency being the only advantage
the 15k rpm drive has over the 10k rpm drive, we get a decrease in access
time of only 1 ms, or only about an 8% decrease in actual access time.
So, if your random_page_cost on 10k rpm drives was 1.7, you'd need to
drop it to 1.57 or so to reflect the speed increase from 15k rpm drives.
I.e. it's much more likely that going from 1 gig to 2 gigs of RAM will
make a noticeable difference than going from 10k to 15k drives.
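
The same arithmetic as a short Python sketch, for anyone who wants to plug in
their own drive numbers (the values below are the illustrative ones from the
paragraph above, not measurements):

    def access_time_ms(seek_ms, settle_ms, rpm):
        # Average rotational latency is half a revolution.
        rotational_latency_ms = 0.5 * 60000.0 / rpm
        return seek_ms + settle_ms + rotational_latency_ms

    t10k = access_time_ms(8, 2, 10000)   # ~13 ms
    t15k = access_time_ms(8, 2, 15000)   # ~12 ms
    print("10k rpm: %.0f ms, 15k rpm: %.0f ms" % (t10k, t15k))

    # Scale random_page_cost by the ratio of access times (a rough rule of
    # thumb, not anything the planner does itself).
    print("random_page_cost 1.7 -> %.2f" % (1.7 * t15k / t10k))   # ~1.57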






Re: [PERFORM] Scaling further up

2004-03-12 Thread Robert Treat
On Mon, 2004-03-08 at 11:40, William Yu wrote:
 Anjan Dave wrote:
  Great response, Thanks.
  
  Regarding 12GB memory and 13G db, and almost no I/O, one thing I don't
  understand is that even though the OS caches most of the memory and PG
  can use it if it needs it, why would the system swap (not much, only
  during peak times)? The SHMMAX is set to 512MB, shared_buffers is 150MB,
  effective cache size is 2GB, sort mem is 2MB, rest is default values. It
  also happens that a large query (reporting type) can hold up the other
  queries, and the load averages shoot up during peak times.
 
 In regards to your system going to swap, the only item I see is sort_mem 
 at 2MB. How many simultaneous transactions do you get? If you get 
 hundreds or thousands like your first message stated, every select sort 
 would take up 2MB of memory regardless of whether it needed it or not. 
 That could cause your swap activity during peak traffic.
 
 The only other item to bump up is the effective cache size -- I'd set it 
 to 12GB.
 

I was surprised that no one corrected this bit of erroneous info (or at
least I didn't see it), so I thought I would for completeness. A basic
explanation is that sort_mem controls how much memory a given sort is
allowed to use before spilling to disk, but it will not grab that much
memory if it doesn't need it. 

See the docs for a more detailed explanation:
http://www.postgresql.org/docs/7.4/interactive/runtime-config.html#RUNTIME-CONFIG-RESOURCE
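
If the worry is one big reporting query rather than every session, note that
sort_mem can also be raised for a single session instead of server-wide; a
minimal sketch, assuming a Python client with the psycopg2 driver and a
hypothetical reporting query:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()
    # sort_mem is in kilobytes in 7.4; give this one session 32 MB for its
    # sorts instead of raising the server-wide default.
    cur.execute("SET sort_mem = 32768")
    cur.execute("SELECT customer_id, sum(amount) FROM orders "
                "GROUP BY customer_id ORDER BY 2 DESC")
    rows = cur.fetchall()
    cur.close()
    conn.close()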

Robert Treat
-- 
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL




Re: [PERFORM] Scaling further up

2004-03-09 Thread Aaron W

I'd look at adding more disks first. Depending on what
type of query
load you get, that box sounds like it will be very
much I/O bound

Given a 13G database on a 12G system, with a low
growth rate, it is likely that there is almost no I/O
for most activities. The exception is checkpointing.
The first thing I'd do is try to build a spreadsheet
model of:
- select frequency, and # logical and physical reads
involved
- insert/delete/update frequency, and # logical and
physical read and writes involved
- commit frequency, etc.
(start out with simplistic assumptions, and do it for
peak load)
- system events (checkpoints, vacuum)

I assume that the only high I/O you will see will be
for logging. The RAID issue there is basically
obviated by the sequential write nature of WAL. If
that is the case, EMC is not the most cost effective
or obvious  solution - since the value they provide is
mostly manageability for disaster recovery. The goal
in this case is to write at the application max speed,
and with minimal latency. Any responsible battery
backed up write through (mirrored) cached controller
can do that for you.

On the other hand, if your requests are not *all*
trivial, you are going to test the hardware and
scheduling algorithms of OS and pg. Even if 0.1% of
3,000 tps take a second - that ends up generating 3
seconds of load. Any even slightly slow
transactions will generate enormous queues which slow
down everything. 

In most systems of this volume I've seen, the mix of
activities is constantly invalidating cache, making L2
caching less important. Memory to CPU bus speed is a
limiting factor, as well as raw CPU speed in
processing the requests. Xeon is not a great
architecture for this because of FSB contention; I
suspect a 4-way will be completely FSB bottlenecked so
a more than 4 way would likely not change performance.


I would try to get a simple model/benchmark going and
test against it. You should be talking to the big iron
vendors for their take on your issues and get their
capacity benchmarks.
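
A toy version of that spreadsheet model as a Python sketch; every frequency
and per-transaction I/O count below is a made-up placeholder, there only to
show the shape of the calculation:

    # Rough peak-load I/O model: physical reads/writes per second by activity.
    workload = {
        # name: (txn/sec, logical reads, physical reads, physical writes)
        "select": (2500, 20, 0.5, 0),
        "update": (400, 10, 0.2, 3),
        "insert": (100, 5, 0.1, 2),
    }
    commits_per_sec = 500
    wal_writes_per_commit = 1

    phys_reads = sum(tps * pr for tps, _, pr, _ in workload.values())
    phys_writes = sum(tps * pw for tps, _, _, pw in workload.values())
    wal_writes = commits_per_sec * wal_writes_per_commit

    print("data reads/sec:  %.0f" % phys_reads)
    print("data writes/sec: %.0f" % phys_writes)
    print("WAL writes/sec:  %.0f" % wal_writes)

Checkpoints and vacuum would be added as periodic bursts on top of this.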




Re: [PERFORM] Scaling further up

2004-03-04 Thread Anjan Dave
Great response, Thanks.

Regarding 12GB memory and 13G db, and almost no I/O, one thing I don't
understand is that even though the OS caches most of the memory and PG
can use it if it needs it, why would the system swap (not much, only
during peak times)? The SHMMAX is set to 512MB, shared_buffers is 150MB,
effective cache size is 2GB, sort mem is 2MB, rest is default values. It
also happens that a large query (reporting type) can hold up the other
queries, and the load averages shoot up during peak times.

Regarding a baseline - 

-We have documentation and monitoring for the frequency of SQL statements,
the most expensive ones, etc. (IronEye)
-I am monitoring disk reads/writes using iostat
-How do I measure commit frequency, and system events like checkpoints?
(vacuum is done nightly during periods of little or no load)

Thanks,
Anjan


-Original Message-
From: Aaron W [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 04, 2004 8:58 AM
To: [EMAIL PROTECTED]; Anjan Dave
Subject: Re: Scaling further up



I'd look at adding more disks first. Depending on what
type of query
load you get, that box sounds like it will be very
much I/O bound

Given a 13G database on a 12G system, with a low
growth rate, it is likely that there is almost no I/O
for most activities. The exception is checkpointing.
The first thing I'd do is try to build a spreadsheet
model of:
- select frequency, and # logical and physical reads
involved
- insert/delete/update frequency, and # logical and
physical read and writes involved
- commit frequency, etc.
(start out with simplistic assumptions, and do it for
peak load)
- system events (checkpoints, vacuum)

I assume that the only high I/O you will see will be
for logging. The RAID issue there is basically
obviated by the sequential write nature of WAL. If
that is the case, EMC is not the most cost effective
or obvious  solution - since the value they provide is
mostly manageability for disaster recovery. The goal
in this case is to write at the application max speed,
and with minimal latency. Any responsible battery
backed up write through (mirrored) cached controller
can do that for you.

On the other hand, if your requests are not *all*
trivial, you are going to test the hardware and
scheduling algorithms of OS and pg. Even if 0.1% of
3,000 tps take a second - that ends up generating 3
seconds of load. Any even slightly slow
transactions will generate enormous queues which slow
down everything. 

In most systems of this volume I've seen, the mix of
activities is constantly invalidating cache, making L2
caching less important. Memory to CPU bus speed is a
limiting factor, as well as raw CPU speed in
processing the requests. Xeon is not a great
architecture for this because of FSB contention; I
suspect a 4-way will be completely FSB bottlenecked so
a more than 4 way would likely not change performance.


I would try to get a simple model/benchmark going and
test against it. You should be talking to the big iron
vendors for their take on your issues and get their
capacity benchmarks.




Re: [PERFORM] Scaling further up

2004-03-04 Thread scott.marlowe
On Thu, 4 Mar 2004, Paul Thomas wrote:

 On 03/03/2004 18:23 scott.marlowe wrote:
  [snip]
  There are three factors that affect how fast you can get to the next
  sector:
  
  seek time
  settle time
  rotational latency
  
  Most drives only list the first, and don't bother to mention the other
  two.
 
 Ah yes, one of my (very) few still functioning brain cells was nagging 
 about another bit of time in the equation :)
 
  On many modern drives, the seek times are around 5 to 10 milliseconds.
  [snip]
 
 Going back to the OP's posting about random_page_cost, imagine I have 2 
 servers identical in every way except the disk drive. Server A has a 10K 
 rpm drive and server B has a 15K rpm drive. Seek/settle times aren't 
 spectacularly different between the 2 drives. I'm wondering if drive B 
 might actually merit a _higher_ random_page_cost than drive A as, once it 
 gets settled on a disk track, it can suck the data off a lot faster. 
 opinions/experiences anyone?

It might well be that you have higher settle times that offset the small 
gain in rotational latency.  I haven't looked into it, so I don't know one 
way or the other, but it seems a reasonable assumption.

However, a common misconception is that the higher angular velocity of 
the 15krpm drives would allow you to read data faster.  In fact, the limit 
of how fast you can read is set by the head.  There's a maximum frequency 
that it can read, and the areal density / rpm have to be such that you 
don't exceed that frequency.  Often, the speed at which you read off the 
platters is exactly the same between a 10k and 15k of the same family.  

The required lower areal density is the reason 15krpm drives show up in 
the lower capacities first.




Re: [PERFORM] Scaling further up

2004-03-03 Thread Paul Thomas
On 02/03/2004 23:25 johnn wrote:
[snip]
random_page_cost should be set with the following things taken into
account:
  - seek speed
Which is not exactly the same thing as spindle speed as it's a combination 
of spindle speed and track-to-track speed. I think you'll find that a 15K 
rpm disk, whilst it will probably have a lower seek time than a 10K rpm 
disk, won't have a proportionately (i.e., 2/3rds) lower seek time.

  - likelihood of page to be cached in memory by the kernel
That's effective cache size.

--
Paul Thomas
| Thomas Micro Systems Limited | Software Solutions for the Smaller Business |
| Computer Consultants         | http://www.thomas-micro-systems-ltd.co.uk   |



Re: [PERFORM] Scaling further up

2004-03-03 Thread Andrew Sullivan
On Tue, Mar 02, 2004 at 04:50:04PM -0500, Anjan Dave wrote:
 time/resources to do extensive testing,  I am not sure if
 Postgres/Solaris9 is really suggested by the community for
 high-performance, as opposed to a XEON/Linux setup. Storage being a
 separate discussion.

I can tell you from experience that performance on Solaris is nowhere
close to what you'd expect, given the coin you're forking over for
it.  I think the reason to use Solaris is its support for all the
nifty hot-swappable hardware, and not for its speed or any putative
benefit you might get from having 64 bits at your disposal.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun



Re: [PERFORM] Scaling further up

2004-03-03 Thread Paul Thomas
On 03/03/2004 18:23 scott.marlowe wrote:
[snip]
There are three factors that affect how fast you can get to the next
sector:
seek time
settle time
rotational latency
Most drives only list the first, and don't bother to mention the other
two.
Ah yes, one of my (very) few still functioning brain cells was nagging 
about another bit of time in the equation :)

On many modern drives, the seek times are around 5 to 10 milliseconds.
[snip]
Going back to the OP's posting about random_page_cost, imagine I have 2 
servers identical in every way except the disk drive. Server A has a 10K 
rpm drive and server B has a 15K rpm drive. Seek/settle times aren't 
spectacularly different between the 2 drives. I'm wondering if drive B 
might actually merit a _higher_ random_page_cost than drive A as, once it 
gets settled on a disk track, it can suck the data off a lot faster. 
Opinions/experiences, anyone?

--
Paul Thomas
| Thomas Micro Systems Limited | Software Solutions for the Smaller Business |
| Computer Consultants         | http://www.thomas-micro-systems-ltd.co.uk   |



Re: [PERFORM] Scaling further up

2004-03-02 Thread William Yu
Anjan Dave wrote:
We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9, 
PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives.
 
We are expecting a pretty high load, a few thousand 'concurrent' 
users executing select, insert, or update statements.
The quick and dirty method would be to upgrade to the recently announced 
3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that you'd get 
another +60% there due to the huge L3 hiding the Xeon's shared bus penalty.



Re: [PERFORM] Scaling further up

2004-03-02 Thread Fred Moyer
On Tue, 2004-03-02 at 17:42, William Yu wrote:
 Anjan Dave wrote:
  We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9, 
  PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives.
   
  We are expecting a pretty high load, a few thousand 'concurrent' 
  users executing select, insert, or update statements.
 
 The quick and dirty method would be to upgrade to the recently announced 
 3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that you'd get 
 another +60% there due to the huge L3 hiding the Xeon's shared bus penalty.

If you are going to have thousands of 'concurrent' users you should
seriously consider the 2.6 kernel if you are running Linux, or as an
alternative going with FreeBSD.  You will need to load-test your system
and become an expert on tuning Postgres to get the absolute maximum
performance from each and every query you have.

And you will need lots of hard drives.  By lots I mean dozen(s) in a
RAID 10 array with a good controller.  Thousands of concurrent users
means hundreds or thousands of transactions per second.  I've personally
seen it scale that far, but in my opinion you will need a lot more hard
drives and RAM than CPU.
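
As a rough illustration of why that turns into dozens of spindles (all the
numbers here are assumptions, not measurements):

    # Back-of-the-envelope spindle count for a write-heavy load.
    tps = 1500                  # peak transactions per second (assumed)
    ios_per_txn = 3             # physical I/Os per transaction after caching (assumed)
    iops_per_disk = 150         # one 10K RPM disk doing short random I/Os (assumed)
    raid10_write_penalty = 2    # each write lands on both halves of a mirror

    required_iops = tps * ios_per_txn * raid10_write_penalty
    disks = required_iops / iops_per_disk
    print("~%.0f disks' worth of random I/O capacity" % disks)   # ~60 here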




Re: [PERFORM] Scaling further up

2004-03-02 Thread Anjan Dave
By lots I mean dozen(s) in a raid 10 array with a good controller. 

I believe, for RAID-10, I will need an even number of drives. Currently,
the size of the database is about 13GB, and is not expected to grow
exponentially with thousands of concurrent users, so total space is not
of paramount importance compared to performance.

Does this sound like a reasonable setup?
10x36GB FC drives on RAID-10
4x36GB FC drives for the logs on RAID-10 (not sure if this is the
correct ratio)?
1 hotspare
Total=15 Drives per enclosure.

Tentatively, I am looking at an entry-level EMC CX300 product with 2GB
RAID cache, etc.

Question - Are 73GB drives supposed to give better performance because
of a higher number of platters?

Thanks,
Anjan


-Original Message-
From: Fred Moyer [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 02, 2004 5:57 AM
To: William Yu; Anjan Dave
Cc: [EMAIL PROTECTED]
Subject: Re: [PERFORM] Scaling further up


On Tue, 2004-03-02 at 17:42, William Yu wrote:
 Anjan Dave wrote:
  We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running
  RH9, PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4
  drives.
   
  We are expecting a pretty high load, a few thousand 'concurrent'
  users executing select, insert, or update statements.
 
 The quick and dirty method would be to upgrade to the recently
 announced 3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that
 you'd get another +60% there due to the huge L3 hiding the Xeon's
 shared bus penalty.

If you are going to have thousands of 'concurrent' users you should
seriously consider the 2.6 kernel if you are running Linux or as an
alternative going with FreeBSD.  You will need to load test your system
and become an expert on tuning Postgres to get the absolute maximum
performance from each and every query you have.

And you will need lots of hard drives.  By lots I mean dozen(s) in a
raid 10 array with a good controller.  Thousands of concurrent users
means hundreds or thousands of transactions per second.  I've personally
seen it scale that far but in my opinion you will need a lot more hard
drives and ram than cpu.




Re: [PERFORM] Scaling further up

2004-03-02 Thread scott.marlowe
On Tue, 2 Mar 2004, Anjan Dave wrote:

 By lots I mean dozen(s) in a raid 10 array with a good controller. 
 
 I believe, for RAID-10, I will need an even number of drives.

Correct.

 Currently,
 the size of the database is about 13GB, and is not expected to grow
 exponentially with thousands of concurrent users, so total space is not
 of paramount importance compared to performance.
 
 Does this sound like a reasonable setup?
 10x36GB FC drives on RAID-10
 4x36GB FC drives for the logs on RAID-10 (not sure if this is the
 correct ratio)?
 1 hotspare
 Total=15 Drives per enclosure.

Putting the logs on RAID-10 is likely to be slower than, or no faster than, 
putting them on RAID-1, since the RAID-10 will have to write to 4 drives, 
while the RAID-1 will only have to write to two drives.  Now, if you were 
reading in the logs a lot, it might help to have the RAID-10.

 Tentatively, I am looking at an entry-level EMC CX300 product with 2GB
 RAID cache, etc.

Pick up a spare, I'll get you my home address, etc... :-)

Seriously, that's huge.  At that point you may well find that putting 
EVERYTHING on a big old RAID-5 performs best, since you've got lots of 
caching / write buffering going on.

 Question - Are 73GB drives supposed to give better performance because
 of a higher number of platters?

Generally, larger hard drives perform better than smaller hard drives 
because they a: have more heads and / or b: have a higher areal density.

It's a common misconception that faster RPM drives are a lot faster, when, 
in fact, their only speed advantage is slightly faster seeks.  The areal 
density of faster-spinning hard drives tends to be somewhat less than that 
of the slower-spinning drives, since the maximum frequency the heads can 
work at, assuming the same technology, is the same on both.  I.e. the speed 
at which you can read data off of the platter doesn't usually go up with a 
higher RPM drive, only the speed with which you can get to the first 
sector.




Re: [PERFORM] Scaling further up

2004-03-02 Thread Anjan Dave
That was part of my original question - whether it makes sense to go for
a mid-range SunFire machine (64bit HW, 64bit OS), which is scalable to
high amounts of memory, and shouldn't have any issues addressing it all.
I've had that kind of setup once temporarily on a V480 (quad UltraSparc,
16GB RAM) machine, and it did well in production use. Without having the
time/resources to do extensive testing,  I am not sure if
Postgres/Solaris9 is really suggested by the community for
high-performance, as opposed to a XEON/Linux setup. Storage being a
separate discussion.

Thanks,
Anjan

-Original Message-
From: Chris Ruprecht [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 02, 2004 4:17 PM
To: Anjan Dave; [EMAIL PROTECTED]; William Yu
Cc: [EMAIL PROTECTED]
Subject: Re: [PERFORM] Scaling further up


Hi all,

If you have a DB of 'only' 13 GB and you do not expect it to grow much, it
might be advisable to have enough memory (RAM) to hold the entire DB in
shared memory (everything is cached). If you have a server with, say, 24 GB
of memory and can allocate 20 GB for cache, you don't care about the speed
of disks any more - all you worry about is the speed of your memory and your
network connection.
I believe this is not possible using 32-bit technology; you would have to go
to some 64-bit platform, but if it's speed you want ...
You can also try solid-state hard disk drives. These are actually just
memory; there are no moving parts, but they look and behave like very, very
fast disk drives. I have seen them at capacities of 73 GB - but they didn't
mention the price (I'd probably have a heart attack when I look at the price
tag).

Best regards,
Chris


On Tuesday 02 March 2004 14:41, Anjan Dave wrote:
 By lots I mean dozen(s) in a raid 10 array with a good controller.

 I believe, for RAID-10, I will need an even number of drives. Currently, 
 the size of the database is about 13GB, and is not expected to grow 
 exponentially with thousands of concurrent users, so total space is 
 not of paramount importance compared to performance.

 Does this sound like a reasonable setup?
 10x36GB FC drives on RAID-10
 4x36GB FC drives for the logs on RAID-10 (not sure if this is the
 correct ratio)? 1 hotspare
 Total=15 Drives per enclosure.

 Tentatively, I am looking at an entry-level EMC CX300 product with 2GB
 RAID cache, etc.

 Question - Are 73GB drives supposed to give better performance because
 of a higher number of platters?

 Thanks,
 Anjan


 -Original Message-
 From: Fred Moyer [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, March 02, 2004 5:57 AM
 To: William Yu; Anjan Dave
 Cc: [EMAIL PROTECTED]
 Subject: Re: [PERFORM] Scaling further up

 On Tue, 2004-03-02 at 17:42, William Yu wrote:
  Anjan Dave wrote:
   We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running
   RH9, PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on
   4 drives.
 
   We are expecting a pretty high load, a few thousand 'concurrent'
   users executing select, insert, or update statements.
 
  The quick and dirty method would be to upgrade to the recently
  announced 3GHz Xeon MPs with 4MB of L3. My semi-educated guess is
  that you'd get another +60% there due to the huge L3 hiding the
  Xeon's shared bus penalty.

 If you are going to have thousands of 'concurrent' users you should 
 seriously consider the 2.6 kernel if you are running Linux or as an 
 alternative going with FreeBSD.  You will need to load test your 
 system and become an expert on tuning Postgres to get the absolute 
 maximum performance from each and every query you have.

 And you will need lots of hard drives.  By lots I mean dozen(s) in a 
 raid 10 array with a good controller.  Thousands of concurrent users 
 means hundreds or thousands of transactions per second.  I've 
 personally seen it scale that far but in my opinion you will need a 
 lot more hard drives and ram than cpu.






Re: [PERFORM] Scaling further up

2004-03-02 Thread Anjan Dave
Here's what I recorded today from iostat (Linux, iostat -x -k; sda3 is
the pg slice, logs included) during peak time on the RAID-10 array -
what I see is mostly writes, and sometimes quite a bit of writing,
during which the average wait times shoot up.

Device:     rrqm/s  wrqm/s   r/s     w/s   rsec/s    wsec/s   rkB/s    wkB/s  avgrq-sz  avgqu-sz   await  svctm   %util
/dev/sda3    18.81  113.21  3.90   36.33   181.54   1207.75   90.77   603.88     34.54      0.49    0.73   0.22    0.87
/dev/sda3     0.00  208.00  0.00  150.00     0.00   2884.00    0.00  1442.00     19.23      0.75    0.50   0.33    5.00
/dev/sda3     0.00  239.00  0.00  169.00     0.00   3264.00    0.00  1632.00     19.31      2.15    1.27   0.33    5.50
/dev/sda3     0.00  224.50  0.00  158.00     0.00   3060.00    0.00  1530.00     19.37      1.90    1.20   0.28    4.50
/dev/sda3     0.00  157.00  0.00  117.00     0.00   2192.00    0.00  1096.00     18.74      0.40    0.34   0.30    3.50
/dev/sda3     0.00  249.50  0.00  179.00     0.00   3596.00    0.00  1798.00     20.09     21.40   10.78   0.39    7.00
/dev/sda3     0.00  637.50  0.00  620.50     0.00   9936.00    0.00  4968.00     16.01   1137.15  183.55   1.85  115.00
/dev/sda3     0.00  690.00  0.00  548.50     0.00   9924.00    0.00  4962.00     18.09     43.10    7.82   0.46   25.50
/dev/sda3     0.00  485.00  0.00  392.00     0.00   7028.00    0.00  3514.00     17.93     86.90   22.21   1.14   44.50
/dev/sda3     0.00  312.50  0.00  206.50     0.00   4156.00    0.00  2078.00     20.13      3.50    1.69   0.53   11.00
/dev/sda3     0.00  386.50  0.00  275.50     0.00   5336.00    0.00  2668.00     19.37     16.80    6.10   0.60   16.50
/dev/sda3     0.00  259.00  0.00  176.50     0.00   3492.00    0.00  1746.00     19.78      3.25    1.84   0.40    7.00
/dev/sda3     0.00  196.00  0.00   99.00     0.00   2360.00    0.00  1180.00     23.84      0.10    0.10   0.10    1.00
/dev/sda3     0.00  147.00  0.00  100.00     0.00   1976.00    0.00   988.00     19.76      0.50    0.50   0.45    4.50
/dev/sda3     0.00  126.50  0.00   94.50     0.00   1768.00    0.00   884.00     18.71      0.20    0.21   0.21    2.00
/dev/sda3     0.00  133.50  0.00  106.50     0.00   1920.00    0.00   960.00     18.03      0.50    0.47   0.47    5.00
/dev/sda3     0.00  146.50  0.00  118.00     0.00   2116.00    0.00  1058.00     17.93      0.20    0.21   0.17    2.00
/dev/sda3     0.00  156.00  0.00  128.50     0.00   2276.00    0.00  1138.00     17.71      0.35    0.27   0.27    3.50
/dev/sda3     0.00  145.00  0.00  105.00     0.00   2000.00    0.00  1000.00     19.05      0.25    0.24   0.24    2.50
/dev/sda3     0.00   72.96  0.00   54.51     0.00   1019.74    0.00   509.87     18.71      0.17    0.31   0.31    1.72
/dev/sda3     0.00  168.50  0.00  139.50     0.00   2464.00    0.00  1232.00     17.66      0.65    0.47   0.39    5.50
/dev/sda3     0.00  130.50  0.00  100.00     0.00   1844.00    0.00   922.00     18.44      0.00    0.00   0.00    0.00
/dev/sda3     0.00  122.00  0.00  101.00     0.00   1784.00    0.00   892.00     17.66      0.25    0.25   0.25    2.50
/dev/sda3     0.00  143.00  0.00  121.50     0.00   2116.00    0.00  1058.00     17.42      0.25    0.21   0.21    2.50
/dev/sda3     0.00  134.50  0.00   96.50     0.00   1848.00    0.00   924.00     19.15      0.35    0.36   0.36    3.50
/dev/sda3     0.00  153.50  0.00  115.00     0.00   2148.00    0.00  1074.00     18.68      0.35    0.30   0.30    3.50
/dev/sda3     0.00  101.50  0.00   80.00     0.00   1452.00    0.00   726.00     18.15      0.20    0.25   0.25    2.00
/dev/sda3     0.00  108.50  0.00   92.50     0.00   1608.00    0.00   804.00     17.38      0.25    0.27   0.27    2.50
/dev/sda3     0.00  179.00  0.00  132.50     0.00   2492.00    0.00  1246.00     18.81      0.55    0.42   0.42    5.50
/dev/sda3     1.00  113.00  1.00   83.00    16.00   1568.00    8.00   784.00     18.86      0.15    0.18   0.12    1.00
/dev/sda3     0.00  117.00  0.00   97.50     0.00   1716.00    0.00   858.00     17.60      0.20    0.21   0.21    2.00
/dev/sda3     0.00  541.00  0.00  415.50     0.00   7696.00    0.00  3848.00     18.52    146.50   35.09   1.37   57.00
/dev/sda3     0.00  535.00  0.00  392.50     0.00   7404.00    0.00  3702.00     18.86    123.70   31.67   1.31   51.50
/dev/sda3     0.00  993.50  0.00  697.50     0.00  13544.00    0.00  6772.00     19.42    174.25   24.98   1.25   87.00
/dev/sda3     0.00  245.00  0.00  108.50     0.00   2832.00    0.00  1416.00     26.10      0.55    0.51   0.51    5.50
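
In case it helps, a small Python sketch of how samples like these could be
post-processed to flag the spikes; it assumes iostat -x -k output saved to a
file, with field positions as in the header above:

    # Flag iostat -x -k samples where the write queue backs up on sda3.
    high_await_ms = 10.0

    with open("iostat.log") as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "/dev/sda3":
                continue
            # Columns: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s
            #          avgrq-sz avgqu-sz await svctm %util
            wkb_s = float(fields[8])
            await_ms = float(fields[11])
            util = float(fields[13])
            if await_ms > high_await_ms:
                print("spike: %.0f kB/s written, await %.1f ms, %.0f%% util"
                      % (wkb_s, await_ms, util))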

-Original Message-
From: scott.marlowe [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 02, 2004 4:16 PM
To: Anjan Dave
Cc: [EMAIL PROTECTED]; William Yu; [EMAIL PROTECTED]
Subject: Re: [PERFORM] Scaling further up


On Tue, 2 Mar 2004, Anjan Dave wrote:

 By lots I mean dozen(s) in a raid 10 array with a good controller.
 
 I believe, for RAID-10, I will need an even number of drives.

Correct.

 Currently,
 the size of the database is about 13GB, and is not expected to grow 
 exponentially with thousands of concurrent users, so total space is 
 not of paramount importance compared to performance.
 
 Does this sound like a reasonable setup?
 10x36GB FC drives on RAID-10
 4x36GB FC drives for the logs on RAID-10 (not sure if this is the 
 correct ratio)? 1 hotspare
 Total=15 Drives per enclosure.

Putting the Logs

Re: [PERFORM] Scaling further up

2004-03-02 Thread scott.marlowe
On Tue, 2 Mar 2004, Anjan Dave wrote:

 That was part of my original question - whether it makes sense to go for
 a mid-range SunFire machine (64bit HW, 64bit OS), which is scalable to
 high amounts of memory, and shouldn't have any issues addressing it all.
 I've had that kind of setup once temporarily on a V480 (quad UltraSparc,
 16GB RAM) machine, and it did well in production use. Without having the
 time/resources to do extensive testing,  I am not sure if
 Postgres/Solaris9 is really suggested by the community for
 high-performance, as opposed to a XEON/Linux setup. Storage being a
 separate discussion.

Some folks on the list have experience with Postgresql on Solaris, and 
they generally say they use Solaris not for performance reasons, but for 
reliability reasons.  I.e. the bigger Sun hardware is fault tolerant.

For speed, the X86 32 and 64 bit architectures seem to be noticeably 
faster than Sparc.  However, running Linux or BSD on Sparc makes them 
pretty fast too, but you lose the fault-tolerant support for things like 
hot-swappable CPUs or memory.





Re: [PERFORM] Scaling further up

2004-03-02 Thread johnnnnnn
On Tue, Mar 02, 2004 at 02:16:24PM -0700, scott.marlowe wrote:
 It's a common misconception that faster RPM drives are a lot faster,
 when, in fact, their only speed advantage is slightly faster seeks.
 The areal density of faster spinning hard drives tends to be
 somewhat less than the slower spinning drives, since the maximum
 frequency the heads can work in on both drives, assuming the same
 technology, is the same.  I.e. the speed at which you can read data
 off of the platter doesn't usually go up with a higher RPM drive,
 only the speed with which you can get to the first sector.

This would imply that an upgrade in drive RPM should be accompanied by
a decrease in random_page_cost, correct?

random_page_cost should be set with the following things taken into
account:
  - seek speed
  - likelihood of page to be cached in memory by the kernel
  - anything else?


Sorry, I realize this pulls the thread a bit off-topic, but I've heard
that about RPM speeds before, and I just want some confirmation that
my deductions are reasonable.
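
One crude way to combine those first two factors - purely illustrative, and
not how the planner or the docs derive the setting:

    # Very rough effective random_page_cost estimate.
    raw_random_vs_seq = 10.0   # uncached random read vs. sequential read cost (assumed)
    cache_hit_fraction = 0.9   # chance the page is already in the kernel cache (assumed)

    effective_cost = (cache_hit_fraction * 1.0
                      + (1 - cache_hit_fraction) * raw_random_vs_seq)
    print("effective random_page_cost ~ %.1f" % effective_cost)   # ~1.9 here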

-johnnn



Re: [PERFORM] Scaling further up

2004-03-02 Thread Jeff
On Mar 2, 2004, at 5:36 PM, scott.marlowe wrote:

Some folks on the list have experience with Postgresql on Solaris, and
they generally say they use Solaris not for performance reasons, but for
reliability reasons.  I.e. the bigger Sun hardware is fault tolerant.

Solaris isn't nearly as bad for PG as it used to be.

But as you say - the #1 reason to use Sun is reliability. (In my case, 
it was because we had a giant Sun lying around :)

I'm trying to remember exactly what happens... but I know on Sun, if it 
has a severe memory error it kills off processes with data on that DIMM 
(since it has no idea whether the data is bad or not; thanks to ECC this 
is very rare, but it can happen). I want to say that if a CPU dies, any 
processes running on it at that moment are also killed, but the more I 
think about it, the more I don't think that is the case.

As for x86: if RAM or a CPU goes bad, you're SOL.

Although Opterons are sexy, you need to remember they really are brand 
new CPUs - I'm sure AMD has done tons of testing, but Sun UltraSPARCs 
have been in every situation conceivable in production.  If you are 
really going to have thousands of users, you probably want to bet the 
farm on something proven.

lots and lots of spindles
lots and lots of RAM
You may also want to look into a replication solution as a hot backup.

--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/


Re: [PERFORM] Scaling further up

2004-03-02 Thread Rod Taylor
On Tue, 2004-03-02 at 18:24, Anjan Dave wrote:
 Can you describe the vendors/components of a cheap SAN setup?

heh.. Excellent point.

My point was that you can get away with a smaller setup (number of
disks) if it doesn't have to handle many reads and the writes are not
time-dependent, than you would need if you attempt to pull 500MB/sec off
the disks.

If it is foreseeable that the database can be held in RAM, that is a
much easier and cheaper way to get high I/O than with physical disks.





Re: [PERFORM] Scaling further up

2004-03-01 Thread Josh Berkus
Anjan,

 Other than the disks, I am curious what other people are using in terms
 of the horsepower needed. The Quad server has been keeping up, but we
 are expecting quite high loads in the near future, and I am not sure if
 just by having the disks on a high-end storage will do it.

Do a performance analysis of RH9.  My experience with RH on Xeon has been 
quite discouraging lately, and I've been recommending swapping stock kernels 
for the RH kernel.

Of course, if this is RHES, rather than the standard distribution, then test 
and talk to RH instead.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco

